Re: Solr frontend for nutch schema

2011-07-27 Thread Way Cool
I customized Solr browse GUI for Nutch based on Solr 3.3. Here is the link to the war file I created as well as instructions: http://thetechietutorials.blogspot.com/2011/07/customized-solr-browser-interface-for.html Have fun! On 7/20/11, Markus Jelsma markus.jel...@openindex.io wrote: There are

Re: nutch 1.3 + solr server

2011-07-27 Thread Way Cool
As promised, I customized Solr browse GUI for Nutch based on Solr 3.3. Here is the link to the war file I created as well as instructions: http://thetechietutorials.blogspot.com/2011/07/customized-solr-browser-interface-for.html Have fun with Nutch and Solr! On 7/26/11, Geek Gamer

HtmlParser performance

2011-07-27 Thread Cam Bazz
Hello, I am modifying htmlparser for my own purposes. After lots of coding and testing, I pretty much know what to do. I was wondering, if we were to use, let's say, the lingpipe library to do some named entity recognition at the parse stage. Many libraries such as lingpipe, but not limited to lingpipe have some

pages that load with javascript

2011-07-27 Thread Cam Bazz
Hello, Looking at my crawler output, I noticed that some pages are not captured, because they do some sort of js loading on pageLoad() - these are not per se - let's say an ajax request to get some json, and render it within the DOM with js - however these are XHR calls that return plain html. Could

Re: Storage of data between crawls

2011-07-27 Thread lewis john mcgibbney
Hi Alexander, I don't want to state the obvious here but this will depend directly on what type of loading your Nutch implementation deals with... You are correct in stating that we store data in segments, namely /crawl_fetch /content /crawl_parse /parse_data /crawl_generate /parse_text I

Re: Nutch not indexing full collection

2011-07-27 Thread lewis john mcgibbney
Has this been solved? If your http.content.limit has not been increased in nutch-site.xml then you will not be able to store this data and index with Solr. On Mon, Jul 25, 2011 at 6:18 PM, Chip Calhoun ccalh...@aip.org wrote: I'm still having trouble. I've set a Windows environment variable,

Re: TF in wide internet crawls

2011-07-27 Thread lewis john mcgibbney
Hi Markus, I am getting you until the last parts of your comments. cope with non-edited... edited by whom? and for what purpose? To give a better relative tf score... To comment on the first part, and please ignore or correct me if I am wrong, but do we not give each page and therefore each

Re: solrindex command not working

2011-07-27 Thread Way Cool
To be honest, I am not a Nutch guru. If I were you, I would run solrindex for each segment one by one for now (or write a script to automate that). Solr will combine results for each segment. I tested that before. You have to test it yourself because your system is in production. :-) Down the
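[Editor's sketch] The "write a script to automate that" suggestion above could look roughly like the following. This is only an illustration, not the poster's script: the function names are made up for this example, and the argument order (Solr URL, crawldb, linkdb, segment) is an assumption about the Nutch 1.3 solrindex command that should be checked against the usage message printed by your own bin/nutch. The default is a dry run that only prints the commands.

```python
import subprocess
from pathlib import Path
from typing import List


def build_solrindex_commands(nutch_bin: str, solr_url: str, crawldb: str,
                             linkdb: str, segments_dir: str) -> List[List[str]]:
    """Build one solrindex command per segment directory (hypothetical helper).

    ASSUMPTION: the argument order below matches Nutch 1.3; verify it
    against the usage text of your own installation before running.
    """
    # Segment directories are named by timestamp, so sorting by name
    # indexes them from oldest to newest.
    segments = sorted(p for p in Path(segments_dir).iterdir() if p.is_dir())
    return [[nutch_bin, "solrindex", solr_url, crawldb, linkdb, str(seg)]
            for seg in segments]


def index_segments(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command (dry run) or execute it, stopping on the first failure."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
```

Calling build_solrindex_commands with your own paths and passing the result to index_segments prints the commands first; switching dry_run to False executes them one segment at a time, which matches the advice to test carefully on a production system.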

Re: plugin build.xml file

2011-07-27 Thread lewis john mcgibbney
Hi Cheng Li, Please experiment with this. We have been gradually getting the pluginCentral section of the wiki updated as it needed a total face lift, so we would appreciate any additional input you may have for updating the writing Plugin example which is already there. Apart from being completely out

Re: Limit Nutch memory usage

2011-07-27 Thread lewis john mcgibbney
Hi Marseld, I'm just putting my thoughts out here, however Hadoop is not shipped with Nutch 1.3 anymore, therefore I don't know where you would set this specific property within your Nutch instances... How are you running Hadoop, what version of Nutch, and what mode are you running Nutch in? On Tue,

Re: How to perform a search in Nutch

2011-07-27 Thread Way Cool
You can run jetty (for example, mvn jetty:run, or java -jar start.jar for Solr). :-) On Sun, Jul 24, 2011 at 5:12 AM, Markus Jelsma markus.jel...@openindex.io wrote: You need Solr for indexing. Hi Everyone I have Nutch-Gora-Hbase configuration and I've crawled some urls. I want to perform

Re: How to use lucene to index Nutch 1.3 data

2011-07-27 Thread Way Cool
And Solr is generating a Lucene index anyway. 2011/7/19 lewis john mcgibbney lewis.mcgibb...@gmail.com Hi Kelvin, I see you are posting on a couple of threads with regards to the Lucene index generated by Nutch which you correctly point out is not there. It is not possible to create a Lucene