This is the problem! Because in my root there is a URL! Here is my step-by-step Nutch configuration (I use Cygwin because I work on Windows):

*1. Extract the Nutch package*

*2. Configure Solr*

*a. Copy the provided Nutch schema from directory apache-nutch-1.0/conf to directory apache-solr-1.3.0/example/solr/conf (overriding the existing file).*

We want to allow Solr to create the snippets for search results, so we need to store the content in addition to indexing it:

*b. Change schema.xml so that the stored attribute of field "content" is true.*

<field name="content" type="text" stored="true" indexed="true"/>

We want to be able to tweak the relevancy of queries easily, so we'll create a new dismax request handler configuration for our use case:

*d. Open apache-solr-1.3.0/example/solr/conf/solrconfig.xml and paste the following fragment into it*

<requestHandler name="/nutch" class="solr.SearchHandler" >
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf">
      content^0.5 anchor^1.0 title^1.2
    </str>
    <str name="pf">
      content^0.5 anchor^1.5 title^1.2 site^1.5
    </str>
    <str name="fl">
      url
    </str>
    <str name="mm">
      2&lt;-1 5&lt;-2 6&lt;90%
    </str>
    <int name="ps">100</int>
    <bool name="hl">true</bool>
    <str name="q.alt">*:*</str>
    <str name="hl.fl">title url content</str>
    <str name="f.title.hl.fragsize">0</str>
    <str name="f.title.hl.alternateField">title</str>
    <str name="f.url.hl.fragsize">0</str>
    <str name="f.url.hl.alternateField">url</str>
    <str name="f.content.hl.fragmenter">regex</str>
  </lst>
</requestHandler>

*3. Start Solr*

cd apache-solr-1.3.0/example
java -jar start.jar

*4. Configure Nutch*

*a. Open nutch-site.xml in directory apache-nutch-1.0/conf and replace its contents with the following (we specify our crawler name and the active plugins, and limit the maximum URL count for a single host per run to 100):*

<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>nutch-solr-integration</value>
  </property>
  <property>
    <name>generate.max.per.host</name>
    <value>100</value>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>

*b. Open regex-urlfilter.txt in directory apache-nutch-1.0/conf and replace its contents with the following:*

-^(https|telnet|file|ftp|mailto):
# skip some suffixes
-\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# allow urls in google.it domain
+^http:*//([a-z0-9\-A-Z]*\.)*google.it/*
# deny anything *else*
-.

*5. Create a seed list (the initial urls to fetch)*

mkdir urls *(creates a folder 'urls')*
echo "http://www.google.it/" > urls/seed.txt

*6. Inject seed url(s) to nutch crawldb (execute in nutch directory)*

bin/nutch inject crawl/crawldb urls

And here I get the error message about the empty path. Why, in your opinion?

Thank you,
Alessio

On 24 February 2012 at 17:51, tamanjit.bin...@yahoo.co.in <tamanjit.bin...@yahoo.co.in> wrote:

> The empty path message is because Nutch is unable to find a URL in the URL
> location that you provide.
>
> Kindly ensure there is a URL there.
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/nutch-and-solr-tp3765166p3773089.html
> Sent from the Solr - User mailing list archive at Nabble.com.
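
The check suggested in the reply (that the seed location really contains a URL) can be scripted before running the inject step. What follows is only a minimal sketch, assuming the `urls` directory and `urls/seed.txt` layout from step 5; `count_seed_urls` and `SEED_DIR` are hypothetical names introduced for illustration, not part of Nutch, and the check simply counts lines that begin with `http://`.

```shell
#!/bin/sh
# Sketch only: count candidate seed URLs in a directory before `bin/nutch inject`.
# SEED_DIR defaults to "urls" (assumption: the directory created in step 5).
SEED_DIR="${1:-urls}"

# Hypothetical helper: prints the number of lines starting with "http://"
# found across all files directly inside the given directory.
count_seed_urls() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        # A missing directory means there is nothing to inject.
        echo 0
        return 0
    fi
    # grep -c prints 0 and exits non-zero when nothing matches; `|| true`
    # keeps the helper's exit status at 0 either way.
    cat "$dir"/* 2>/dev/null | grep -c '^http://' || true
}

n=$(count_seed_urls "$SEED_DIR")
echo "seed URLs found in $SEED_DIR: $n"
```

Run from the Nutch directory, a count of 0 would suggest that the path passed to `bin/nutch inject` is not the directory that holds `seed.txt`, or that the file is empty.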