This is the problem!
Because in my root there is a url!

I write you my step-by-step configuration of nutch:
(I use Cygwin because I work on Windows)

*1. Extract the Nutch package*

*2. Configure Solr*
*a. Copy the provided Nutch schema from directory apache-nutch-1.0/conf to
directory apache-solr-1.3.0/example/solr/conf (override the existing file).*

To allow Solr to create the snippets for search results we need to
store the content in addition to indexing it:

*b. Change schema.xml so that the stored attribute of field “content” is
true.*

*<field name="content" type="text" stored="true" indexed="true"/>*

We want to be able to tweak the relevancy of queries easily, so we'll create
a new dismax request handler configuration for our use case:

*d. Open apache-solr-1.3.0/example/solr/conf/solrconfig.xml and paste
following fragment to it*

<requestHandler name="/nutch" class="solr.SearchHandler" >

<lst name="defaults">

<str name="defType">dismax</str>

<str name="echoParams">explicit</str>

<float name="tie">0.01</float>

<str name="qf">

content^0.5 anchor^1.0 title^1.2

</str>

<str name="pf">

content^0.5 anchor^1.5 title^1.2 site^1.5

</str>

<str name="fl">

url

</str>

<str name="mm">

2&lt;-1 5&lt;-2 6&lt;90%

</str>

<int name="ps">100</int>

<bool name="hl">true</bool>

<str name="q.alt">*:*</str>

<str name="hl.fl">title url content</str>

<str name="f.title.hl.fragsize">0</str>

<str name="f.title.hl.alternateField">title</str>

<str name="f.url.hl.fragsize">0</str>

<str name="f.url.hl.alternateField">url</str>

<str name="f.content.hl.fragmenter">regex</str>

</lst>

</requestHandler>

*3. Start Solr*

cd apache-solr-1.3.0/example

java -jar start.jar

*4. Configure Nutch*

*a. Open nutch-site.xml in directory apache-nutch-1.0/conf and replace its
contents with the following (we specify our crawler name, the active plugins,
and limit the maximum url count for a single host per run to 100):*

<?xml version="1.0"?>

<configuration>

<property>

<name>http.agent.name</name>

<value>nutch-solr-integration</value>

</property>

<property>

<name>generate.max.per.host</name>

<value>100</value>

</property>

<property>

<name>plugin.includes</name>

<value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>

</property>

</configuration>
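A quick sanity check I would add here (my own suggestion, not part of the tutorial): Nutch aborts the fetch when http.agent.name has no value, so it is worth confirming the setting actually landed in the saved file. The path is assumed from the steps above:

```shell
# Hypothetical sanity check: confirm the agent name was saved,
# since Nutch refuses to fetch with an empty http.agent.name.
grep -A1 "http.agent.name" apache-nutch-1.0/conf/nutch-site.xml
```

If the configuration was saved correctly, this prints the `<name>` line followed by the `<value>nutch-solr-integration</value>` line.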

*b. Open regex-urlfilter.txt in directory apache-nutch-1.0/conf and replace
its content with the following:*

-^(https|telnet|file|ftp|mailto):



# skip some suffixes

-\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$



# skip URLs containing certain characters as probable queries, etc.

-[?*!@=]



# allow urls in the google.it domain

+^http://([a-z0-9A-Z-]*\.)*google\.it/



# deny anything *else*

-.
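The allow rule above can be smoke-tested outside of Nutch before starting a crawl. This is just a sketch of my own using grep -E (extended regexes, close enough to Java's regex syntax for a pattern this simple), not a Nutch command:

```shell
# Test the allow rule locally with grep -E. The pattern below is the
# corrected allow rule for the google.it domain from the filter file.
allow='^http://([a-z0-9A-Z-]*\.)*google\.it/'
echo "http://www.google.it/"   | grep -Eq "$allow" && echo "allowed"
echo "http://www.example.com/" | grep -Eq "$allow" || echo "rejected"
```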

*5. Create a seed list (the initial urls to fetch)*

mkdir urls *(creates a folder named 'urls')*

echo "http://www.google.it/"; > urls/seed.txt

*6. Inject seed url(s) to nutch crawldb (execute in nutch directory)*

bin/nutch inject crawl/crawldb urls
AND HERE I get the error message about an empty path. Why, in your opinion?
Thank you,
Alessio

On 24 February 2012 at 17:51, tamanjit.bin...@yahoo.co.in <
tamanjit.bin...@yahoo.co.in> wrote:

> The empty path message is because Nutch is unable to find a url in the url
> location that you provide.
>
> Kindly ensure there is a url there.
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/nutch-and-solr-tp3765166p3773089.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
