Hi,

In the plugin.includes value, change urlfilter-regex to urlfilter-(crawl|regex).
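
As a rough sketch, the edited property might look like the fragment below. The value is copied from the plugin.includes you quoted, with only the urlfilter entry changed. Two things here are assumptions on my part: that your Nutch release actually ships a plugin directory matching urlfilter-crawl (worth checking under plugins/), and that you put the override in conf/nutch-site.xml rather than editing conf/nutch-default.xml directly.

```xml
<!-- Sketch only: override placed in conf/nutch-site.xml (assumed location). -->
<!-- Assumption: a plugin directory matching urlfilter-crawl exists in your release. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-(crawl|regex)|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

Keep the value on a single line; a line break inside the regular expression may stop it from matching the plugin names.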

bhupal


Barry Haddow wrote:
> 
> Hi Bhupal
> 
> The plugin.includes is below - I haven't changed it at all. What should it
> be?
> 
> thanks and regards,
> Barry
> 
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>   <description>Regular expression naming plugin directory names to
>   include.  Any plugin not matching this expression is excluded.
>   In any case you need at least include the nutch-extensionpoints plugin. By
>   default Nutch includes crawling just HTML and plain text via HTTP,
>   and basic indexing and search plugins. In order to use HTTPS please enable
>   protocol-httpclient, but be aware of possible intermittent problems with the
>   underlying commons-httpclient library.
>   </description>
> </property>
> 
> 
> On Tuesday 29 January 2008, bhupal wrote:
>> Hi,
>>
>> Look at your conf/nutch-default.xml.
>> I think you have not added the crawl-urlfilter plugin to the
>> plugin.includes property.
>>
>> bhupal.
>>
>> Barry Haddow wrote:
>> > Hi
>> >
>> > I'm trying to get the nutch/hadoop example from
>> > http://wiki.apache.org/nutch/NutchHadoopTutorial
>> > running.
>> >
>> > I've set up the urllist.txt and the crawl-urlfilter.xml as instructed in
>> > the tutorial, but whenever I run the crawl it either reports
>> >
>> > Generator: 0 records selected for fetching, exiting ...
>> > Stopping at depth=1 - no more URLs to fetch.
>> >
>> > or
>> >
>> > Generator: 0 records selected for fetching, exiting ...
>> > Stopping at depth=0 - no more URLs to fetch.
>> >
>> >
>> > I can't tell if the crawler has managed to fetch any data. How can I
>> > extract whatever data it has downloaded?
>> >
>> > thanks,
>> > Barry
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Simple-crawl-fails-to-find-any-URLs-tp15143487p15156232.html
Sent from the Nutch - User mailing list archive at Nabble.com.
