Wait, ignore my last email. The issue is on the Solr side!

On Thu, Apr 10, 2014 at 1:13 PM, Xavier Morera <xav...@familiamorera.com> wrote:

> Thanks Julien and Sebastian. I tried that and got the exception below. Is
> there a way to get more detail on the exception so that I can continue
> troubleshooting? I am getting really, really close! I have also attached
> the full output.
>
> This is the exception, but with no additional info:
> Indexer: java.io.IOException: Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
>         at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
>         at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>         at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
>
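> From what I have read, the real cause usually ends up in logs/hadoop.log
> rather than on the console, so I will check there too; a rough sketch of
> what I mean (the path assumes the default local Nutch 1.x layout):
>
> # show the tail of the log where the underlying exception is written
> tail -n 100 logs/hadoop.log
> # or jump straight to the first error
> grep -n "ERROR" logs/hadoop.log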
>
> Also, I found this, which means that something is actually happening:
> Indexing 20140410124128 on SOLR index -> http://localhost:8983/solr
> cygpath: can't convert empty path
> Indexer: starting at 2014-04-10 12:41:42
> Indexer: deleting gone documents: false
> Indexer: URL filtering: false
> Indexer: URL normalizing: false
> Active IndexWriters :
> SOLRIndexWriter
>         solr.server.url : URL of the SOLR instance (mandatory)
>         solr.commit.size : buffer size when sending to SOLR (default 1000)
>         solr.mapping.file : name of the mapping file for fields (default
> solrindex-mapping.xml)
>         solr.auth : use authentication (default false)
>         solr.auth.username : username for authentication
>         solr.auth.password : password for authentication
>
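> Since solr.server.url is listed as mandatory above, here is a minimal
> sketch of how those SOLRIndexWriter properties could be set in
> nutch-site.xml (the values are illustrative assumptions, not what I
> actually have configured):
>
> <!-- sketch only: the Solr endpoint the indexer should write to -->
> <property>
>   <name>solr.server.url</name>
>   <value>http://localhost:8983/solr</value>
> </property>
> <!-- optional: how many documents to buffer before sending to Solr -->
> <property>
>   <name>solr.commit.size</name>
>   <value>1000</value>
> </property>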
>
> My full nutch-site.xml is:
>
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> <!-- Put site-specific property overrides in this file. -->
> <configuration>
>   <property>
>     <name>http.agent.name</name>
>     <value>nutch-solr-integration</value>
>   </property>
>   <property>
>     <name>generate.max.per.host</name>
>     <value>100</value>
>   </property>
>   <property>
>     <name>plugin.includes</name>
>     <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>   </property>
>   <property>
>     <name>fs.file.impl</name>
>     <value>com.conga.services.hadoop.patch.HADOOP_7682.WinLocalFileSystem</value>
>     <description>Enables patch for issue HADOOP-7682 on Windows</description>
>   </property>
> </configuration>
>
> And for urls/site.txt I have
> http://www.trenurbano.co.cr
>
> And in regex-urlfilter.txt I have
> +^http://([a-z0-9]*\.)*trenurbano\.co\.cr/
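>
> For completeness, a fuller regex-urlfilter.txt normally keeps the stock
> skip rules ahead of the accept rule; a minimal sketch (the skip patterns
> are assumptions based on the default file, not my exact copy):
>
> # skip image and other binary suffixes we cannot parse
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|zip|ppt|mpg|xls|gz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
> # skip URLs containing certain characters, likely crawler traps
> -[?*!@=]
> # accept everything under the target domain
> +^http://([a-z0-9]*\.)*trenurbano\.co\.cr/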
>
> Thanks in advance,
> Xavier
>
>
>
> On Thu, Apr 10, 2014 at 12:35 PM, Julien Nioche <
> lists.digitalpeb...@gmail.com> wrote:
>
>> Hi Xavier
>>
>> Your config file looks a bit outdated. Here are the values set by default
>> (see http://svn.apache.org/repos/asf/nutch/trunk/conf/nutch-default.xml)
>>
>> <property>
>>   <name>plugin.includes</name>
>>   <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|*indexer-solr*|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>> </property>
>>
>> Your problem comes from the fact that you are missing indexer-solr.
>>
>> You should not need
>> *query-(basic|site|url)|response-(json|xml)|summary-basic* as they date
>> back to times immemorial when we used to manage the indexing and search
>> ourselves.
>>
>> HTH
>>
>> Julien
>>
>>
>> On 10 April 2014 18:05, Xavier Morera <xav...@familiamorera.com> wrote:
>>
>>> Hi,
>>>
>>> I have followed several Nutch tutorials - including the main one,
>>> http://wiki.apache.org/nutch/NutchTutorial - to crawl sites. The crawl
>>> itself works: I can see in the console as the pages get crawled and the
>>> directories get built with the data. But for the life of me I can't get
>>> anything posted to Solr. The Solr console shows no activity at all, so
>>> Nutch is apparently not sending anything.
>>>
>>> This is the command I run, which crawls and in theory should also post
>>> to Solr:
>>> bin/crawl urls/seed.txt TestCrawl http://localhost:8983/solr 2
>>>
>>> I also found that I can use this one once the crawl has already run:
>>> bin/nutch solrindex http://localhost:8983/solr crawl/crawldb
>>> crawl/linkdb crawl/segments/*
>>>
>>> But no luck.
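>>>
>>> For what it's worth, I understand the crawl script just wraps the
>>> individual Nutch steps; a rough sketch of the equivalent commands from
>>> the tutorial (the segment name 20140410124128 is an assumption - the
>>> real one is whatever generate created under crawl/segments):
>>>
>>> bin/nutch inject crawl/crawldb urls
>>> bin/nutch generate crawl/crawldb crawl/segments
>>> bin/nutch fetch crawl/segments/20140410124128
>>> bin/nutch parse crawl/segments/20140410124128
>>> bin/nutch updatedb crawl/crawldb crawl/segments/20140410124128
>>> bin/nutch invertlinks crawl/linkdb -dir crawl/segments
>>> bin/nutch solrindex http://localhost:8983/solr crawl/crawldb crawl/linkdb crawl/segments/20140410124128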
>>>
>>> This is the only message that caught my attention. I read that adding
>>> the property below would fix it, but it doesn't:
>>> *No IndexWriters activated - check your configuration*
>>>
>>> This is the property:
>>> <property>
>>>   <name>plugin.includes</name>
>>>   <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>>> </property>
>>>
>>> Any ideas? This is Apache Nutch 1.8 running on Java 1.6 via Cygwin on
>>> Windows.
>>>
>>> --
>>> *Xavier Morera*
>>> email: xav...@familiamorera.com
>>> CR: +(506) 8849 8866
>>> US: +1 (305) 600 4919
>>> skype: xmorera
>>>
>>>
>>
>>
>>
>> --
>>
>> Open Source Solutions for Text Engineering
>>
>> http://digitalpebble.blogspot.com/
>> http://www.digitalpebble.com
>> http://twitter.com/digitalpebble
>>
>
>
>
> --
> *Xavier Morera*
> email: xav...@familiamorera.com
> CR: +(506) 8849 8866
> US: +1 (305) 600 4919
> skype: xmorera
>



-- 
*Xavier Morera*
email: xav...@familiamorera.com
CR: +(506) 8849 8866
US: +1 (305) 600 4919
skype: xmorera
