[jira] Commented: (NUTCH-442) Integrate Solr/Nutch

Vladimir Garvardt (JIRA) Sat, 21 Jun 2008 06:33:16 -0700

    [ https://issues.apache.org/jira/browse/NUTCH-442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12607005#action_12607005 ]


Vladimir Garvardt commented on NUTCH-442:
-----------------------------------------

Hello.

I'm trying to apply this patch and have run into a problem that I cannot
solve by myself.

I checked out Nutch trunk (rev 670194), downloaded the attachments from this
issue, and started patching.
First I applied Crawl.patch, then Indexer.patch, and then NUTCH-442_v5.patch.
Applying the last patch produced a warning message, caused by a conflict
between Crawl.patch and NUTCH-442_v5.patch.

Crawl.patch makes the following change:
       // index, dedup & merge
+      indexer.index(indexes, solrUrl, crawlDb, linkDb,
+          Arrays.asList(fs.listPaths(segments, HadoopFSUtil.getPassAllFilter())));

and NUTCH-442_v5.patch makes the following change:
       // index, dedup & merge
-      indexer.index(indexes, crawlDb, linkDb, fs.listPaths(segments, HadoopFSUtil.getPassAllFilter()));
+      indexer.index(indexes, null, crawlDb, linkDb,
+          Arrays.asList(fs.listPaths(segments, HadoopFSUtil.getPassAllFilter())));


The main difference between these patches is the second parameter.
First I tried building Nutch with the second parameter set to null: crawling
finished successfully, but no data was added to Solr.
Then I changed the second parameter to solrUrl and rebuilt Nutch. During
indexing the following exception was thrown and indexing failed (no data in
Solr):
Indexer: starting
Indexer: crawldb: crawl/crawldb
Indexer: linkdb: crawl/linkdb
Indexer: solrUrl: http://localhost:8984/solr/
Indexer: adding segment: file:/home/vladimirga/Documents/dev/src/lucene-src/nutch-2008-06-21/wrk-01/crawl/segments/20080621200352
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:894)
        at org.apache.nutch.indexer.Indexer.index(Indexer.java:318)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:148)
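
To spell out the difference between the two variants of the call, here is a
minimal, self-contained sketch. It uses a hypothetical stand-in class, not the
real org.apache.nutch.indexer.Indexer, and it assumes (based only on the
behaviour described above) that a null second argument means nothing is sent
to Solr. The "crawl/indexes" path is also just an illustrative value.

```java
import java.util.Arrays;
import java.util.List;

public class IndexCallSketch {

    // Hypothetical stand-in for the patched index(...) call shape.
    // Assumption: a null solrUrl means no documents are sent to Solr,
    // which would match the empty Solr index seen with the null variant.
    static String index(String indexes, String solrUrl, String crawlDb,
                        String linkDb, List<String> segments) {
        if (solrUrl == null) {
            return "local index only: " + indexes;
        }
        return "solr: " + solrUrl + " (" + segments.size() + " segment(s))";
    }

    public static void main(String[] args) {
        List<String> segments =
            Arrays.asList("crawl/segments/20080621200352");

        // Variant from NUTCH-442_v5.patch: second argument is null.
        System.out.println(index("crawl/indexes", null,
            "crawl/crawldb", "crawl/linkdb", segments));

        // Variant from Crawl.patch: second argument is the Solr URL.
        System.out.println(index("crawl/indexes", "http://localhost:8984/solr/",
            "crawl/crawldb", "crawl/linkdb", segments));
    }
}
```

The sketch only illustrates which argument changes between the two patches; it
says nothing about why the real job fails when the Solr URL is passed.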

What could cause this problem, and how can I fix it so that Nutch indexes into
Solr?

Thanks.

> Integrate Solr/Nutch
> --------------------
>
>                 Key: NUTCH-442
>                 URL: https://issues.apache.org/jira/browse/NUTCH-442
>             Project: Nutch
>          Issue Type: New Feature
>         Environment: Ubuntu linux
>            Reporter: rubdabadub
>         Attachments: Crawl.patch, Indexer.patch, NUTCH-442_v4.patch, 
> NUTCH-442_v5.patch, NUTCH_442_v3.patch, RFC_multiple_search_backends.patch, 
> schema.xml
>
>
> Hi:
> I tried out Sami's patch for Solr/Nutch integration, which can be found here
> (http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html),
> and I can confirm it worked :-) That led me to request the following:
> I would be very grateful if this could be included in Nutch 0.9, as I am
> trying to eliminate my Python-based crawler, which posts documents to Solr.
> As I am in a corporate environment, I can't install the trunk version in
> production, so I am asking for this to be included in the 0.9 release. I
> hope my wish will be granted.
> I look forward to getting some feedback.
> Thank you.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
