[jira] [Commented] (NUTCH-1100) SolrDedup broken

Luca Cavanna (JIRA) Fri, 31 Aug 2012 06:45:11 -0700

    [ https://issues.apache.org/jira/browse/NUTCH-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13445930#comment-13445930 ]


Luca Cavanna commented on NUTCH-1100:
-------------------------------------

I agree, it would make even more sense to filter the query like this:
digest:[* TO *]
This way Nutch wouldn't even iterate over documents that have no value for
the digest field.
Unfortunately this problem is pretty common: it happens whenever Solr holds
documents that don't come from Nutch alongside the crawled documents.
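
To make the suggestion concrete, here is a minimal sketch in plain Java. It
is not the Nutch or SolrJ code: the class name, the helper methods and the
sample documents are hypothetical, and Solr documents are modelled as simple
maps so the example has no dependencies. It shows the filter string from the
comment above plus an equivalent client-side guard that skips documents
without a digest, which is presumably the kind of document that ends up
passing null to Text.set in the stack trace quoted below.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DigestFilterSketch {
    // Field name taken from the comment above.
    static final String DIGEST_FIELD = "digest";

    // Open-ended range query: matches only documents that have
    // some value in the digest field.
    static String digestFilterQuery() {
        return DIGEST_FIELD + ":[* TO *]";
    }

    // Client-side fallback (assumption, not the actual patch):
    // drop documents whose digest is missing instead of processing them.
    static List<Map<String, Object>> withDigest(List<Map<String, Object>> docs) {
        List<Map<String, Object>> kept = new ArrayList<>();
        for (Map<String, Object> doc : docs) {
            if (doc.get(DIGEST_FIELD) != null) {
                kept.add(doc);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // A crawled document with a digest (made-up values).
        Map<String, Object> crawled = new HashMap<>();
        crawled.put("id", "http://example.org/");
        crawled.put(DIGEST_FIELD, "d41d8cd98f00b204e9800998ecf8427e");

        // A document indexed by something other than Nutch: no digest.
        Map<String, Object> foreign = new HashMap<>();
        foreign.put("id", "manual-doc-1");

        List<Map<String, Object>> docs = new ArrayList<>();
        docs.add(crawled);
        docs.add(foreign);

        System.out.println("fq=" + digestFilterQuery());
        System.out.println("kept=" + withDigest(docs).size());
    }
}
```

With SolrJ the string would typically be attached as a filter query (for
example through SolrQuery.addFilterQuery), so that Solr itself never returns
the digest-less documents; that call is left out here to keep the sketch
self-contained.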
                
> SolrDedup broken
> ----------------
>
>                 Key: NUTCH-1100
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1100
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.4
>            Reporter: Markus Jelsma
>             Fix For: 1.6
>
>         Attachments: NUTCH-1100-1.6-1.patch
>
>
> Some Solr indices cannot be deduplicated from Nutch. For unknown reasons, 
> Nutch throws the exception below. There are no peculiarities to be found 
> in the Solr logs; the queries are normal and seem to succeed.
> {code}
> java.lang.NullPointerException
>         at org.apache.hadoop.io.Text.encode(Text.java:388)
>         at org.apache.hadoop.io.Text.set(Text.java:178)
>         at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:272)
>         at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:243)
>         at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
>         at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>         at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
