[jira] [Commented] (NUTCH-1100) SolrDedup broken
[ https://issues.apache.org/jira/browse/NUTCH-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13819658#comment-13819658 ]

Hudson commented on NUTCH-1100:
-------------------------------

SUCCESS: Integrated in Nutch-trunk #2419 (See [https://builds.apache.org/job/Nutch-trunk/2419/])
NUTCH-1100 avoid NPE in SOLRDedup (jnioche: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1540758)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrDeleteDuplicates.java

> SolrDedup broken
> ----------------
>
>                 Key: NUTCH-1100
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1100
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.4
>            Reporter: Markus Jelsma
>             Fix For: 1.9
>
>         Attachments: NUTCH-1100-1.6-1.patch
>
>
> Some Solr indices are unable to be deduped from Nutch. For unknown reasons
> Nutch will throw the exception below. There are no peculiarities to be found
> in the Solr logs, the queries are normal and seem to succeed.
> {code}
> java.lang.NullPointerException
>         at org.apache.hadoop.io.Text.encode(Text.java:388)
>         at org.apache.hadoop.io.Text.set(Text.java:178)
>         at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:272)
>         at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:243)
>         at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
>         at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>         at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
> {code}

--
This message was sent by Atlassian JIRA
(v6.1#6144)
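The trace above shows the NullPointerException surfacing in Text.set, called from the record reader in SolrDeleteDuplicates, which is consistent with a null field value being passed on. The following is only a rough sketch of what "avoid NPE" could look like, not the committed change: a plain Map stands in for a Solr document, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; this is NOT the code committed for NUTCH-1100.
class DedupNullGuardSketch {

    // Returns the ids of the documents that carry a digest, skipping the others.
    // Each Map is an assumed stand-in for one document returned by Solr.
    public static List<String> idsWithDigest(List<Map<String, Object>> docs) {
        List<String> ids = new ArrayList<>();
        for (Map<String, Object> doc : docs) {
            Object digest = doc.get("digest");
            if (digest == null) {
                // Guard: a null digest is never handed on to the next step.
                continue;
            }
            ids.add(String.valueOf(doc.get("id")));
        }
        return ids;
    }

    public static void main(String[] args) {
        Map<String, Object> crawled = new HashMap<>();
        crawled.put("id", "http://example.org/");
        crawled.put("digest", "abc123");

        Map<String, Object> foreign = new HashMap<>();
        foreign.put("id", "doc-from-elsewhere"); // no digest field at all

        System.out.println(idsWithDigest(Arrays.asList(crawled, foreign)));
    }
}
```

Skipping is only one possible policy; substituting a default value for the missing field would be another, with different consequences for which documents count as duplicates.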
[jira] [Commented] (NUTCH-1100) SolrDedup broken
[ https://issues.apache.org/jira/browse/NUTCH-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446002#comment-13446002 ]

Luca Cavanna commented on NUTCH-1100:
-------------------------------------

The problem with the approach I mentioned before is that the digest field would need to be indexed in the Solr schema; otherwise that query would always return 0 results.

> SolrDedup broken
> ----------------
>
>                 Key: NUTCH-1100
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1100
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.4
>            Reporter: Markus Jelsma
>             Fix For: 1.6
>
>         Attachments: NUTCH-1100-1.6-1.patch

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1100) SolrDedup broken
[ https://issues.apache.org/jira/browse/NUTCH-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13445930#comment-13445930 ]

Luca Cavanna commented on NUTCH-1100:
-------------------------------------

I agree, it would make even more sense to filter the query like this: digest:[* TO *]. This way Nutch wouldn't even iterate over documents that don't have a value for the digest field.
Unfortunately this problem is pretty common: it happens all the time if Solr holds documents that don't come from Nutch alongside the crawled documents.

> SolrDedup broken
> ----------------
>
>                 Key: NUTCH-1100
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1100
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.4
>            Reporter: Markus Jelsma
>             Fix For: 1.6
>
>         Attachments: NUTCH-1100-1.6-1.patch
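The digest:[* TO *] filter suggested above could be put together roughly as follows. This is a sketch in plain Java with no Solr client involved; the class name, the method names and the base URL are assumptions made for illustration. As noted elsewhere in this thread, the query only matches anything if the digest field is indexed.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper; the names and the base URL are illustrative assumptions.
class DigestFilterQuerySketch {

    // Open-ended range query: matches only documents that have a value in the field.
    public static String hasValueQuery(String field) {
        return field + ":[* TO *]";
    }

    // Appends the URL-encoded query to a select request on the given base URL.
    public static String selectUrl(String baseUrl, String query) {
        try {
            return baseUrl + "/select?q=" + URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a charset every Java platform is required to support.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String query = hasValueQuery("digest");
        System.out.println(query);
        System.out.println(selectUrl("http://localhost:8983/solr", query));
    }
}
```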
[jira] [Commented] (NUTCH-1100) SolrDedup broken
[ https://issues.apache.org/jira/browse/NUTCH-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13437843#comment-13437843 ]

lufeng commented on NUTCH-1100:
-------------------------------

Maybe it is a configuration problem. Did you change the field mapping in solrindex-mapping.xml? If you changed the dest name of the field, Solr will not find the digest field.

> SolrDedup broken
> ----------------
>
>                 Key: NUTCH-1100
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1100
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.4
>            Reporter: Markus Jelsma
>             Fix For: 1.6
>
>         Attachments: NUTCH-1100-1.6-1.patch
[jira] [Commented] (NUTCH-1100) SolrDedup broken
[ https://issues.apache.org/jira/browse/NUTCH-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13425207#comment-13425207 ]

Hernan commented on NUTCH-1100:
-------------------------------

These fields are required:
SolrConstants.ID_FIELD ("id")
SolrConstants.BOOST_FIELD ("boost")
SolrConstants.TIMESTAMP_FIELD ("tstamp")
SolrConstants.DIGEST_FIELD ("digest")

If you indexed documents into Solr outside of Nutch, for example with DataImportHandler, you should set these fields in one of the following ways:
a) Add the fields when you index your documents.
b) To copy the values from another field, add the following to schema-solr4.xml:
c) Modify the SolrDeleteDuplicates source along the lines of the attached patch, but for all of the fields (boost, tstamp, digest); the id field should already be set.
d) Change SOLR_GET_ALL_QUERY to select only the records generated by Nutch (this might be a good generic change).

Sorry for my lousy English.

> SolrDedup broken
> ----------------
>
>                 Key: NUTCH-1100
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1100
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.4
>            Reporter: Markus Jelsma
>             Fix For: 1.6
>
>         Attachments: NUTCH-1100-1.6-1.patch
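To find documents that would trip the dedup job, one could check each document for the four fields listed in the comment above. The sketch below assumes a plain Map as a stand-in for a Solr document; the class and method names are hypothetical, and only the field names ("id", "boost", "tstamp", "digest") come from the comment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical check; only the four field names are taken from the thread.
class RequiredFieldsSketch {

    static final List<String> REQUIRED = Arrays.asList("id", "boost", "tstamp", "digest");

    // Returns the required fields that are absent or null in the given document.
    public static List<String> missingFields(Map<String, Object> doc) {
        List<String> missing = new ArrayList<>();
        for (String field : REQUIRED) {
            if (doc.get(field) == null) {
                missing.add(field);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // A document indexed outside of Nutch might carry only an id.
        Map<String, Object> imported = new HashMap<>();
        imported.put("id", "row-42");
        System.out.println(missingFields(imported));
    }
}
```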
[jira] [Commented] (NUTCH-1100) SolrDedup broken
[ https://issues.apache.org/jira/browse/NUTCH-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13280357#comment-13280357 ]

Ashish Shrowty commented on NUTCH-1100:
---------------------------------------

Were you able to resolve this issue? I am consistently getting this error ...

> SolrDedup broken
> ----------------
>
>                 Key: NUTCH-1100
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1100
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.4
>            Reporter: Markus Jelsma
>             Fix For: 1.6
[jira] [Commented] (NUTCH-1100) SolrDedup broken
fw commented on NUTCH-1100:
---------------------------

I think this is caused by the "digest" field: Nutch did not pick up its value. I think this can be resolved through the configuration files.
[jira] [Commented] (NUTCH-1100) SolrDedup broken
[ https://issues.apache.org/jira/browse/NUTCH-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13094439#comment-13094439 ]

Markus Jelsma commented on NUTCH-1100:
--------------------------------------

The above exception can appear out of thin air; I've seen it happen time and time again. Just now I saw a long-running test cycle suddenly repair itself. The dedup job first failed weeks ago and kept failing at every cycle until just now. I still have no idea how to reproduce this behaviour consistently.

> SolrDedup broken
> ----------------
>
>                 Key: NUTCH-1100
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1100
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.4
>            Reporter: Markus Jelsma
>             Fix For: 1.4