[jira] [Commented] (NUTCH-1100) SolrDedup broken

2013-11-11 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13819658#comment-13819658
 ] 

Hudson commented on NUTCH-1100:
---

SUCCESS: Integrated in Nutch-trunk #2419 (See 
[https://builds.apache.org/job/Nutch-trunk/2419/])
NUTCH-1100 avoid NPE in SOLRDedup (jnioche: 
http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1540758)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrDeleteDuplicates.java


> SolrDedup broken
> 
>
> Key: NUTCH-1100
> URL: https://issues.apache.org/jira/browse/NUTCH-1100
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.4
>Reporter: Markus Jelsma
> Fix For: 1.9
>
> Attachments: NUTCH-1100-1.6-1.patch
>
>
> Some Solr indices are unable to be deduped from Nutch. For unknown reasons 
> Nutch will throw the exception below. There are no peculiarities to be found 
> in the Solr logs, the queries are normal and seem to succeed.
> {code}
> java.lang.NullPointerException
> at org.apache.hadoop.io.Text.encode(Text.java:388)
> at org.apache.hadoop.io.Text.set(Text.java:178)
> at 
> org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:272)
> at 
> org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:243)
> at 
> org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
> at 
> org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
> at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
> {code}



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (NUTCH-1100) SolrDedup broken

2012-08-31 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446002#comment-13446002
 ] 

Luca Cavanna commented on NUTCH-1100:
-

The problem with the approach I mentioned before is that the digest field would 
need to be indexed in the Solr schema; otherwise that query would always return 
0 results.
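
As an illustration only, a minimal schema.xml sketch of such an indexed digest 
field; the type name "string" is an assumption and must match a field type 
defined in your schema.
{code}
<!-- sketch: digest must be indexed for digest:... queries to match anything -->
<field name="digest" type="string" stored="true" indexed="true"/>
{code}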


> SolrDedup broken
> 
>
> Key: NUTCH-1100
> URL: https://issues.apache.org/jira/browse/NUTCH-1100
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.4
>Reporter: Markus Jelsma
> Fix For: 1.6
>
> Attachments: NUTCH-1100-1.6-1.patch
>
>
> Some Solr indices are unable to be deduped from Nutch. For unknown reasons 
> Nutch will throw the exception below. There are no peculiarities to be found 
> in the Solr logs, the queries are normal and seem to succeed.
> {code}
> java.lang.NullPointerException
> at org.apache.hadoop.io.Text.encode(Text.java:388)
> at org.apache.hadoop.io.Text.set(Text.java:178)
> at 
> org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:272)
> at 
> org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:243)
> at 
> org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
> at 
> org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
> at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1100) SolrDedup broken

2012-08-31 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13445930#comment-13445930
 ] 

Luca Cavanna commented on NUTCH-1100:
-

I agree, it would make even more sense to filter the query like this: 
digest:[* TO *].
This way Nutch wouldn't even iterate over documents that don't have a value for 
the digest field.
Unfortunately this problem is pretty common: it happens all the time when you 
have documents in Solr that don't come from Nutch alongside the crawled 
documents.
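
As an illustration only, a standalone SolrJ sketch of such a restricted query, 
assuming a SolrJ 4.x-style client (HttpSolrServer) and a hypothetical local 
Solr URL; this is not the actual SolrDeleteDuplicates code, just the query 
shape suggested above.
{code}
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class DedupQuerySketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical Solr URL; adjust for your installation.
    SolrServer solr = new HttpSolrServer("http://localhost:8983/solr");

    SolrQuery query = new SolrQuery("*:*");
    // Only return documents that actually carry a digest value,
    // so the dedup reader never sees a null digest.
    query.addFilterQuery("digest:[* TO *]");
    query.setFields("id", "boost", "tstamp", "digest");
    query.setRows(100);

    QueryResponse rsp = solr.query(query);
    for (SolrDocument doc : rsp.getResults()) {
      System.out.println(doc.getFieldValue("id") + " -> " + doc.getFieldValue("digest"));
    }
  }
}
{code}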

> SolrDedup broken
> 
>
> Key: NUTCH-1100
> URL: https://issues.apache.org/jira/browse/NUTCH-1100
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.4
>Reporter: Markus Jelsma
> Fix For: 1.6
>
> Attachments: NUTCH-1100-1.6-1.patch
>
>
> Some Solr indices are unable to be deduped from Nutch. For unknown reasons 
> Nutch will throw the exception below. There are no peculiarities to be found 
> in the Solr logs, the queries are normal and seem to succeed.
> {code}
> java.lang.NullPointerException
> at org.apache.hadoop.io.Text.encode(Text.java:388)
> at org.apache.hadoop.io.Text.set(Text.java:178)
> at 
> org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:272)
> at 
> org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:243)
> at 
> org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
> at 
> org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
> at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1100) SolrDedup broken

2012-08-20 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13437843#comment-13437843
 ] 

lufeng commented on NUTCH-1100:
---

Maybe it is a configuration problem. Did you change the field mapping in 
solrindex-mapping.xml? If you changed the dest name of the field, Solr will not 
find the digest field.

> SolrDedup broken
> 
>
> Key: NUTCH-1100
> URL: https://issues.apache.org/jira/browse/NUTCH-1100
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.4
>Reporter: Markus Jelsma
> Fix For: 1.6
>
> Attachments: NUTCH-1100-1.6-1.patch
>
>
> Some Solr indices are unable to be deduped from Nutch. For unknown reasons 
> Nutch will throw the exception below. There are no peculiarities to be found 
> in the Solr logs, the queries are normal and seem to succeed.
> {code}
> java.lang.NullPointerException
> at org.apache.hadoop.io.Text.encode(Text.java:388)
> at org.apache.hadoop.io.Text.set(Text.java:178)
> at 
> org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:272)
> at 
> org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:243)
> at 
> org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
> at 
> org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
> at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1100) SolrDedup broken

2012-07-30 Thread Hernan (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13425207#comment-13425207
 ] 

Hernan commented on NUTCH-1100:
---

These fields are required:

SolrConstants.ID_FIELD ("id")
SolrConstants.BOOST_FIELD ("boost")
SolrConstants.TIMESTAMP_FIELD ("tstamp")
SolrConstants.DIGEST_FIELD ("digest")


If you have indexed documents in Solr outside of Nutch, for example via 
DataImportHandler, you should set these fields in one of the following ways:

a) Add the fields when you index your documents

b) To copy them from other fields, add copyField directives to schema-solr4.xml 
(see the sketch at the end of this comment)

c) Modify the SolrDeleteDuplicates source similarly to the attached patch, but 
for all fields (boost, tstamp, digest); the id field should already be set

d) Change SOLR_GET_ALL_QUERY to select only the records generated by Nutch 
(this might be a good generic change)

Sorry for my lousy English.
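
As an illustration only, a sketch of what the copyField lines mentioned in b) 
might look like in schema-solr4.xml; the source field names are hypothetical 
and depend on what your non-Nutch import actually populates.
{code}
<!-- hypothetical source fields; replace with whatever your import fills in -->
<copyField source="last_modified" dest="tstamp"/>
<copyField source="checksum" dest="digest"/>
{code}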

> SolrDedup broken
> 
>
> Key: NUTCH-1100
> URL: https://issues.apache.org/jira/browse/NUTCH-1100
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.4
>Reporter: Markus Jelsma
> Fix For: 1.6
>
> Attachments: NUTCH-1100-1.6-1.patch
>
>
> Some Solr indices are unable to be deduped from Nutch. For unknown reasons 
> Nutch will throw the exception below. There are no peculiarities to be found 
> in the Solr logs, the queries are normal and seem to succeed.
> {code}
> java.lang.NullPointerException
> at org.apache.hadoop.io.Text.encode(Text.java:388)
> at org.apache.hadoop.io.Text.set(Text.java:178)
> at 
> org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:272)
> at 
> org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:243)
> at 
> org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
> at 
> org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
> at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1100) SolrDedup broken

2012-05-21 Thread Ashish Shrowty (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13280357#comment-13280357
 ] 

Ashish Shrowty commented on NUTCH-1100:
---

Were you able to resolve this issue? I am consistently getting this error ...

> SolrDedup broken
> 
>
> Key: NUTCH-1100
> URL: https://issues.apache.org/jira/browse/NUTCH-1100
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.4
>Reporter: Markus Jelsma
> Fix For: 1.6
>
>
> Some Solr indices are unable to be deduped from Nutch. For unknown reasons 
> Nutch will throw the exception below. There are no peculiarities to be found 
> in the Solr logs, the queries are normal and seem to succeed.
> {code}
> java.lang.NullPointerException
> at org.apache.hadoop.io.Text.encode(Text.java:388)
> at org.apache.hadoop.io.Text.set(Text.java:178)
> at 
> org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:272)
> at 
> org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:243)
> at 
> org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
> at 
> org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
> at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1100) SolrDedup broken

2012-05-14 Thread fw (JIRA)

fw commented on NUTCH-1100:
---

I think this is caused by the "digest" field; Nutch did not pick up its value.
I think this can be resolved by adjusting the conf files.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira






[jira] [Commented] (NUTCH-1100) SolrDedup broken

2011-08-31 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13094439#comment-13094439
 ] 

Markus Jelsma commented on NUTCH-1100:
--

The above exception can appear out of thin air; I've seen it happen time and 
time again. Just now I suddenly saw a long-running test cycle magically repair 
itself. The dedup job failed weeks ago for the first time and continued to fail 
on each cycle until just now.

I still have no idea how to consistently reproduce this behaviour.

> SolrDedup broken
> 
>
> Key: NUTCH-1100
> URL: https://issues.apache.org/jira/browse/NUTCH-1100
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.4
>Reporter: Markus Jelsma
> Fix For: 1.4
>
>
> Some Solr indices are unable to be deduped from Nutch. For unknown reasons 
> Nutch will throw the exception below. There are no peculiarities to be found 
> in the Solr logs, the queries are normal and seem to succeed.
> {code}
> java.lang.NullPointerException
> at org.apache.hadoop.io.Text.encode(Text.java:388)
> at org.apache.hadoop.io.Text.set(Text.java:178)
> at 
> org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:272)
> at 
> org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:243)
> at 
> org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
> at 
> org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
> at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
> {code}

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira