[ 
https://issues.apache.org/jira/browse/NUTCH-2706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16822958#comment-16822958
 ] 

Prajeeth Emanuel commented on NUTCH-2706:
-----------------------------------------

Any updates on this? Let me know if there are any details you might require to 
reproduce the bug.

As a workaround, I have updated the schema of both the Nutch indexer and Solr 
to store the binary content as text_general. Replacing the whitespace 
delimiters (\t, \n, \r) with an empty string gives the original binary content.

Are there any disadvantages or errors with this approach, apart from the 
additional space it occupies?
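
The whitespace-stripping step of the workaround can be sketched roughly as 
follows. This is only an illustration, not code from Nutch or Solr: the helper 
names (strip_whitespace, decode_binary_content) are hypothetical, and it 
assumes the stored field holds Base64 text with literal tab, newline, or 
carriage-return characters inserted into it.

```python
import base64
import re


def strip_whitespace(stored: str) -> str:
    """Remove tab, newline and carriage-return characters from the stored text."""
    return re.sub(r"[\t\n\r]", "", stored)


def decode_binary_content(stored: str) -> bytes:
    """Recover the raw bytes from a text field assumed to hold Base64 data."""
    cleaned = strip_whitespace(stored)
    # Padded Base64 always has a length that is a multiple of four; checking
    # it here mirrors the condition named in the reported error message.
    if len(cleaned) % 4 != 0:
        raise ValueError("Base64 length is not a multiple of four")
    return base64.b64decode(cleaned, validate=True)


# Illustrative round trip with made-up content (not taken from the crawl).
original = b'<!DOCTYPE html>\r\n<html lang="en">'
encoded = base64.b64encode(original).decode("ascii")
# Assumption: simulate the stored form by breaking the text every 16 characters.
stored = "\n".join(encoded[i:i + 16] for i in range(0, len(encoded), 16))
assert decode_binary_content(stored) == original
```

Checking the length before decoding makes a truncated or otherwise damaged 
value fail loudly instead of silently yielding wrong bytes.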

> -addBinaryContent flag can cause "String length must be a multiple of four" 
> error in IndexingJob
> ------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-2706
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2706
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.15
>         Environment: Solr:7.3.1
> Nutch: 1.15
>            Reporter: Prajeeth Emanuel
>            Priority: Major
>
> When running the following crawl command:
> bin/crawl -i -s /user/xxxx/seed /user/xxxx/test-crawl-8 3 
> with -addBinaryContent and -base64 added to the index command in the crawl 
> script, I get the following error:
> 2019-04-04 04:10:43,702 svnNumber= clientHw="" userId="" actionKpi="" [main] 
> WARN org.apache.hadoop.mapred.YarnChild - Exception running child : 
> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: ERROR: 
> [doc=73ad5e05e49054efa258e7c54ae9b9ee] Error adding field 
> 'binaryContent'='PCFET0NUWVBFIGh0bWw+DQo8aHRtbCBsYW5nPSJlbiI+DQo8aGVhZD4NCgk8bWV0YSBodHRwLWVx...
>  
> ...
>  
> msg=String length must be a multiple of four.
>  at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:559)
>  at org.apache.nutch.indexer.IndexWriters.commit(IndexWriters.java:251)
>  at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:47)
>  at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.close(ReduceTask.java:550)
>  at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:629)
>  at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
>  at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
>  at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
>  
> I have seen https://issues.apache.org/jira/browse/NUTCH-2186 as well. I am 
> opening a new ticket, as mentioned in the comments, because I have a 
> different environment.
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
