Set logging to debug; HttpClient will then log what is being sent over the wire, so you can capture the data. It is less tedious than Wireshark.
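For reference, wire logging can be switched on through log4j. A minimal sketch, assuming the log4j 1.x setup that ships with Nutch 1.12, and the `org.apache.http` logger names used by Apache HttpComponents (which SolrJ 5.x is built on):

```properties
# Log the full request/response bodies exchanged with Solr (very verbose)
log4j.logger.org.apache.http.wire=DEBUG
# Or, to see only the HTTP headers instead of the whole payload:
log4j.logger.org.apache.http.headers=DEBUG
```

With the wire logger enabled you can grep the task logs for the offending document around the byte offset reported by Solr.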
-----Original message-----
> From: Michael Coffey <mcof...@yahoo.com.INVALID>
> Sent: Friday, September 1, 2017 5:12
> To: user@nutch.apache.org
> Subject: Re: invalid utf8 chars when indexing or cleaning
>
> It sounds like a good suggestion, but I don't know what you mean by "verify
> the output Nutch generates and inspect it manually." How do I get a look at
> that XML?
>
> From:
> To: "user@nutch.apache.org" <user@nutch.apache.org>
> Sent: Thursday, August 31, 2017 11:59 AM
> Subject: RE: invalid utf8 chars when indexing or cleaning
>
> The bug is identical, but I fixed it! You should verify the output Nutch
> generates and inspect it manually; there should be a 0xffff at that byte. If
> it really is there, we need to check the fix once more, although I am sure
> the patch works as intended.
>
> Get the XML, pass it through the method, and see what it does to the output.
>
> > -----Original message-----
> > From: Jorge Betancourt <betancourt.jo...@gmail.com>
> > Sent: Tuesday, August 29, 2017 21:54
> > To: user@nutch.apache.org
> > Subject: Re: invalid utf8 chars when indexing or cleaning
> >
> > From the logs it looks like the error is coming from the Solr side. Do you
> > mind checking/sharing the logs on your Solr server? Can you pinpoint which
> > URL is causing the issue?
> >
> > Best Regards,
> > Jorge
> >
> > On Tue, Aug 29, 2017 at 9:25 PM, Michael Coffey <mcof...@yahoo.com.invalid>
> > wrote:
> > Does anybody have any thoughts on this? It seems similar to the NUTCH-1016
> > bug that was fixed in version 1.4.
> > Some more bits of information: the indexer job rarely fails (only 1 of the
> > last 99 segments), but the cleaning job now fails every time. Once again,
> > this is Nutch 1.12 and Solr 5.4.1. I recently upgraded to Hadoop 2.7.4 and
> > Java 1.8 from Hadoop 2.7.2 and Java 1.7. Could this be some kind of
> > version mismatch?
> >
> > To: User <user@nutch.apache.org>
> > Sent: Thursday, August 24, 2017 7:42 PM
> > Subject: invalid utf8 chars when indexing or cleaning
> >
> > Lately, I have seen many tasks and jobs fail in Solr when doing nutch index
> > and nutch clean.
> >
> > Messages during indexing look like this:
> >
> > 17/08/24 19:18:59 INFO mapreduce.Job: map 100% reduce 99%
> > 17/08/24 19:19:36 INFO mapreduce.Job: Task Id : attempt_1502929850483_1329_r_000007_2, Status : FAILED
> > Error: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://codero4.neocortix.com:8984/solr/popular: [com.ctc.wstx.exc.WstxLazyException] Invalid UTF-8 character 0xffff at char #104705, byte #219135)
> >         at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:575)
> >         at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:241)
> >         at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:230)
> >         at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1220)
> >         at org.apache.nutch.indexwriter.solr.SolrIndexWriter.push(SolrIndexWriter.java:209)
> >         at org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:173)
> >         at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:85)
> >         at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:50)
> >         at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:41)
> >         at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.write(ReduceTask.java:493)
> >
> > Messages during cleaning look like this:
> > 17/08/22 09:24:01 INFO mapreduce.Job: map 100% reduce 92%
> > 17/08/22 09:25:57 INFO mapreduce.Job: Task Id : attempt_1502929850483_1016_r_000003_1, Status : FAILED
> > Error: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://codero4.neocortix.com:8984/solr/popular: [com.ctc.wstx.exc.WstxLazyException] Invalid UTF-8 character 0xffff at char #16099, byte #16383)
> >         at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:575)
> >         at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:241)
> >         at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:230)
> >         at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:150)
> >         at org.apache.solr.client.solrj.SolrClient.deleteById(SolrClient.java:825)
> >         at org.apache.solr.client.solrj.SolrClient.deleteById(SolrClient.java:788)
> >         at org.apache.solr.client.solrj.SolrClient.deleteById(SolrClient.java:803)
> >         at org.apache.nutch.indexwriter.solr.SolrIndexWriter.push(SolrIndexWriter.java:222)
> >         at org.apache.nutch.indexwriter.solr.SolrIndexWriter.commit(SolrIndexWriter.java:187)
> >         at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:178)
> >         at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:115)
> >         at org.apache.nutch.indexer.CleaningJob$DeleterReducer.close(CleaningJob.java:120)
> >         at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:245)
> >
> > Can anyone suggest a way to fix this? I am using Nutch 1.12 and Solr 5.4.1.
> > I recently upgraded to Hadoop 2.7.4 and Java 1.8. I don't remember noticing
> > this happening with Hadoop 2.7.2 and Java 1.7. It happens very often now.
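For anyone wanting to reproduce the "pass it through the method" check mentioned above: the NUTCH-1016 style fix boils down to filtering out code points that are not legal XML 1.0 characters (U+FFFF among them) before the update is serialized for Solr. The class and method names below are illustrative, not Nutch's actual API; the legal ranges come from the XML 1.0 `Char` production:

```java
// Hypothetical helper mirroring the idea behind the NUTCH-1016 fix:
// drop code points that are not legal in XML 1.0 (e.g. the U+FFFF
// non-character that Woodstox rejects on the Solr side).
public class XmlCharStripper {

    /** Returns the input with characters invalid in XML 1.0 removed. */
    public static String stripNonXmlChars(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            if (isLegalXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp); // advance past surrogate pairs
        }
        return out.toString();
    }

    // XML 1.0 Char ::= #x9 | #xA | #xD | [#x20-#xD7FF]
    //                | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    private static boolean isLegalXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }
}
```

Feeding the extracted XML through a filter like this should make the `WstxLazyException` disappear if the 0xffff byte really is the culprit.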