Hi - this is NUTCH-1016, which was never ported to 2.x. https://issues.apache.org/jira/browse/NUTCH-1016
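If you need a workaround on 2.x in the meantime: as far as I understand it, the idea behind that issue is to drop code points that the XML parser on the Solr side rejects (such as U+FFFF) before the document is sent. The snippet below is only a rough sketch of that idea, not the NUTCH-1016 patch itself. The class and method names (NonCharStripper, strip) are made up for illustration, and where you would call it in the 2.x indexer is an open question.

```java
/**
 * Hypothetical helper (not part of Nutch): removes Unicode noncharacters
 * and most C0 control characters from a string before it is indexed.
 */
final class NonCharStripper {

    private NonCharStripper() {
    }

    /** Returns a copy of input without noncharacters and control characters. */
    static String strip(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);

            // Noncharacters: U+FDD0..U+FDEF, plus the last two code points
            // of every plane (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ...).
            boolean nonChar = (cp & 0xFFFE) == 0xFFFE
                    || (cp >= 0xFDD0 && cp <= 0xFDEF);

            // Control characters below U+0020, except tab, LF and CR.
            boolean control = cp < 0x20
                    && cp != '\t' && cp != '\n' && cp != '\r';

            if (!nonChar && !control) {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }
}
```

Calling NonCharStripper.strip(value) on each string field value before it is added to the Solr document should avoid this particular exception. Unpaired surrogates are passed through unchanged in this sketch.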
-----Original message-----
> From:Kshitij Shukla <kshiti...@cisinlabs.com>
> Sent: Monday 25th January 2016 8:23
> To: user@nutch.apache.org
> Subject: [CIS-CMMI-3] Invalid UTF-8 character 0xffff at char exception
>
> Hello everyone,
>
> During a very large crawl when indexing to Solr this will yield the
> following exception:
>
> **************************************************
> root@cism479:/usr/share/searchEngine/apache-nutch-2.3.1/runtime/deploy/bin# /usr/share/searchEngine/apache-nutch-2.3.1/runtime/deploy/bin/nutch index -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -D solr.server.url=http://localhost:8983/solr/ddcds -all -crawlId 1
> 16/01/25 11:44:52 INFO indexer.IndexingJob: IndexingJob: starting
> 16/01/25 11:44:53 INFO Configuration.deprecation: mapred.output.key.comparator.class is deprecated. Instead, use mapreduce.job.output.key.comparator.class
> 16/01/25 11:44:53 INFO plugin.PluginRepository: Plugins: looking in: /tmp/hadoop-root/hadoop-unjar4772724649160367470/classes/plugins
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Plugin Auto-activation mode: [true]
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Registered Plugins:
> 16/01/25 11:44:54 INFO plugin.PluginRepository: HTTP Framework (lib-http)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Html Parse Plug-in (parse-html)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: MetaTags (parse-metatags)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Html Indexing Filter (index-html)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: the nutch core extension points (nutch-extensionpoints)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Basic Indexing Filter (index-basic)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: XML Libraries (lib-xml)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Anchor Indexing Filter (index-anchor)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Basic URL Normalizer (urlnormalizer-basic)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Language Identification Parser/Filter (language-identifier)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Metadata Indexing Filter (index-metadata)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: CyberNeko HTML Parser (lib-nekohtml)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Subcollection indexing and query filter (subcollection)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: SOLRIndexWriter (indexer-solr)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Rel-Tag microformat Parser/Indexer/Querier (microformats-reltag)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Http / Https Protocol Plug-in (protocol-httpclient)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: JavaScript Parser (parse-js)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Tika Parser Plug-in (parse-tika)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Top Level Domain Plugin (tld)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Regex URL Filter Framework (lib-regex-filter)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Regex URL Normalizer (urlnormalizer-regex)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Link Analysis Scoring Plug-in (scoring-link)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: OPIC Scoring Plug-in (scoring-opic)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: More Indexing Filter (index-more)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Http Protocol Plug-in (protocol-http)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Creative Commons Plugins (creativecommons)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Registered Extension-Points:
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Parse Filter (org.apache.nutch.parse.ParseFilter)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Index Cleaning Filter (org.apache.nutch.indexer.IndexCleaningFilter)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Content Parser (org.apache.nutch.parse.Parser)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch URL Filter (org.apache.nutch.net.URLFilter)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Protocol (org.apache.nutch.protocol.Protocol)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Index Writer (org.apache.nutch.indexer.IndexWriter)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
> 16/01/25 11:44:54 INFO indexer.IndexingFilters: Adding org.apache.nutch.indexer.html.HtmlIndexingFilter
> 16/01/25 11:44:54 INFO basic.BasicIndexingFilter: Maximum title length for indexing set to: 100
> 16/01/25 11:44:54 INFO indexer.IndexingFilters: Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
> 16/01/25 11:44:54 INFO anchor.AnchorIndexingFilter: Anchor deduplication is: off
> 16/01/25 11:45:07 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1453472314066_0007
> 16/01/25 11:45:08 INFO impl.YarnClientImpl: Submitted application application_1453472314066_0007
> 16/01/25 11:45:09 INFO mapreduce.Job: The url to track the job: http://cism479:8088/proxy/application_1453472314066_0007/
> 16/01/25 11:45:09 INFO mapreduce.Job: Running job: job_1453472314066_0007
> 16/01/25 11:45:29 INFO mapreduce.Job: Job job_1453472314066_0007 running in uber mode : false
> 16/01/25 11:45:29 INFO mapreduce.Job: map 0% reduce 0%
> 16/01/25 11:49:24 INFO mapreduce.Job: map 50% reduce 0%
> 16/01/25 11:49:29 INFO mapreduce.Job: map 0% reduce 0%
> 16/01/25 11:49:29 INFO mapreduce.Job: Task Id : attempt_1453472314066_0007_m_000000_0, Status : FAILED
> Error:
> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: [com.ctc.wstx.exc.WstxLazyException] Invalid UTF-8 character 0xffff at char #1296459, byte #1310719)
>     at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:491)
>     at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
>     at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
>     at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
>     at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
>     at org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:84)
>     at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:84)
>     at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:48)
>     at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:43)
>     at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:635)
>     at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
>     at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
>     at org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:120)
>     at org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:69)
>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
>     at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
>     at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
>
> 16/01/25 11:52:27 INFO mapreduce.Job: map 50% reduce 0%
> 16/01/25 11:53:01 INFO mapreduce.Job: map 100% reduce 0%
> 16/01/25 11:53:01 INFO mapreduce.Job: Task Id : attempt_1453472314066_0007_m_000000_1, Status : FAILED
> Error:
> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: [com.ctc.wstx.exc.WstxLazyException] Invalid UTF-8 character 0xffff at char #1296459, byte #1310719)
>     at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:491)
>     at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
>     at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
>     at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
>     at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
>     at org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:84)
>     at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:84)
>     at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:48)
>     at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:43)
>     at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:635)
>     at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
>     at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
>     at org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:120)
>     at org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:69)
>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
>     at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
>     at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
>
> 16/01/25 11:53:02 INFO mapreduce.Job: map 50% reduce 0%
> 16/01/25 11:54:52 INFO mapreduce.Job: map 100% reduce 0%
> 16/01/25 11:54:52 INFO mapreduce.Job: Task Id : attempt_1453472314066_0007_m_000000_2, Status : FAILED
> Error:
> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: [com.ctc.wstx.exc.WstxLazyException] Invalid UTF-8 character 0xffff at char #1296459, byte #1310719)
>     at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:491)
>     at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
>     at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
>     at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
>     at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
>     at org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:84)
>     at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:84)
>     at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:48)
>     at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:43)
>     at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:635)
>     at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
>     at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
>     at org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:120)
>     at org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:69)
>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
>     at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
>     at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
>
> 16/01/25 11:54:53 INFO mapreduce.Job: map 50% reduce 0%
> 16/01/25 11:56:22 INFO mapreduce.Job: map 100% reduce 0%
> 16/01/25 11:56:23 INFO mapreduce.Job: Job job_1453472314066_0007 failed with state FAILED due to: Task failed task_1453472314066_0007_m_000000
> Job failed as tasks failed. failedMaps:1 failedReduces:0
>
> 16/01/25 11:56:23 INFO mapreduce.Job: Counters: 33
>   File System Counters
>     FILE: Number of bytes read=0
>     FILE: Number of bytes written=116194
>     FILE: Number of read operations=0
>     FILE: Number of large read operations=0
>     FILE: Number of write operations=0
>     HDFS: Number of bytes read=1033
>     HDFS: Number of bytes written=0
>     HDFS: Number of read operations=1
>     HDFS: Number of large read operations=0
>     HDFS: Number of write operations=0
>   Job Counters
>     Failed map tasks=4
>     Launched map tasks=5
>     Other local map tasks=3
>     Data-local map tasks=2
>     Total time spent by all maps in occupied slots (ms)=3168342
>     Total time spent by all reduces in occupied slots (ms)=0
>     Total time spent by all map tasks (ms)=1056114
>     Total vcore-seconds taken by all map tasks=1056114
>     Total megabyte-seconds taken by all map tasks=3244382208
>   Map-Reduce Framework
>     Map input records=2762511
>     Map output records=17629
>     Input split bytes=1033
>     Spilled Records=0
>     Failed Shuffles=0
>     Merged Map outputs=0
>     GC time elapsed (ms)=2995
>     CPU time spent (ms)=116860
>     Physical memory (bytes) snapshot=1272868864
>     Virtual memory (bytes) snapshot=5104431104
>     Total committed heap usage (bytes)=1017118720
>   IndexerJob
>     DocumentCount=17629
>   File Input Format Counters
>     Bytes Read=0
>   File Output Format Counters
>     Bytes Written=0
> 16/01/25 11:56:23 ERROR indexer.IndexingJob: SolrIndexerJob: java.lang.RuntimeException: job failed: name=[1]Indexer, jobid=job_1453472314066_0007
>     at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:120)
>     at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:154)
>     at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:176)
>     at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:202)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>     at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:211)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:497)
>     at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
> *******************************************************
> --
>
> Please let me know if you have any questions, concerns or updates.
> Have a great day ahead :)
>
> Thanks and Regards,
>
> Kshitij Shukla
> Software developer