Hi - this is NUTCH-1016, which was never ported to 2.x: 0xFFFF is a Unicode non-character that XML 1.0 forbids, so the Woodstox parser behind Solr's XML update handler rejects the whole update request.

https://issues.apache.org/jira/browse/NUTCH-1016
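
Until that is ported, one workaround is to strip the offending codepoints from each field value before the document is handed to SolrJ. The sketch below is only an illustration of that idea (class and method names are mine, modelled on the non-character-stripping helper NUTCH-1016 added to the 1.x Solr indexer); in 2.x it would have to be wired into SolrIndexWriter for every string field it writes.

    // Sketch only - not the committed NUTCH-1016 patch.
    public class NonCharStripper {

      /**
       * Drops the BMP non-characters U+FFFE/U+FFFF and U+FDD0..U+FDEF, plus
       * control characters other than tab, LF and CR - all of which the
       * XML 1.0 parser (Woodstox) behind Solr's XML update handler rejects.
       */
      public static String stripNonCharCodepoints(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
          char ch = input.charAt(i);
          boolean illegal =
              ch == 0xFFFF || ch == 0xFFFE                         // non-characters
              || (ch >= 0xFDD0 && ch <= 0xFDEF)                    // non-character block
              || (ch < 0x20 && ch != 0x9 && ch != 0xA && ch != 0xD); // control chars
          if (!illegal) {
            out.append(ch);
          }
        }
        return out.toString();
      }

      public static void main(String[] args) {
        String dirty = "title\uFFFF with a stray non-character";
        // Prints "title with a stray non-character"
        System.out.println(stripNonCharCodepoints(dirty));
      }
    }

With a filter like this in place the mapper no longer trips Woodstox; without it the job keeps retrying and failing on the same document, which is exactly what the repeated task attempts in the log below show.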

-----Original message-----
> From:Kshitij Shukla <kshiti...@cisinlabs.com>
> Sent: Monday 25th January 2016 8:23
> To: user@nutch.apache.org
> Subject: [CIS-CMMI-3] Invalid UTF-8 character 0xffff at char exception
> 
> Hello everyone,
> 
> During a very large crawl, indexing to Solr yields the following exception:
> 
> **************************************************
> root@cism479:/usr/share/searchEngine/apache-nutch-2.3.1/runtime/deploy/bin# 
> /usr/share/searchEngine/apache-nutch-2.3.1/runtime/deploy/bin/nutch 
> index -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D 
> mapred.reduce.tasks.speculative.execution=false -D 
> mapred.map.tasks.speculative.execution=false -D 
> mapred.compress.map.output=true -D 
> solr.server.url=http://localhost:8983/solr/ddcds -all -crawlId 1
> 16/01/25 11:44:52 INFO indexer.IndexingJob: IndexingJob: starting
> 16/01/25 11:44:53 INFO Configuration.deprecation: 
> mapred.output.key.comparator.class is deprecated. Instead, use 
> mapreduce.job.output.key.comparator.class
> 16/01/25 11:44:53 INFO plugin.PluginRepository: Plugins: looking in: 
> /tmp/hadoop-root/hadoop-unjar4772724649160367470/classes/plugins
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Plugin Auto-activation 
> mode: [true]
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Registered Plugins:
> 16/01/25 11:44:54 INFO plugin.PluginRepository:     HTTP Framework 
> (lib-http)
> 16/01/25 11:44:54 INFO plugin.PluginRepository:     Html Parse Plug-in 
> (parse-html)
> 16/01/25 11:44:54 INFO plugin.PluginRepository:     MetaTags 
> (parse-metatags)
> 16/01/25 11:44:54 INFO plugin.PluginRepository:     Html Indexing Filter 
> (index-html)
> 16/01/25 11:44:54 INFO plugin.PluginRepository:     the nutch core 
> extension points (nutch-extensionpoints)
> 16/01/25 11:44:54 INFO plugin.PluginRepository:     Basic Indexing 
> Filter (index-basic)
> 16/01/25 11:44:54 INFO plugin.PluginRepository:     XML Libraries (lib-xml)
> 16/01/25 11:44:54 INFO plugin.PluginRepository:     Anchor Indexing 
> Filter (index-anchor)
> 16/01/25 11:44:54 INFO plugin.PluginRepository:     Basic URL Normalizer 
> (urlnormalizer-basic)
> 16/01/25 11:44:54 INFO plugin.PluginRepository:     Language 
> Identification Parser/Filter (language-identifier)
> 16/01/25 11:44:54 INFO plugin.PluginRepository:     Metadata Indexing 
> Filter (index-metadata)
> 16/01/25 11:44:54 INFO plugin.PluginRepository:     CyberNeko HTML 
> Parser (lib-nekohtml)
> 16/01/25 11:44:54 INFO plugin.PluginRepository:     Subcollection 
> indexing and query filter (subcollection)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: SOLRIndexWriter 
> (indexer-solr)
> 16/01/25 11:44:54 INFO plugin.PluginRepository:     Rel-Tag microformat 
> Parser/Indexer/Querier (microformats-reltag)
> 16/01/25 11:44:54 INFO plugin.PluginRepository:     Http / Https 
> Protocol Plug-in (protocol-httpclient)
> 16/01/25 11:44:54 INFO plugin.PluginRepository:     JavaScript Parser 
> (parse-js)
> 16/01/25 11:44:54 INFO plugin.PluginRepository:     Tika Parser Plug-in 
> (parse-tika)
> 16/01/25 11:44:54 INFO plugin.PluginRepository:     Top Level Domain 
> Plugin (tld)
> 16/01/25 11:44:54 INFO plugin.PluginRepository:     Regex URL Filter 
> Framework (lib-regex-filter)
> 16/01/25 11:44:54 INFO plugin.PluginRepository:     Regex URL Normalizer 
> (urlnormalizer-regex)
> 16/01/25 11:44:54 INFO plugin.PluginRepository:     Link Analysis 
> Scoring Plug-in (scoring-link)
> 16/01/25 11:44:54 INFO plugin.PluginRepository:     OPIC Scoring Plug-in 
> (scoring-opic)
> 16/01/25 11:44:54 INFO plugin.PluginRepository:     More Indexing Filter 
> (index-more)
> 16/01/25 11:44:54 INFO plugin.PluginRepository:     Http Protocol 
> Plug-in (protocol-http)
> 16/01/25 11:44:54 INFO plugin.PluginRepository:     Creative Commons 
> Plugins (creativecommons)
> 16/01/25 11:44:54 INFO plugin.PluginRepository: Registered Extension-Points:
> 16/01/25 11:44:54 INFO plugin.PluginRepository:     Parse Filter 
> (org.apache.nutch.parse.ParseFilter)
> 16/01/25 11:44:54 INFO plugin.PluginRepository:     Nutch Index Cleaning 
> Filter (org.apache.nutch.indexer.IndexCleaningFilter)
> 16/01/25 11:44:54 INFO plugin.PluginRepository:     Nutch Content Parser 
> (org.apache.nutch.parse.Parser)
> 16/01/25 11:44:54 INFO plugin.PluginRepository:     Nutch URL Filter 
> (org.apache.nutch.net.URLFilter)
> 16/01/25 11:44:54 INFO plugin.PluginRepository:     Nutch Scoring 
> (org.apache.nutch.scoring.ScoringFilter)
> 16/01/25 11:44:54 INFO plugin.PluginRepository:     Nutch URL Normalizer 
> (org.apache.nutch.net.URLNormalizer)
> 16/01/25 11:44:54 INFO plugin.PluginRepository:     Nutch Protocol 
> (org.apache.nutch.protocol.Protocol)
> 16/01/25 11:44:54 INFO plugin.PluginRepository:     Nutch Index Writer 
> (org.apache.nutch.indexer.IndexWriter)
> 16/01/25 11:44:54 INFO plugin.PluginRepository:     Nutch Indexing 
> Filter (org.apache.nutch.indexer.IndexingFilter)
> 16/01/25 11:44:54 INFO indexer.IndexingFilters: Adding 
> org.apache.nutch.indexer.html.HtmlIndexingFilter
> 16/01/25 11:44:54 INFO basic.BasicIndexingFilter: Maximum title length 
> for indexing set to: 100
> 16/01/25 11:44:54 INFO indexer.IndexingFilters: Adding 
> org.apache.nutch.indexer.basic.BasicIndexingFilter
> 16/01/25 11:44:54 INFO anchor.AnchorIndexingFilter: Anchor deduplication 
> is: off
> 16/01/25 11:45:07 INFO mapreduce.JobSubmitter: Submitting tokens for 
> job: job_1453472314066_0007
> 16/01/25 11:45:08 INFO impl.YarnClientImpl: Submitted application 
> application_1453472314066_0007
> 16/01/25 11:45:09 INFO mapreduce.Job: The url to track the job: 
> http://cism479:8088/proxy/application_1453472314066_0007/
> 16/01/25 11:45:09 INFO mapreduce.Job: Running job: job_1453472314066_0007
> 16/01/25 11:45:29 INFO mapreduce.Job: Job job_1453472314066_0007 running 
> in uber mode : false
> 16/01/25 11:45:29 INFO mapreduce.Job:  map 0% reduce 0%
> 16/01/25 11:49:24 INFO mapreduce.Job:  map 50% reduce 0%
> 16/01/25 11:49:29 INFO mapreduce.Job:  map 0% reduce 0%
> 16/01/25 11:49:29 INFO mapreduce.Job: Task Id : 
> attempt_1453472314066_0007_m_000000_0, Status : FAILED
> Error: 
> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: 
> [com.ctc.wstx.exc.WstxLazyException] Invalid UTF-8 character 0xffff at 
> char #1296459, byte #1310719)
>      at 
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:491)
>      at 
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
>      at 
> org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
>      at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
>      at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
>      at 
> org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:84)
>      at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:84)
>      at 
> org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:48)
>      at 
> org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:43)
>      at 
> org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:635)
>      at 
> org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
>      at 
> org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
>      at 
> org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:120)
>      at 
> org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:69)
>      at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
>      at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>      at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
>      at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
>      at java.security.AccessController.doPrivileged(Native Method)
>      at javax.security.auth.Subject.doAs(Subject.java:422)
>      at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
>      at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
> 
> 16/01/25 11:52:27 INFO mapreduce.Job:  map 50% reduce 0%
> 16/01/25 11:53:01 INFO mapreduce.Job:  map 100% reduce 0%
> 16/01/25 11:53:01 INFO mapreduce.Job: Task Id : 
> attempt_1453472314066_0007_m_000000_1, Status : FAILED
> Error: 
> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: 
> [com.ctc.wstx.exc.WstxLazyException] Invalid UTF-8 character 0xffff at 
> char #1296459, byte #1310719)
>      [stack trace identical to the first failed attempt above]
> 
> 16/01/25 11:53:02 INFO mapreduce.Job:  map 50% reduce 0%
> 16/01/25 11:54:52 INFO mapreduce.Job:  map 100% reduce 0%
> 16/01/25 11:54:52 INFO mapreduce.Job: Task Id : 
> attempt_1453472314066_0007_m_000000_2, Status : FAILED
> Error: 
> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: 
> [com.ctc.wstx.exc.WstxLazyException] Invalid UTF-8 character 0xffff at 
> char #1296459, byte #1310719)
>      [stack trace identical to the first failed attempt above]
> 
> 16/01/25 11:54:53 INFO mapreduce.Job:  map 50% reduce 0%
> 16/01/25 11:56:22 INFO mapreduce.Job:  map 100% reduce 0%
> 16/01/25 11:56:23 INFO mapreduce.Job: Job job_1453472314066_0007 failed 
> with state FAILED due to: Task failed task_1453472314066_0007_m_000000
> Job failed as tasks failed. failedMaps:1 failedReduces:0
> 
> 16/01/25 11:56:23 INFO mapreduce.Job: Counters: 33
>      File System Counters
>          FILE: Number of bytes read=0
>          FILE: Number of bytes written=116194
>          FILE: Number of read operations=0
>          FILE: Number of large read operations=0
>          FILE: Number of write operations=0
>          HDFS: Number of bytes read=1033
>          HDFS: Number of bytes written=0
>          HDFS: Number of read operations=1
>          HDFS: Number of large read operations=0
>          HDFS: Number of write operations=0
>      Job Counters
>          Failed map tasks=4
>          Launched map tasks=5
>          Other local map tasks=3
>          Data-local map tasks=2
>          Total time spent by all maps in occupied slots (ms)=3168342
>          Total time spent by all reduces in occupied slots (ms)=0
>          Total time spent by all map tasks (ms)=1056114
>          Total vcore-seconds taken by all map tasks=1056114
>          Total megabyte-seconds taken by all map tasks=3244382208
>      Map-Reduce Framework
>          Map input records=2762511
>          Map output records=17629
>          Input split bytes=1033
>          Spilled Records=0
>          Failed Shuffles=0
>          Merged Map outputs=0
>          GC time elapsed (ms)=2995
>          CPU time spent (ms)=116860
>          Physical memory (bytes) snapshot=1272868864
>          Virtual memory (bytes) snapshot=5104431104
>          Total committed heap usage (bytes)=1017118720
>      IndexerJob
>          DocumentCount=17629
>      File Input Format Counters
>          Bytes Read=0
>      File Output Format Counters
>          Bytes Written=0
> 16/01/25 11:56:23 ERROR indexer.IndexingJob: SolrIndexerJob: 
> java.lang.RuntimeException: job failed: name=[1]Indexer, 
> jobid=job_1453472314066_0007
>      at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:120)
>      at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:154)
>      at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:176)
>      at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:202)
>      at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>      at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:211)
>      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>      at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>      at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>      at java.lang.reflect.Method.invoke(Method.java:497)
>      at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
> *******************************************************
> -- 
> 
> Please let me know if you have any questions, concerns or updates.
> Have a great day ahead :)
> 
> Thanks and Regards,
> 
> Kshitij Shukla
> Software developer
> 
