The maximum row key length in HBase is 32767 bytes (Short.MAX_VALUE), and
your application is trying to get rows whose keys exceed this limit. In
Nutch 2.x the (reversed) URL is used as the row key, which is why a
41221-byte URL trips this check.
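One workaround is to drop overlong URLs before they reach the store. Below is a minimal, hypothetical sketch in plain Java (the class name and the standalone form are my own; a real fix would go into a Nutch URLFilter plugin, which uses the convention of returning null to reject a URL). It mirrors the Short.MAX_VALUE limit that HBase's Mutation.checkRow enforces:

```java
import java.nio.charset.StandardCharsets;

public class RowKeyLengthFilter {
    // HBase rejects row keys longer than Short.MAX_VALUE (32767) bytes.
    static final int MAX_ROW_LENGTH = Short.MAX_VALUE;

    // Returns the URL unchanged if its UTF-8 byte length fits the
    // row-key limit, or null to reject it (the Nutch URLFilter convention).
    public static String filter(String url) {
        if (url == null) return null;
        int len = url.getBytes(StandardCharsets.UTF_8).length;
        return (len > 0 && len <= MAX_ROW_LENGTH) ? url : null;
    }

    public static void main(String[] args) {
        System.out.println(filter("http://example.com/") != null); // short URL passes
        StringBuilder sb = new StringBuilder("http://example.com/");
        for (int i = 0; i < 41221; i++) sb.append('a');            // build a 41221+ byte URL
        System.out.println(filter(sb.toString()) == null);         // overlong URL rejected
    }
}
```

Note that Nutch stores the reversed URL (TableUtil.reverseUrl) as the key, but reversal does not materially change the byte length, so filtering on the raw URL length is a reasonable approximation.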

You would probably be better off asking this question on the Gora/Nutch user lists.

-Vlad

On Thu, Jan 21, 2016 at 5:39 AM, Kshitij Shukla <kshiti...@cisinlabs.com>
wrote:

> Hello everyone,
>
> Software stack: nutch-branch-2.3.1, gora-hbase 0.6.1, Hadoop 2.5.2,
> hbase-0.98.8-hadoop2.
>
> I have added a set of seeds to crawl using this command:
>
> ./bin/crawl /largeSeeds 1 http://localhost:8983/solr/ddcd 4
>
> For the first iteration, all of the commands (inject, generate, fetch,
> parse, update-table, index, and delete duplicates) executed successfully.
> For the second iteration, the "CrawlDB update" command failed (please see
> the error log below for reference); because of this failure, the whole
> process terminates.
>
>
> **************************************** LOG START ****************************************
> 16/01/20 02:45:19 INFO parse.ParserJob: ParserJob: finished at 2016-01-20
> 02:45:19, time elapsed: 00:06:57
> CrawlDB update for 1
> /usr/share/searchEngine/nutch-branch-2.3.1/runtime/deploy/bin/nutch
> updatedb -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
> mapred.reduce.tasks.speculative.execution=false -D
> mapred.map.tasks.speculative.execution=false -D
> mapred.compress.map.output=true 1453230757-13191 -crawlId 1
> 16/01/20 02:45:27 INFO crawl.DbUpdaterJob: DbUpdaterJob: starting at
> 2016-01-20 02:45:27
> 16/01/20 02:45:27 INFO crawl.DbUpdaterJob: DbUpdaterJob: batchId:
> 1453230757-13191
> 16/01/20 02:45:27 INFO plugin.PluginRepository: Plugins: looking in:
> /tmp/hadoop-root/hadoop-unjar5654418190157422003/classes/plugins
> 16/01/20 02:45:28 INFO plugin.PluginRepository: Plugin Auto-activation
> mode: [true]
> 16/01/20 02:45:28 INFO plugin.PluginRepository: Registered Plugins:
> 16/01/20 02:45:28 INFO plugin.PluginRepository:     HTTP Framework
> (lib-http)
> 16/01/20 02:45:28 INFO plugin.PluginRepository:     Html Parse Plug-in
> (parse-html)
> 16/01/20 02:45:28 INFO plugin.PluginRepository:     MetaTags
> (parse-metatags)
> 16/01/20 02:45:28 INFO plugin.PluginRepository:     the nutch core
> extension points (nutch-extensionpoints)
> 16/01/20 02:45:28 INFO plugin.PluginRepository:     Basic Indexing Filter
> (index-basic)
> 16/01/20 02:45:28 INFO plugin.PluginRepository:     XML Libraries (lib-xml)
> 16/01/20 02:45:28 INFO plugin.PluginRepository:     Anchor Indexing Filter
> (index-anchor)
> 16/01/20 02:45:28 INFO plugin.PluginRepository:     Basic URL Normalizer
> (urlnormalizer-basic)
> 16/01/20 02:45:28 INFO plugin.PluginRepository:     Language
> Identification Parser/Filter (language-identifier)
> 16/01/20 02:45:28 INFO plugin.PluginRepository:     Metadata Indexing
> Filter (index-metadata)
> 16/01/20 02:45:28 INFO plugin.PluginRepository:     CyberNeko HTML Parser
> (lib-nekohtml)
> 16/01/20 02:45:28 INFO plugin.PluginRepository:     Subcollection indexing
> and query filter (subcollection)
> 16/01/20 02:45:28 INFO plugin.PluginRepository: SOLRIndexWriter
> (indexer-solr)
> 16/01/20 02:45:28 INFO plugin.PluginRepository:     Rel-Tag microformat
> Parser/Indexer/Querier (microformats-reltag)
> 16/01/20 02:45:28 INFO plugin.PluginRepository:     Http / Https Protocol
> Plug-in (protocol-httpclient)
> 16/01/20 02:45:28 INFO plugin.PluginRepository:     JavaScript Parser
> (parse-js)
> 16/01/20 02:45:28 INFO plugin.PluginRepository:     Tika Parser Plug-in
> (parse-tika)
> 16/01/20 02:45:28 INFO plugin.PluginRepository:     Top Level Domain
> Plugin (tld)
> 16/01/20 02:45:28 INFO plugin.PluginRepository:     Regex URL Filter
> Framework (lib-regex-filter)
> 16/01/20 02:45:28 INFO plugin.PluginRepository:     Regex URL Normalizer
> (urlnormalizer-regex)
> 16/01/20 02:45:28 INFO plugin.PluginRepository:     Link Analysis Scoring
> Plug-in (scoring-link)
> 16/01/20 02:45:28 INFO plugin.PluginRepository:     OPIC Scoring Plug-in
> (scoring-opic)
> 16/01/20 02:45:28 INFO plugin.PluginRepository:     More Indexing Filter
> (index-more)
> 16/01/20 02:45:28 INFO plugin.PluginRepository:     Http Protocol Plug-in
> (protocol-http)
> 16/01/20 02:45:28 INFO plugin.PluginRepository:     Creative Commons
> Plugins (creativecommons)
> 16/01/20 02:45:28 INFO plugin.PluginRepository: Registered
> Extension-Points:
> 16/01/20 02:45:28 INFO plugin.PluginRepository:     Parse Filter
> (org.apache.nutch.parse.ParseFilter)
> 16/01/20 02:45:28 INFO plugin.PluginRepository:     Nutch Index Cleaning
> Filter (org.apache.nutch.indexer.IndexCleaningFilter)
> 16/01/20 02:45:28 INFO plugin.PluginRepository:     Nutch Content Parser
> (org.apache.nutch.parse.Parser)
> 16/01/20 02:45:28 INFO plugin.PluginRepository:     Nutch URL Filter (
> org.apache.nutch.net.URLFilter)
> 16/01/20 02:45:28 INFO plugin.PluginRepository:     Nutch Scoring
> (org.apache.nutch.scoring.ScoringFilter)
> 16/01/20 02:45:28 INFO plugin.PluginRepository:     Nutch URL Normalizer (
> org.apache.nutch.net.URLNormalizer)
> 16/01/20 02:45:28 INFO plugin.PluginRepository:     Nutch Protocol
> (org.apache.nutch.protocol.Protocol)
> 16/01/20 02:45:28 INFO plugin.PluginRepository:     Nutch Index Writer
> (org.apache.nutch.indexer.IndexWriter)
> 16/01/20 02:45:28 INFO plugin.PluginRepository:     Nutch Indexing Filter
> (org.apache.nutch.indexer.IndexingFilter)
> 16/01/20 02:45:29 INFO Configuration.deprecation:
> mapred.map.tasks.speculative.execution is deprecated. Instead, use
> mapreduce.map.speculative
> 16/01/20 02:45:29 INFO Configuration.deprecation:
> mapred.reduce.tasks.speculative.execution is deprecated. Instead, use
> mapreduce.reduce.speculative
> 16/01/20 02:45:29 INFO Configuration.deprecation:
> mapred.compress.map.output is deprecated. Instead, use
> mapreduce.map.output.compress
> 16/01/20 02:45:29 INFO Configuration.deprecation: mapred.reduce.tasks is
> deprecated. Instead, use mapreduce.job.reduces
> 16/01/20 02:45:29 INFO zookeeper.RecoverableZooKeeper: Process
> identifier=hconnection-0x60a2630a connecting to ZooKeeper
> ensemble=localhost:2181
> 16/01/20 02:45:29 INFO zookeeper.ZooKeeper: Client
> environment:zookeeper.version=3.4.6-1569965, built on 02/20/2014 09:09 GMT
> 16/01/20 02:45:29 INFO zookeeper.ZooKeeper: Client environment:host.name
> =cism479
> 16/01/20 02:45:29 INFO zookeeper.ZooKeeper: Client
> environment:java.version=1.8.0_65
> 16/01/20 02:45:29 INFO zookeeper.ZooKeeper: Client
> environment:java.vendor=Oracle Corporation
> 16/01/20 02:45:29 INFO zookeeper.ZooKeeper: Client
> environment:java.home=/usr/lib/jvm/jdk1.8.0_65/jre
> 16/01/20 02:45:35 INFO zookeeper.ClientCnxn: EventThread shut down
> 16/01/20 02:45:35 INFO mapreduce.JobSubmitter: number of splits:2
> 16/01/20 02:45:36 INFO mapreduce.JobSubmitter: Submitting tokens for job:
> job_1453210838763_0011
> 16/01/20 02:45:36 INFO impl.YarnClientImpl: Submitted application
> application_1453210838763_0011
> 16/01/20 02:45:36 INFO mapreduce.Job: The url to track the job:
> http://cism479:8088/proxy/application_1453210838763_0011/
> 16/01/20 02:45:36 INFO mapreduce.Job: Running job: job_1453210838763_0011
> 16/01/20 02:45:48 INFO mapreduce.Job: Job job_1453210838763_0011 running
> in uber mode : false
> 16/01/20 02:45:48 INFO mapreduce.Job:  map 0% reduce 0%
> 16/01/20 02:47:31 INFO mapreduce.Job:  map 33% reduce 0%
> 16/01/20 02:47:47 INFO mapreduce.Job:  map 50% reduce 0%
> 16/01/20 02:48:08 INFO mapreduce.Job:  map 83% reduce 0%
> 16/01/20 02:48:16 INFO mapreduce.Job:  map 100% reduce 0%
> 16/01/20 02:48:31 INFO mapreduce.Job:  map 100% reduce 31%
> 16/01/20 02:48:34 INFO mapreduce.Job:  map 100% reduce 33%
> 16/01/20 02:50:30 INFO mapreduce.Job:  map 100% reduce 34%
> 16/01/20 03:01:18 INFO mapreduce.Job:  map 100% reduce 35%
> 16/01/20 03:11:58 INFO mapreduce.Job:  map 100% reduce 36%
> 16/01/20 03:22:50 INFO mapreduce.Job:  map 100% reduce 37%
> 16/01/20 03:24:22 INFO mapreduce.Job:  map 100% reduce 50%
> 16/01/20 03:24:35 INFO mapreduce.Job:  map 100% reduce 82%
> 16/01/20 03:24:38 INFO mapreduce.Job:  map 100% reduce 83%
> 16/01/20 03:26:33 INFO mapreduce.Job:  map 100% reduce 84%
> 16/01/20 03:37:35 INFO mapreduce.Job:  map 100% reduce 85%
> 16/01/20 03:39:38 INFO mapreduce.Job: Task Id :
> attempt_1453210838763_0011_r_000001_0, Status : FAILED
> *Error: java.lang.IllegalArgumentException: Row length 41221 is > 32767*
>     at org.apache.hadoop.hbase.client.Mutation.checkRow(Mutation.java:506)
>     at org.apache.hadoop.hbase.client.Mutation.checkRow(Mutation.java:487)
>     at org.apache.hadoop.hbase.client.Get.<init>(Get.java:89)
>     at org.apache.gora.hbase.store.HBaseStore.get(HBaseStore.java:208)
>     at org.apache.gora.hbase.store.HBaseStore.get(HBaseStore.java:79)
>     at org.apache.gora.store.impl.DataStoreBase.get(DataStoreBase.java:156)
>     at org.apache.gora.store.impl.DataStoreBase.get(DataStoreBase.java:56)
>     at
> org.apache.nutch.crawl.DbUpdateReducer.reduce(DbUpdateReducer.java:114)
>     at
> org.apache.nutch.crawl.DbUpdateReducer.reduce(DbUpdateReducer.java:42)
>     at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
>     at
> org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
>     at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
>     at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
>
> 16/01/20 03:39:39 INFO mapreduce.Job:  map 100% reduce 50%
> 16/01/20 03:39:52 INFO mapreduce.Job:  map 100% reduce 82%
> 16/01/20 03:39:55 INFO mapreduce.Job:  map 100% reduce 83%
> 16/01/20 03:41:56 INFO mapreduce.Job:  map 100% reduce 84%
> 16/01/20 03:53:39 INFO mapreduce.Job:  map 100% reduce 85%
> 16/01/20 03:55:49 INFO mapreduce.Job: Task Id :
> attempt_1453210838763_0011_r_000001_1, Status : FAILED
> *Error: java.lang.IllegalArgumentException: Row length 41221 is > 32767*
>     at org.apache.hadoop.hbase.client.Mutation.checkRow(Mutation.java:506)
>     at org.apache.hadoop.hbase.client.Mutation.checkRow(Mutation.java:487)
>     at org.apache.hadoop.hbase.client.Get.<init>(Get.java:89)
>     at org.apache.gora.hbase.store.HBaseStore.get(HBaseStore.java:208)
>     at org.apache.gora.hbase.store.HBaseStore.get(HBaseStore.java:79)
>     at org.apache.gora.store.impl.DataStoreBase.get(DataStoreBase.java:156)
>     at org.apache.gora.store.impl.DataStoreBase.get(DataStoreBase.java:56)
>     at
> org.apache.nutch.crawl.DbUpdateReducer.reduce(DbUpdateReducer.java:114)
>     at
> org.apache.nutch.crawl.DbUpdateReducer.reduce(DbUpdateReducer.java:42)
>     at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
>     at
> org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
>     at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
>     at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
>
> 16/01/20 03:55:50 INFO mapreduce.Job:  map 100% reduce 50%
> 16/01/20 03:56:01 INFO mapreduce.Job:  map 100% reduce 83%
> 16/01/20 03:58:02 INFO mapreduce.Job:  map 100% reduce 84%
> 16/01/20 04:10:09 INFO mapreduce.Job:  map 100% reduce 85%
> 16/01/20 04:12:33 INFO mapreduce.Job: Task Id :
> attempt_1453210838763_0011_r_000001_2, Status : FAILED
> *Error: java.lang.IllegalArgumentException: Row length 41221 is > 32767*
>     at org.apache.hadoop.hbase.client.Mutation.checkRow(Mutation.java:506)
>     at org.apache.hadoop.hbase.client.Mutation.checkRow(Mutation.java:487)
>     at org.apache.hadoop.hbase.client.Get.<init>(Get.java:89)
>     at org.apache.gora.hbase.store.HBaseStore.get(HBaseStore.java:208)
>     at org.apache.gora.hbase.store.HBaseStore.get(HBaseStore.java:79)
>     at org.apache.gora.store.impl.DataStoreBase.get(DataStoreBase.java:156)
>     at org.apache.gora.store.impl.DataStoreBase.get(DataStoreBase.java:56)
>     at
> org.apache.nutch.crawl.DbUpdateReducer.reduce(DbUpdateReducer.java:114)
>     at
> org.apache.nutch.crawl.DbUpdateReducer.reduce(DbUpdateReducer.java:42)
>     at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
>     at
> org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
>     at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
>     at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
>
> 16/01/20 04:12:34 INFO mapreduce.Job:  map 100% reduce 50%
> 16/01/20 04:12:45 INFO mapreduce.Job:  map 100% reduce 82%
> 16/01/20 04:12:48 INFO mapreduce.Job:  map 100% reduce 83%
> 16/01/20 04:14:46 INFO mapreduce.Job:  map 100% reduce 84%
> 16/01/20 04:26:53 INFO mapreduce.Job:  map 100% reduce 85%
> 16/01/20 04:29:09 INFO mapreduce.Job:  map 100% reduce 100%
> 16/01/20 04:29:10 INFO mapreduce.Job: Job job_1453210838763_0011 failed
> with state FAILED due to: Task failed task_1453210838763_0011_r_000001
> Job failed as tasks failed. failedMaps:0 failedReduces:1
>
> 16/01/20 04:29:11 INFO mapreduce.Job: Counters: 50
>     File System Counters
>         FILE: Number of bytes read=38378343
>         FILE: Number of bytes written=115957636
>         FILE: Number of read operations=0
>         FILE: Number of large read operations=0
>         FILE: Number of write operations=0
>         HDFS: Number of bytes read=2382
>         HDFS: Number of bytes written=0
>         HDFS: Number of read operations=2
>         HDFS: Number of large read operations=0
>         HDFS: Number of write operations=0
>     Job Counters
>         Failed reduce tasks=4
>         Launched map tasks=2
>         Launched reduce tasks=5
>         Data-local map tasks=2
>         Total time spent by all maps in occupied slots (ms)=789909
>         Total time spent by all reduces in occupied slots (ms)=30215090
>         Total time spent by all map tasks (ms)=263303
>         Total time spent by all reduce tasks (ms)=6043018
>         Total vcore-seconds taken by all map tasks=263303
>         Total vcore-seconds taken by all reduce tasks=6043018
>         Total megabyte-seconds taken by all map tasks=808866816
>         Total megabyte-seconds taken by all reduce tasks=30940252160
>     Map-Reduce Framework
>         Map input records=49929
>         Map output records=1777904
>         Map output bytes=382773368
>         Map output materialized bytes=77228942
>         Input split bytes=2382
>         Combine input records=0
>         Combine output records=0
>         Reduce input groups=754170
>         Reduce shuffle bytes=38318183
>         Reduce input records=881156
>         Reduce output records=754170
>         Spilled Records=2659060
>         Shuffled Maps =2
>         Failed Shuffles=0
>         Merged Map outputs=2
>         GC time elapsed (ms)=17993
>         CPU time spent (ms)=819690
>         Physical memory (bytes) snapshot=4080136192
>         Virtual memory (bytes) snapshot=15234293760
>         Total committed heap usage (bytes)=4149739520
>     Shuffle Errors
>         BAD_ID=0
>         CONNECTION=0
>         IO_ERROR=0
>         WRONG_LENGTH=0
>         WRONG_MAP=0
>         WRONG_REDUCE=0
>     File Input Format Counters
>         Bytes Read=0
>     File Output Format Counters
>         Bytes Written=0
> Exception in thread "main" java.lang.RuntimeException: job failed:
> name=[1]update-table, jobid=job_1453210838763_0011
>     at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:120)
>     at org.apache.nutch.crawl.DbUpdaterJob.run(DbUpdaterJob.java:111)
>     at
> org.apache.nutch.crawl.DbUpdaterJob.updateTable(DbUpdaterJob.java:140)
>     at org.apache.nutch.crawl.DbUpdaterJob.run(DbUpdaterJob.java:174)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>     at org.apache.nutch.crawl.DbUpdaterJob.main(DbUpdaterJob.java:178)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:497)
>     at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
> Error running:
> /usr/share/searchEngine/nutch-branch-2.3.1/runtime/deploy/bin/nutch
> updatedb -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
> mapred.reduce.tasks.speculative.execution=false -D
> mapred.map.tasks.speculative.execution=false -D
> mapred.compress.map.output=true 1453230757-13191 -crawlId 1
> Failed with exit value 1.
> **************************************** LOG END ****************************************
>
> Please advise.
>
