The maximum row key length in HBase is 32767 bytes (Short.MAX_VALUE, enforced in Mutation.checkRow), and your application is trying to read a row whose key exceeds that limit. In Nutch 2.x the row key is the reversed URL, so this almost certainly means your crawl has picked up a URL longer than 32767 characters.
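If the oversized keys come from a handful of junk URLs (session-ID explosions, redirect loops, and the like), the usual workaround is to reject such URLs before they ever become row keys. A sketch, assuming the stock urlfilter-regex plugin is enabled in your plugin.includes, and that a 2000-character cutoff (my arbitrary pick, well under the 32767-byte limit) suits your crawl; it would go in conf/regex-urlfilter.txt above the final accept rule:

```
# Reject any URL longer than ~2000 characters so its reversed form can
# never exceed HBase's 32767-byte row-key limit. The cutoff of 2000 is
# an assumption; pick whatever fits your crawl.
-^.{2000,}
```

Note that URLs already stored in the webpage table may still trip the check until they are removed, so you may also need to clean up existing oversized rows.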
You would probably do better asking this question on the Gora or Nutch user list.

-Vlad

On Thu, Jan 21, 2016 at 5:39 AM, Kshitij Shukla <kshiti...@cisinlabs.com> wrote:
> Hello everyone,
>
> Software stack: nutch-branch-2.3.1, gora-hbase 0.6.1, Hadoop 2.5.2,
> hbase-0.98.8-hadoop2
>
> I added a set of seeds to crawl using this command:
>
>     ./bin/crawl /largeSeeds 1 http://localhost:8983/solr/ddcd 4
>
> In the first iteration all of the steps (inject, generate, fetch, parse,
> update-table, index, delete duplicates) executed successfully.
> In the second iteration the "CrawlDB update" step failed (please see the
> error log below for reference); because of that failure the whole process
> terminates.
>
> **************************** LOG START ****************************
> 16/01/20 02:45:19 INFO parse.ParserJob: ParserJob: finished at 2016-01-20 02:45:19, time elapsed: 00:06:57
> CrawlDB update for 1
> /usr/share/searchEngine/nutch-branch-2.3.1/runtime/deploy/bin/nutch updatedb -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true 1453230757-13191 -crawlId 1
> 16/01/20 02:45:27 INFO crawl.DbUpdaterJob: DbUpdaterJob: starting at 2016-01-20 02:45:27
> 16/01/20 02:45:27 INFO crawl.DbUpdaterJob: DbUpdaterJob: batchId: 1453230757-13191
> 16/01/20 02:45:27 INFO plugin.PluginRepository: Plugins: looking in: /tmp/hadoop-root/hadoop-unjar5654418190157422003/classes/plugins
> 16/01/20 02:45:28 INFO plugin.PluginRepository: Plugin Auto-activation mode: [true]
> 16/01/20 02:45:28 INFO plugin.PluginRepository: Registered Plugins:
> 16/01/20 02:45:28 INFO plugin.PluginRepository:         HTTP Framework (lib-http)
> 16/01/20 02:45:28 INFO plugin.PluginRepository:         Html Parse Plug-in (parse-html)
> 16/01/20 02:45:28 INFO plugin.PluginRepository:         MetaTags (parse-metatags)
> 16/01/20 02:45:28 INFO plugin.PluginRepository:         the nutch core extension points (nutch-extensionpoints)
> 16/01/20 02:45:28 INFO plugin.PluginRepository:         Basic Indexing Filter (index-basic)
> 16/01/20 02:45:28 INFO plugin.PluginRepository:         XML Libraries (lib-xml)
> 16/01/20 02:45:28 INFO plugin.PluginRepository:         Anchor Indexing Filter (index-anchor)
> 16/01/20 02:45:28 INFO plugin.PluginRepository:         Basic URL Normalizer (urlnormalizer-basic)
> 16/01/20 02:45:28 INFO plugin.PluginRepository:         Language Identification Parser/Filter (language-identifier)
> 16/01/20 02:45:28 INFO plugin.PluginRepository:         Metadata Indexing Filter (index-metadata)
> 16/01/20 02:45:28 INFO plugin.PluginRepository:         CyberNeko HTML Parser (lib-nekohtml)
> 16/01/20 02:45:28 INFO plugin.PluginRepository:         Subcollection indexing and query filter (subcollection)
> 16/01/20 02:45:28 INFO plugin.PluginRepository:         SOLRIndexWriter (indexer-solr)
> 16/01/20 02:45:28 INFO plugin.PluginRepository:         Rel-Tag microformat Parser/Indexer/Querier (microformats-reltag)
> 16/01/20 02:45:28 INFO plugin.PluginRepository:         Http / Https Protocol Plug-in (protocol-httpclient)
> 16/01/20 02:45:28 INFO plugin.PluginRepository:         JavaScript Parser (parse-js)
> 16/01/20 02:45:28 INFO plugin.PluginRepository:         Tika Parser Plug-in (parse-tika)
> 16/01/20 02:45:28 INFO plugin.PluginRepository:         Top Level Domain Plugin (tld)
> 16/01/20 02:45:28 INFO plugin.PluginRepository:         Regex URL Filter Framework (lib-regex-filter)
> 16/01/20 02:45:28 INFO plugin.PluginRepository:         Regex URL Normalizer (urlnormalizer-regex)
> 16/01/20 02:45:28 INFO plugin.PluginRepository:         Link Analysis Scoring Plug-in (scoring-link)
> 16/01/20 02:45:28 INFO plugin.PluginRepository:         OPIC Scoring Plug-in (scoring-opic)
> 16/01/20 02:45:28 INFO plugin.PluginRepository:         More Indexing Filter (index-more)
> 16/01/20 02:45:28 INFO plugin.PluginRepository:         Http Protocol Plug-in (protocol-http)
> 16/01/20 02:45:28 INFO plugin.PluginRepository:         Creative Commons Plugins (creativecommons)
> 16/01/20 02:45:28 INFO plugin.PluginRepository: Registered Extension-Points:
> 16/01/20 02:45:28 INFO plugin.PluginRepository:         Parse Filter (org.apache.nutch.parse.ParseFilter)
> 16/01/20 02:45:28 INFO plugin.PluginRepository:         Nutch Index Cleaning Filter (org.apache.nutch.indexer.IndexCleaningFilter)
> 16/01/20 02:45:28 INFO plugin.PluginRepository:         Nutch Content Parser (org.apache.nutch.parse.Parser)
> 16/01/20 02:45:28 INFO plugin.PluginRepository:         Nutch URL Filter (org.apache.nutch.net.URLFilter)
> 16/01/20 02:45:28 INFO plugin.PluginRepository:         Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
> 16/01/20 02:45:28 INFO plugin.PluginRepository:         Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
> 16/01/20 02:45:28 INFO plugin.PluginRepository:         Nutch Protocol (org.apache.nutch.protocol.Protocol)
> 16/01/20 02:45:28 INFO plugin.PluginRepository:         Nutch Index Writer (org.apache.nutch.indexer.IndexWriter)
> 16/01/20 02:45:28 INFO plugin.PluginRepository:         Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
> 16/01/20 02:45:29 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
> 16/01/20 02:45:29 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
> 16/01/20 02:45:29 INFO Configuration.deprecation: mapred.compress.map.output is deprecated. Instead, use mapreduce.map.output.compress
> 16/01/20 02:45:29 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
> 16/01/20 02:45:29 INFO zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x60a2630a connecting to ZooKeeper ensemble=localhost:2181
> 16/01/20 02:45:29 INFO zookeeper.ZooKeeper: Client environment:zookeeper.version=3.4.6-1569965, built on 02/20/2014 09:09 GMT
> 16/01/20 02:45:29 INFO zookeeper.ZooKeeper: Client environment:host.name=cism479
> 16/01/20 02:45:29 INFO zookeeper.ZooKeeper: Client environment:java.version=1.8.0_65
> 16/01/20 02:45:29 INFO zookeeper.ZooKeeper: Client environment:java.vendor=Oracle Corporation
> 16/01/20 02:45:29 INFO zookeeper.ZooKeeper: Client environment:java.home=/usr/lib/jvm/jdk1.8.0_65/jre
> 16/01/20 02:45:35 INFO zookeeper.ClientCnxn: EventThread shut down
> 16/01/20 02:45:35 INFO mapreduce.JobSubmitter: number of splits:2
> 16/01/20 02:45:36 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1453210838763_0011
> 16/01/20 02:45:36 INFO impl.YarnClientImpl: Submitted application application_1453210838763_0011
> 16/01/20 02:45:36 INFO mapreduce.Job: The url to track the job: http://cism479:8088/proxy/application_1453210838763_0011/
> 16/01/20 02:45:36 INFO mapreduce.Job: Running job: job_1453210838763_0011
> 16/01/20 02:45:48 INFO mapreduce.Job: Job job_1453210838763_0011 running in uber mode : false
> 16/01/20 02:45:48 INFO mapreduce.Job:  map 0% reduce 0%
> 16/01/20 02:47:31 INFO mapreduce.Job:  map 33% reduce 0%
> 16/01/20 02:47:47 INFO mapreduce.Job:  map 50% reduce 0%
> 16/01/20 02:48:08 INFO mapreduce.Job:  map 83% reduce 0%
> 16/01/20 02:48:16 INFO mapreduce.Job:  map 100% reduce 0%
> 16/01/20 02:48:31 INFO mapreduce.Job:  map 100% reduce 31%
> 16/01/20 02:48:34 INFO mapreduce.Job:  map 100% reduce 33%
> 16/01/20 02:50:30 INFO mapreduce.Job:  map 100% reduce 34%
> 16/01/20 03:01:18 INFO mapreduce.Job:  map 100% reduce 35%
> 16/01/20 03:11:58 INFO mapreduce.Job:  map 100% reduce 36%
> 16/01/20 03:22:50 INFO mapreduce.Job:  map 100% reduce 37%
> 16/01/20 03:24:22 INFO mapreduce.Job:  map 100% reduce 50%
> 16/01/20 03:24:35 INFO mapreduce.Job:  map 100% reduce 82%
> 16/01/20 03:24:38 INFO mapreduce.Job:  map 100% reduce 83%
> 16/01/20 03:26:33 INFO mapreduce.Job:  map 100% reduce 84%
> 16/01/20 03:37:35 INFO mapreduce.Job:  map 100% reduce 85%
> 16/01/20 03:39:38 INFO mapreduce.Job: Task Id : attempt_1453210838763_0011_r_000001_0, Status : FAILED
> Error: java.lang.IllegalArgumentException: Row length 41221 is > 32767
>         at org.apache.hadoop.hbase.client.Mutation.checkRow(Mutation.java:506)
>         at org.apache.hadoop.hbase.client.Mutation.checkRow(Mutation.java:487)
>         at org.apache.hadoop.hbase.client.Get.<init>(Get.java:89)
>         at org.apache.gora.hbase.store.HBaseStore.get(HBaseStore.java:208)
>         at org.apache.gora.hbase.store.HBaseStore.get(HBaseStore.java:79)
>         at org.apache.gora.store.impl.DataStoreBase.get(DataStoreBase.java:156)
>         at org.apache.gora.store.impl.DataStoreBase.get(DataStoreBase.java:56)
>         at org.apache.nutch.crawl.DbUpdateReducer.reduce(DbUpdateReducer.java:114)
>         at org.apache.nutch.crawl.DbUpdateReducer.reduce(DbUpdateReducer.java:42)
>         at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
>         at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
>         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
>         at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
>         at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
>
> 16/01/20 03:39:39 INFO mapreduce.Job:  map 100% reduce 50%
> 16/01/20 03:39:52 INFO mapreduce.Job:  map 100% reduce 82%
> 16/01/20 03:39:55 INFO mapreduce.Job:  map 100% reduce 83%
> 16/01/20 03:41:56 INFO mapreduce.Job:  map 100% reduce 84%
> 16/01/20 03:53:39 INFO mapreduce.Job:  map 100% reduce 85%
> 16/01/20 03:55:49 INFO mapreduce.Job: Task Id : attempt_1453210838763_0011_r_000001_1, Status : FAILED
> Error: java.lang.IllegalArgumentException: Row length 41221 is > 32767
>         at org.apache.hadoop.hbase.client.Mutation.checkRow(Mutation.java:506)
>         at org.apache.hadoop.hbase.client.Mutation.checkRow(Mutation.java:487)
>         at org.apache.hadoop.hbase.client.Get.<init>(Get.java:89)
>         at org.apache.gora.hbase.store.HBaseStore.get(HBaseStore.java:208)
>         at org.apache.gora.hbase.store.HBaseStore.get(HBaseStore.java:79)
>         at org.apache.gora.store.impl.DataStoreBase.get(DataStoreBase.java:156)
>         at org.apache.gora.store.impl.DataStoreBase.get(DataStoreBase.java:56)
>         at org.apache.nutch.crawl.DbUpdateReducer.reduce(DbUpdateReducer.java:114)
>         at org.apache.nutch.crawl.DbUpdateReducer.reduce(DbUpdateReducer.java:42)
>         at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
>         at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
>         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
>         at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
>         at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
>
> 16/01/20 03:55:50 INFO mapreduce.Job:  map 100% reduce 50%
> 16/01/20 03:56:01 INFO mapreduce.Job:  map 100% reduce 83%
> 16/01/20 03:58:02 INFO mapreduce.Job:  map 100% reduce 84%
> 16/01/20 04:10:09 INFO mapreduce.Job:  map 100% reduce 85%
> 16/01/20 04:12:33 INFO mapreduce.Job: Task Id : attempt_1453210838763_0011_r_000001_2, Status : FAILED
> Error: java.lang.IllegalArgumentException: Row length 41221 is > 32767
>         at org.apache.hadoop.hbase.client.Mutation.checkRow(Mutation.java:506)
>         at org.apache.hadoop.hbase.client.Mutation.checkRow(Mutation.java:487)
>         at org.apache.hadoop.hbase.client.Get.<init>(Get.java:89)
>         at org.apache.gora.hbase.store.HBaseStore.get(HBaseStore.java:208)
>         at org.apache.gora.hbase.store.HBaseStore.get(HBaseStore.java:79)
>         at org.apache.gora.store.impl.DataStoreBase.get(DataStoreBase.java:156)
>         at org.apache.gora.store.impl.DataStoreBase.get(DataStoreBase.java:56)
>         at org.apache.nutch.crawl.DbUpdateReducer.reduce(DbUpdateReducer.java:114)
>         at org.apache.nutch.crawl.DbUpdateReducer.reduce(DbUpdateReducer.java:42)
>         at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
>         at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
>         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
>         at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
>         at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
>
> 16/01/20 04:12:34 INFO mapreduce.Job:  map 100% reduce 50%
> 16/01/20 04:12:45 INFO mapreduce.Job:  map 100% reduce 82%
> 16/01/20 04:12:48 INFO mapreduce.Job:  map 100% reduce 83%
> 16/01/20 04:14:46 INFO mapreduce.Job:  map 100% reduce 84%
> 16/01/20 04:26:53 INFO mapreduce.Job:  map 100% reduce 85%
> 16/01/20 04:29:09 INFO mapreduce.Job:  map 100% reduce 100%
> 16/01/20 04:29:10 INFO mapreduce.Job: Job job_1453210838763_0011 failed with state FAILED due to: Task failed task_1453210838763_0011_r_000001
> Job failed as tasks failed. failedMaps:0 failedReduces:1
>
> 16/01/20 04:29:11 INFO mapreduce.Job: Counters: 50
>         File System Counters
>                 FILE: Number of bytes read=38378343
>                 FILE: Number of bytes written=115957636
>                 FILE: Number of read operations=0
>                 FILE: Number of large read operations=0
>                 FILE: Number of write operations=0
>                 HDFS: Number of bytes read=2382
>                 HDFS: Number of bytes written=0
>                 HDFS: Number of read operations=2
>                 HDFS: Number of large read operations=0
>                 HDFS: Number of write operations=0
>         Job Counters
>                 Failed reduce tasks=4
>                 Launched map tasks=2
>                 Launched reduce tasks=5
>                 Data-local map tasks=2
>                 Total time spent by all maps in occupied slots (ms)=789909
>                 Total time spent by all reduces in occupied slots (ms)=30215090
>                 Total time spent by all map tasks (ms)=263303
>                 Total time spent by all reduce tasks (ms)=6043018
>                 Total vcore-seconds taken by all map tasks=263303
>                 Total vcore-seconds taken by all reduce tasks=6043018
>                 Total megabyte-seconds taken by all map tasks=808866816
>                 Total megabyte-seconds taken by all reduce tasks=30940252160
>         Map-Reduce Framework
>                 Map input records=49929
>                 Map output records=1777904
>                 Map output bytes=382773368
>                 Map output materialized bytes=77228942
>                 Input split bytes=2382
>                 Combine input records=0
>                 Combine output records=0
>                 Reduce input groups=754170
>                 Reduce shuffle bytes=38318183
>                 Reduce input records=881156
>                 Reduce output records=754170
>                 Spilled Records=2659060
>                 Shuffled Maps =2
>                 Failed Shuffles=0
>                 Merged Map outputs=2
>                 GC time elapsed (ms)=17993
>                 CPU time spent (ms)=819690
>                 Physical memory (bytes) snapshot=4080136192
>                 Virtual memory (bytes) snapshot=15234293760
>                 Total committed heap usage (bytes)=4149739520
>         Shuffle Errors
>                 BAD_ID=0
>                 CONNECTION=0
>                 IO_ERROR=0
>                 WRONG_LENGTH=0
>                 WRONG_MAP=0
>                 WRONG_REDUCE=0
>         File Input Format Counters
>                 Bytes Read=0
>         File Output Format Counters
>                 Bytes Written=0
> Exception in thread "main" java.lang.RuntimeException: job failed: name=[1]update-table, jobid=job_1453210838763_0011
>         at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:120)
>         at org.apache.nutch.crawl.DbUpdaterJob.run(DbUpdaterJob.java:111)
>         at org.apache.nutch.crawl.DbUpdaterJob.updateTable(DbUpdaterJob.java:140)
>         at org.apache.nutch.crawl.DbUpdaterJob.run(DbUpdaterJob.java:174)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>         at org.apache.nutch.crawl.DbUpdaterJob.main(DbUpdaterJob.java:178)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:497)
>         at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
> Error running:
> /usr/share/searchEngine/nutch-branch-2.3.1/runtime/deploy/bin/nutch updatedb -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true 1453230757-13191 -crawlId 1
> Failed with exit value 1.
> ***************************** LOG END *****************************
>
> Please advise.
>
> --
> Cyber Infrastructure (P) Limited, [CIS] (CMMI Level 3 Certified)
> Central India's largest Technology company.
> Ensuring the success of our clients and partners through our highly
> optimized Technology solutions.
> www.cisin.com | +Cisin <https://plus.google.com/+Cisin/> | Linkedin
> <https://www.linkedin.com/company/cyber-infrastructure-private-limited>
> Offices: Indore, India. Singapore. Silicon Valley, USA.