Re: ArrayIndexOutOfBoundsException when reading bzip2 files
Can you paste the piece of code!?

Thanks
Best Regards

On Mon, Jun 9, 2014 at 5:24 PM, MEETHU MATHEW meethu2...@yahoo.co.in wrote:

Hi,

I am getting an ArrayIndexOutOfBoundsException while reading from bz2 files in HDFS. I have come across the same issue in JIRA at https://issues.apache.org/jira/browse/SPARK-1861, but it seems to be resolved. I have tried the suggested workaround (SPARK_WORKER_CORES=1), but it's still showing the error. What may be the possible reason that I am getting the same error again?

I am using Spark 1.0.0 with Hadoop 1.2.1.

java.lang.ArrayIndexOutOfBoundsException: 90
    at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.getAndMoveToFrontDecode(CBZip2InputStream.java:897)
    at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.initBlock(CBZip2InputStream.java:499)
    at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.changeStateToProcessABlock(CBZip2InputStream.java:330)
    at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.read(CBZip2InputStream.java:394)
    at org.apache.hadoop.io.compress.BZip2Codec$BZip2CompressionInputStream.read(BZip2Codec.java:422)
    at java.io.InputStream.read(InputStream.java:101)
    at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:205)
    at org.apache.hadoop.util.LineReader.readLine(LineReader.java:169)
    at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:176)
    at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:43)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:198)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:181)
    at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:303)
    at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:200)
    at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175)
    at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
    at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:174)

Thanks & Regards,
Meethu M
Re: ArrayIndexOutOfBoundsException when reading bzip2 files
Hi Akhil,

Please find the code below.

    from operator import add

    x = sc.textFile("hdfs:///**")
    # Keep only lines whose first and fourth comma-separated fields are not a single space.
    x = x.filter(lambda z: z.split(",")[0] != ' ')
    x = x.filter(lambda z: z.split(",")[3] != ' ')
    z = x.reduce(add)

Thanks & Regards,
Meethu M

On Monday, 9 June 2014 5:52 PM, Akhil Das ak...@sigmoidanalytics.com wrote:
Can you paste the piece of code!?
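For readers who want to check the filter logic separately from the bzip2 read, the following is a minimal sketch in plain Python with no Spark involved. It assumes the fields are comma-separated, as in the posted code. The helper name `keep_line` and the sample rows are hypothetical and only illustrate what the two `filter` calls and `reduce(add)` do; they say nothing about the cause of the exception itself.

```python
from functools import reduce
from operator import add


def keep_line(line):
    """Mirror the two filters: keep a line only if fields 0 and 3 are not a single space."""
    fields = line.split(",")
    return fields[0] != " " and fields[3] != " "


# Hypothetical sample rows; the real input lives in the bz2 files on HDFS.
sample_lines = [
    "a,b,c,d",
    " ,b,c,d",  # dropped: first field is a single space
    "a,b,c, ",  # dropped: fourth field is a single space
    "e,f,g,h",
]

kept = [line for line in sample_lines if keep_line(line)]

# add on two strings concatenates them, which is what reduce(add) does to lines.
combined = reduce(add, kept)

print(kept)
print(combined)
```

Note that a line with fewer than four fields raises an IndexError in both this sketch and the posted lambdas. Spark also expects the function passed to `reduce` to be commutative and associative, so the order of string concatenation on a cluster may differ from this local run.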
Re: ArrayIndexOutOfBoundsException when reading bzip2 files
Have a search online / at the Spark JIRA. This was a known upstream bug in Hadoop.
https://issues.apache.org/jira/browse/SPARK-1861
Re: ArrayIndexOutOfBoundsException when reading bzip2 files
Hi Sean,

Thank you for the fast response.

Thanks & Regards,
Meethu M

On Monday, 9 June 2014 6:04 PM, Sean Owen so...@cloudera.com wrote:
Have a search online / at the Spark JIRA. This was a known upstream bug in Hadoop. https://issues.apache.org/jira/browse/SPARK-1861
Re: ArrayIndexOutOfBoundsException when reading bzip2 files
Any idea when they will release it? Also, I'm not sure what we will need to do to fix the shell. Will we have to reinstall Spark, or reinstall Hadoop? (I'm not a devops person, so maybe this question sounds silly.)

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/ArrayIndexOutOfBoundsException-when-reading-bzip2-files-tp7237p7263.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.