Hi Hao,

Could you please post the corresponding code? Are you using textFile or sc.parallelize?
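For context, the two calls behave quite differently here: textFile reads and decompresses the files on the executors (which is where the bzip2 codec in the stack trace below gets invoked), while sc.parallelize only distributes an already-loaded in-memory collection. A minimal sketch of the textFile pattern — the bucket, path, and app name are placeholders, not taken from this thread:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical setup for illustration; not the reporter's actual code.
val conf = new SparkConf().setAppName("bz2-read-sketch")
val sc = new SparkContext(conf)

// textFile opens each matching .bz2 file and decompresses it on the
// executors via Hadoop's BZip2Codec/CBZip2InputStream — the same code
// path that appears in the stack trace below.
val lines = sc.textFile("s3n://some-bucket/some-prefix/*.bz2")
println(lines.count())
```

If the failure only reproduces through textFile and not on a locally parallelized sample of the same data, that would point at the decompression/read path rather than the records themselves.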
On Mon, Feb 1, 2016 at 2:36 PM Lin, Hao <hao....@finra.org> wrote:

> When I tried to read multiple bz2 files from s3, I have the following
> warning messages. What is the problem here?
>
> 16/02/01 22:30:30 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 10.162.67.248): java.lang.ArrayIndexOutOfBoundsException: -1844424343
>         at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.getAndMoveToFrontDecode0(CBZip2InputStream.java:1014)
>         at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.getAndMoveToFrontDecode(CBZip2InputStream.java:829)
>         at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.initBlock(CBZip2InputStream.java:504)
>         at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.changeStateToProcessABlock(CBZip2InputStream.java:333)
>         at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.read(CBZip2InputStream.java:399)
>         at org.apache.hadoop.io.compress.BZip2Codec$BZip2CompressionInputStream.read(BZip2Codec.java:483)
>         at java.io.InputStream.read(InputStream.java:101)
>         at org.apache.hadoop.mapreduce.lib.input.CompressedSplitLineReader.fillBuffer(CompressedSplitLineReader.java:130)
>         at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
>         at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
>         at org.apache.hadoop.mapreduce.lib.input.CompressedSplitLineReader.readLine(CompressedSplitLineReader.java:159)
>         at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:209)
>         at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:47)
>         at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:248)
>         at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:216)
>         at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>         at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>         at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1553)
>         at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1125)
>         at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1125)
>         at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
>         at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>         at org.apache.spark.scheduler.Task.run(Task.scala:88)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)