RE: try to read multiple bz2 files in s3

2016-02-02 Thread Lin, Hao
Hi Xiangrui,

For the following problem, I found out an issue ticket you posted before 
https://issues.apache.org/jira/browse/HADOOP-10614

I wonder if this has been fixed in Spark 1.5.2 which I believe so. Any 
suggestion on how to fix it?

Thanks
Hao


From: Lin, Hao [mailto:hao@finra.org]
Sent: Tuesday, February 02, 2016 10:38 AM
To: Robert Collich; user
Subject: RE: try to read multiple bz2 files in s3

Hi Robert,

I just use textFile. Here is the simple code:

val fs3File=sc.textFile("s3n://my bucket/myfolder/")
fs3File.count

do you suggest I should use sc.parallelize?

many thanks

From: Robert Collich [mailto:rcoll...@gmail.com]
Sent: Monday, February 01, 2016 6:54 PM
To: Lin, Hao; user
Subject: Re: try to read multiple bz2 files in s3

Hi Hao,

Could you please post the corresponding code? Are you using textFile or 
sc.parallelize?

On Mon, Feb 1, 2016 at 2:36 PM Lin, Hao 
<hao@finra.org<mailto:hao@finra.org>> wrote:
When I tried to read multiple bz2 files from s3, I have the following warning 
messages. What is the problem here?

16/02/01 22:30:30 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 
10.162.67.248): java.lang.ArrayIndexOutOfBoundsException: -1844424343
at 
org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.getAndMoveToFrontDecode0(CBZip2InputStream.java:1014)
at 
org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.getAndMoveToFrontDecode(CBZip2InputStream.java:829)
at 
org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.initBlock(CBZip2InputStream.java:504)
at 
org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.changeStateToProcessABlock(CBZip2InputStream.java:333)
at 
org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.read(CBZip2InputStream.java:399)
at 
org.apache.hadoop.io.compress.BZip2Codec$BZip2CompressionInputStream.read(BZip2Codec.java:483)
at java.io.InputStream.read(InputStream.java:101)
at 
org.apache.hadoop.mapreduce.lib.input.CompressedSplitLineReader.fillBuffer(CompressedSplitLineReader.java:130)
at 
org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at 
org.apache.hadoop.mapreduce.lib.input.CompressedSplitLineReader.readLine(CompressedSplitLineReader.java:159)
at 
org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:209)
at 
org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:47)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:248)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:216)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1553)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1125)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1125)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Confidentiality Notice:: This email, including attachments, may include 
non-public, proprietary, confidential or legally privileged information. If you 
are not an intended recipient or an authorized agent of an intended recipient, 
you are hereby notified that any dissemination, distribution or copying of the 
information contained in or transmitted with this e-mail is unauthorized and 
strictly prohibited. If you have received this email in error, please notify 
the sender by replying to this message and permanently delete this e-mail, its 
attachments, and any copies of it immediately. You should not retain, copy or 
use this e-mail or any attachment for any purpose, nor disclose all or any part 
of the contents to any other person. Thank you.
Confidentiality Notice:: This email, including attachments, may include 
non-public, proprietary, confidential or legally privileged information. If you 
are not an intended recipient or an authorized agent of an intended recipient, 
you are hereby notified that any dissemination, distribution or copying of the 
information contained in or transmitted with this e-mail is unauthorized and 
strictly prohibited. If you have received this 

RE: try to read multiple bz2 files in s3

2016-02-02 Thread Lin, Hao
Hi Robert,

I just use textFile. Here is the simple code:

val fs3File=sc.textFile("s3n://my bucket/myfolder/")
fs3File.count

do you suggest I should use sc.parallelize?

many thanks

From: Robert Collich [mailto:rcoll...@gmail.com]
Sent: Monday, February 01, 2016 6:54 PM
To: Lin, Hao; user
Subject: Re: try to read multiple bz2 files in s3

Hi Hao,

Could you please post the corresponding code? Are you using textFile or 
sc.parallelize?

On Mon, Feb 1, 2016 at 2:36 PM Lin, Hao 
<hao@finra.org<mailto:hao@finra.org>> wrote:
When I tried to read multiple bz2 files from s3, I have the following warning 
messages. What is the problem here?

16/02/01 22:30:30 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 
10.162.67.248): java.lang.ArrayIndexOutOfBoundsException: -1844424343
at 
org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.getAndMoveToFrontDecode0(CBZip2InputStream.java:1014)
at 
org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.getAndMoveToFrontDecode(CBZip2InputStream.java:829)
at 
org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.initBlock(CBZip2InputStream.java:504)
at 
org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.changeStateToProcessABlock(CBZip2InputStream.java:333)
at 
org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.read(CBZip2InputStream.java:399)
at 
org.apache.hadoop.io.compress.BZip2Codec$BZip2CompressionInputStream.read(BZip2Codec.java:483)
at java.io.InputStream.read(InputStream.java:101)
at 
org.apache.hadoop.mapreduce.lib.input.CompressedSplitLineReader.fillBuffer(CompressedSplitLineReader.java:130)
at 
org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at 
org.apache.hadoop.mapreduce.lib.input.CompressedSplitLineReader.readLine(CompressedSplitLineReader.java:159)
at 
org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:209)
at 
org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:47)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:248)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:216)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1553)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1125)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1125)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Confidentiality Notice:: This email, including attachments, may include 
non-public, proprietary, confidential or legally privileged information. If you 
are not an intended recipient or an authorized agent of an intended recipient, 
you are hereby notified that any dissemination, distribution or copying of the 
information contained in or transmitted with this e-mail is unauthorized and 
strictly prohibited. If you have received this email in error, please notify 
the sender by replying to this message and permanently delete this e-mail, its 
attachments, and any copies of it immediately. You should not retain, copy or 
use this e-mail or any attachment for any purpose, nor disclose all or any part 
of the contents to any other person. Thank you.

Confidentiality Notice::  This email, including attachments, may include 
non-public, proprietary, confidential or legally privileged information.  If 
you are not an intended recipient or an authorized agent of an intended 
recipient, you are hereby notified that any dissemination, distribution or 
copying of the information contained in or transmitted with this e-mail is 
unauthorized and strictly prohibited.  If you have received this email in 
error, please notify the sender by replying to this message and permanently 
delete this e-mail, its attachments, and any copies of it immediately.  You 
should not retain, copy or use this e-mail or any attachment for any purpose, 
nor disclose all or any part of the contents to any other person. Thank you.


Re: try to read multiple bz2 files in s3

2016-02-01 Thread Robert Collich
Hi Hao,

Could you please post the corresponding code? Are you using textFile or
sc.parallelize?

On Mon, Feb 1, 2016 at 2:36 PM Lin, Hao  wrote:

> When I tried to read multiple bz2 files from s3, I have the following
> warning messages. What is the problem here?
>
>
>
> 16/02/01 22:30:30 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0,
> 10.162.67.248): java.lang.ArrayIndexOutOfBoundsException: -1844424343
>
> at
> org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.getAndMoveToFrontDecode0(CBZip2InputStream.java:1014)
>
> at
> org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.getAndMoveToFrontDecode(CBZip2InputStream.java:829)
>
> at
> org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.initBlock(CBZip2InputStream.java:504)
>
> at
> org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.changeStateToProcessABlock(CBZip2InputStream.java:333)
>
> at
> org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.read(CBZip2InputStream.java:399)
>
> at
> org.apache.hadoop.io.compress.BZip2Codec$BZip2CompressionInputStream.read(BZip2Codec.java:483)
>
> at java.io.InputStream.read(InputStream.java:101)
>
> at
> org.apache.hadoop.mapreduce.lib.input.CompressedSplitLineReader.fillBuffer(CompressedSplitLineReader.java:130)
>
> at
> org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
>
> at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
>
> at
> org.apache.hadoop.mapreduce.lib.input.CompressedSplitLineReader.readLine(CompressedSplitLineReader.java:159)
>
> at
> org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:209)
>
> at
> org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:47)
>
> at
> org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:248)
>
> at
> org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:216)
>
> at
> org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>
> at
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>
> at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1553)
>
> at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1125)
>
> at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1125)
>
> at
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
>
> at
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
>
> at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
>
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>
> at java.lang.Thread.run(Thread.java:745)
> Confidentiality Notice:: This email, including attachments, may include
> non-public, proprietary, confidential or legally privileged information. If
> you are not an intended recipient or an authorized agent of an intended
> recipient, you are hereby notified that any dissemination, distribution or
> copying of the information contained in or transmitted with this e-mail is
> unauthorized and strictly prohibited. If you have received this email in
> error, please notify the sender by replying to this message and permanently
> delete this e-mail, its attachments, and any copies of it immediately. You
> should not retain, copy or use this e-mail or any attachment for any
> purpose, nor disclose all or any part of the contents to any other person.
> Thank you.
>