Gourav,

I’ll start experimenting with Spark 2.1 to see if this works.

Cheers,
Ben


> On Feb 24, 2017, at 5:46 AM, Gourav Sengupta <gourav.sengu...@gmail.com> 
> wrote:
> 
> Hi Benjamin,
> 
> First of all, fetching data from S3 while running your code on an on-premise 
> system is a very bad idea. You might want to first copy the data into local 
> HDFS before running your code. Of course, this depends on the volume of data 
> and the internet speed that you have.
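> 
> A minimal sketch of that one-time copy using Hadoop's FileSystem API (hadoop 
> distcp does the same job from the command line); the bucket and HDFS paths 
> are placeholders, not taken from this thread:
> 
>     // Copy the Parquet data out of S3 into HDFS once, then read it locally.
>     import org.apache.hadoop.conf.Configuration
>     import org.apache.hadoop.fs.{FileUtil, Path}
> 
>     val conf = new Configuration()
>     val src  = new Path("s3a://my-bucket/data/events.parquet") // placeholder
>     val dst  = new Path("hdfs:///data/events.parquet")         // placeholder
>     FileUtil.copy(src.getFileSystem(conf), src,
>                   dst.getFileSystem(conf), dst,
>                   false /* deleteSource */, conf)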
> 
> The platform which processes your data at least 10 times faster is Spark 2.1. 
> And trust me, you do not want to be writing code which you will need to update 
> again in 6 months because newer versions of Spark have deprecated it.
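> 
> For concreteness, a minimal Spark 2.x sketch of such a read; SparkSession 
> replaces the SQLContext entry point from 1.6, and the path is again a 
> placeholder:
> 
>     import org.apache.spark.sql.SparkSession
> 
>     // Spark 2.x entry point, covering what SQLContext/HiveContext did in 1.6.
>     val spark = SparkSession.builder()
>       .appName("s3a-parquet-read")
>       .getOrCreate()
> 
>     val df = spark.read.parquet("s3a://my-bucket/data/events.parquet")
>     df.count()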
> 
> 
> Regards,
> Gourav Sengupta
> 
> 
> 
> On Fri, Feb 24, 2017 at 7:18 AM, Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> Hi Gourav,
> 
> My answers are below.
> 
> Cheers,
> Ben
> 
> 
>> On Feb 23, 2017, at 10:57 PM, Gourav Sengupta <gourav.sengu...@gmail.com 
>> <mailto:gourav.sengu...@gmail.com>> wrote:
>> 
>> Can I ask where you are running your CDH? Is it on premise, or have you 
>> created a cluster for yourself in AWS? Our cluster is on premise in our data 
>> center.
>> 
>> Also, I have really never seen s3a used before; that was common way back 
>> when writing S3 files took a long time, but I think that you are reading it.
>> 
>> Any idea why you are not migrating to Spark 2.1? Besides speed, there are 
>> lots of new APIs, and the existing ones are being deprecated. Therefore there 
>> is a very high chance that you are already working on code which is being 
>> deprecated by the Spark community right now. We use CDH and upgrade with 
>> whatever Spark version they include, which is 1.6.0. We are waiting for the 
>> move to Spark 2.0/2.1.
>> 
>> And besides that, would you not want to work on a platform which is at least 
>> 10 times faster? What would that be?
>> 
>> Regards,
>> Gourav Sengupta
>> 
>> On Thu, Feb 23, 2017 at 6:23 PM, Benjamin Kim <bbuil...@gmail.com 
>> <mailto:bbuil...@gmail.com>> wrote:
>> We are trying to use Spark 1.6 within CDH 5.7.1 to retrieve a 1.3GB Parquet 
>> file from AWS S3. We can read the schema and show some data when the file is 
>> loaded into a DataFrame, but when we try operations such as count, we get the 
>> error below.
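>> 
>> For reference, a sketch of that sequence in Spark 1.6 (the path is a 
>> placeholder; sqlContext is the entry point spark-shell provides):
>> 
>>     val df = sqlContext.read.parquet("s3a://my-bucket/data/file.parquet")
>>     df.printSchema() // works: schema comes from the Parquet footer
>>     df.show(5)       // works: small scan
>>     df.count()       // fails on the executors with the trace below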
>> 
>> com.cloudera.com.amazonaws.AmazonClientException: Unable to load AWS 
>> credentials from any provider in the chain
>>         at 
>> com.cloudera.com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
>>         at 
>> com.cloudera.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3779)
>>         at 
>> com.cloudera.com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1107)
>>         at 
>> com.cloudera.com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:1070)
>>         at 
>> org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:239)
>>         at 
>> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2711)
>>         at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:97)
>>         at 
>> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2748)
>>         at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2730)
>>         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:385)
>>         at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
>>         at 
>> parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:385)
>>         at 
>> parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:162)
>>         at 
>> parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:145)
>>         at 
>> org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.<init>(SqlNewHadoopRDD.scala:180)
>>         at 
>> org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:126)
>>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>>         at 
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>>         at 
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>>         at 
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>>         at 
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>         at org.apache.spark.scheduler.Task.run(Task.scala:89)
>>         at 
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:229)
>>         at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>         at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>         at java.lang.Thread.run(Thread.java:745)
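>> 
>> The trace shows the s3a filesystem failing in the AWS credential provider 
>> chain on the executors. For reference, one minimal way to supply the keys 
>> explicitly (the values are placeholders; core-site.xml or IAM instance 
>> roles are the usual alternatives):
>> 
>>     // Set s3a credentials on the SparkContext's Hadoop configuration.
>>     sc.hadoopConfiguration.set("fs.s3a.access.key", "<access-key>")
>>     sc.hadoopConfiguration.set("fs.s3a.secret.key", "<secret-key>")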
>> 
>> Can anyone help?
>> 
>> Cheers,
>> Ben
>> 
>> 
> 
> 
