Hi Gourav, My answers are below.
Cheers,
Ben

> On Feb 23, 2017, at 10:57 PM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>
> Can I ask where you are running your CDH? Is it on premise, or have you created a cluster for yourself in AWS?

Our cluster is on premise, in our data center.

> Also, I have really never seen s3a used before; that was used way back when writing S3 files took a long time, but I think that you are reading it.
>
> Any ideas why you are not migrating to Spark 2.1? Besides speed, there are lots of APIs which are new, and the existing ones are being deprecated. Therefore there is a very high chance that you are already working on code which is being deprecated by the Spark community right now.

We use CDH and upgrade with whatever Spark version they include, which is 1.6.0. We are waiting for the move to Spark 2.0/2.1.

> And besides that, would you not want to work on a platform which is at least 10 times faster?

What would that be?

> Regards,
> Gourav Sengupta
>
> On Thu, Feb 23, 2017 at 6:23 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>
> We are trying to use Spark 1.6 within CDH 5.7.1 to retrieve a 1.3GB Parquet file from AWS S3. We can read the schema and show some data when the file is loaded into a DataFrame, but when we try to do some operations, such as count, we get the error below.
>
> com.cloudera.com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain
>         at com.cloudera.com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
>         at com.cloudera.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3779)
>         at com.cloudera.com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1107)
>         at com.cloudera.com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:1070)
>         at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:239)
>         at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2711)
>         at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:97)
>         at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2748)
>         at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2730)
>         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:385)
>         at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
>         at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:385)
>         at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:162)
>         at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:145)
>         at org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.<init>(SqlNewHadoopRDD.scala:180)
>         at org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:126)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>         at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>         at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>         at org.apache.spark.scheduler.Task.run(Task.scala:89)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:229)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
>
> Can anyone help?
>
> Cheers,
> Ben
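
The error means none of the providers in the S3A credentials chain could supply access keys when S3AFileSystem was initialized. Below is a minimal sketch of one way to pass the keys explicitly in Spark 1.6 Scala code; the access/secret key values and the s3a:// path are placeholders, and the app name is made up for illustration. The fs.s3a.access.key and fs.s3a.secret.key properties are the standard Hadoop S3A settings.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Spark 1.6-style entry points (SparkContext + SQLContext).
val sc = new SparkContext(new SparkConf().setAppName("S3AParquetRead"))
val sqlContext = new SQLContext(sc)

// Standard Hadoop S3A credential properties; without these (or another
// provider in the chain, such as environment variables or an instance
// profile), S3AFileSystem.initialize fails as in the stack trace above.
sc.hadoopConfiguration.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")  // placeholder
sc.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")  // placeholder

// Reading the schema is cheap; an action such as count() makes the
// executors open the Parquet file, which is where the error surfaced.
val df = sqlContext.read.parquet("s3a://your-bucket/path/to/file.parquet")  // placeholder path
df.printSchema()
println(df.count())

Depending on how the CDH cluster is set up, the same keys may also need to be visible on the executor side, for example by setting them cluster-wide in core-site.xml or passing them as spark.hadoop.fs.s3a.* properties at submit time, rather than only on the driver's Hadoop configuration.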