Can I ask where you are running your CDH? Is it on premises, or have you
created a cluster for yourself in AWS?

Also, I have not really seen s3a used in a long while; it was used way back
when writing S3 files took a long time, but I think in your case you are
only reading.
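
From the error below it looks like the S3A connector simply cannot find any
AWS credentials. One thing to check, just as a rough sketch (assuming you
are authenticating with access keys rather than an IAM instance profile,
and with placeholder bucket and path names), is whether the s3a keys are
set on the Hadoop configuration before you read:

  // set S3A credentials on the underlying Hadoop configuration
  sc.hadoopConfiguration.set("fs.s3a.access.key", "<your access key>")
  sc.hadoopConfiguration.set("fs.s3a.secret.key", "<your secret key>")

  // Spark 1.6 style read through the SQLContext
  val df = sqlContext.read.parquet("s3a://your-bucket/path/to/file.parquet")
  df.count()

If the cluster is running in AWS, attaching an IAM role to the instances is
usually cleaner than embedding keys.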

Any ideas on why you are not migrating to Spark 2.1? Besides speed, there
are lots of new APIs, and many existing ones are being deprecated.
Therefore there is a very high chance that you are already writing code
against APIs that the Spark community is deprecating right now.
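
For example, just as a rough sketch (placeholder names again, not your
code), in Spark 2.x the SparkSession becomes the entry point and the old
SQLContext/HiveContext style is deprecated in its favour, so the 1.6-style
entry points would need to change anyway:

  import org.apache.spark.sql.SparkSession

  // Spark 2.x entry point; replaces the old SQLContext/HiveContext
  val spark = SparkSession.builder()
    .appName("parquet-count")
    .getOrCreate()

  // same read as before, but through the new session
  val df = spark.read.parquet("s3a://your-bucket/path/to/file.parquet")
  df.count()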

And besides that, would you not want to work on a platform which is at
least 10 times faster?

Regards,
Gourav Sengupta

On Thu, Feb 23, 2017 at 6:23 PM, Benjamin Kim <bbuil...@gmail.com> wrote:

> We are trying to use Spark 1.6 within CDH 5.7.1 to retrieve a 1.3GB
> Parquet file from AWS S3. We can read the schema and show some data when
> the file is loaded into a DataFrame, but when we try to do some operations,
> such as count, we get this error below.
>
> com.cloudera.com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain
>         at com.cloudera.com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
>         at com.cloudera.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3779)
>         at com.cloudera.com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1107)
>         at com.cloudera.com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:1070)
>         at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:239)
>         at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2711)
>         at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:97)
>         at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2748)
>         at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2730)
>         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:385)
>         at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
>         at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:385)
>         at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:162)
>         at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:145)
>         at org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.<init>(SqlNewHadoopRDD.scala:180)
>         at org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:126)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>         at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>         at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>         at org.apache.spark.scheduler.Task.run(Task.scala:89)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:229)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
>
> Can anyone help?
>
> Cheers,
> Ben
>
>
