Can I ask where you are running your CDH? Is it on-premise, or have you created a cluster for yourself in AWS?
Also, I have never really seen s3a used before; it was used long ago, when writing S3 files took a long time, but I think that you are only reading. Any ideas why you are not migrating to Spark 2.1? Besides speed, there are lots of new APIs, and the existing ones are being deprecated. Therefore there is a very high chance that you are already working on code which is being deprecated by the Spark community right now. And besides, would you not want to work on a platform which is at least 10 times faster?

Regards,
Gourav Sengupta

On Thu, Feb 23, 2017 at 6:23 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
> We are trying to use Spark 1.6 within CDH 5.7.1 to retrieve a 1.3GB
> Parquet file from AWS S3. We can read the schema and show some data when
> the file is loaded into a DataFrame, but when we try to do some operations,
> such as count, we get this error below.
>
> com.cloudera.com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain
>     at com.cloudera.com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
>     at com.cloudera.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3779)
>     at com.cloudera.com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1107)
>     at com.cloudera.com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:1070)
>     at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:239)
>     at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2711)
>     at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:97)
>     at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2748)
>     at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2730)
>     at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:385)
>     at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
>     at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:385)
>     at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:162)
>     at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:145)
>     at org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.<init>(SqlNewHadoopRDD.scala:180)
>     at org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:126)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>     at org.apache.spark.scheduler.Task.run(Task.scala:89)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:229)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
>
> Can anyone help?
>
> Cheers,
> Ben
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
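For reference, the "Unable to load AWS credentials from any provider in the chain" error in the quoted trace means the s3a connector searched its credential providers (configuration properties, environment variables, and on EC2 the instance profile) and found nothing. One common fix, sketched here only as an illustration (the two property names are the standard Hadoop s3a keys; the placeholder values must of course be replaced with real credentials, and on EC2 an IAM instance role is usually preferable to hard-coding keys), is to pass the keys through Spark's Hadoop configuration, e.g. in spark-defaults.conf:

    # Illustrative fragment for spark-defaults.conf.
    # spark.hadoop.* entries are copied into the Hadoop configuration
    # that S3AFileSystem.initialize() reads.
    spark.hadoop.fs.s3a.access.key   YOUR_ACCESS_KEY
    spark.hadoop.fs.s3a.secret.key   YOUR_SECRET_KEY

Equivalently, the same two fs.s3a.* properties can go into core-site.xml, or the standard AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY environment variables can be exported before launching the driver; which provider actually wins depends on the order of the credentials provider chain shown at the top of the stack trace.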