Gourav, I’ll start experimenting with Spark 2.1 to see if this works.
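Roughly, this is the minimal sketch I have in mind to try first, assuming the Spark 2.x SparkSession entry point; the app name, bucket, and path below are placeholders, not our real ones:

    import org.apache.spark.sql.SparkSession

    // Spark 2.x entry point, replacing the SQLContext we use on 1.6.
    val spark = SparkSession.builder()
      .appName("s3a-parquet-test")  // placeholder app name
      .getOrCreate()

    // Placeholder bucket and path, just for illustration.
    val df = spark.read.parquet("s3a://my-bucket/path/to/file.parquet")
    df.printSchema()  // this step works for us on 1.6
    df.count()        // this is the step that fails on 1.6

If the plain read works, I'll rerun the count that currently fails for us.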
Cheers,
Ben

> On Feb 24, 2017, at 5:46 AM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>
> Hi Benjamin,
>
> First of all, fetching data from S3 while writing code on an on-premise system is a very bad idea. You might want to first copy the data into local HDFS before running your code. Of course, this depends on the volume of data and the internet speed that you have.
>
> The platform which makes your data at least 10 times faster is Spark 2.1. And trust me, you do not want to be writing code which you will need to update again in 6 months because newer versions of Spark have deprecated it.
>
> Regards,
> Gourav Sengupta
>
> On Fri, Feb 24, 2017 at 7:18 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
> Hi Gourav,
>
> My answers are below.
>
> Cheers,
> Ben
>
>> On Feb 23, 2017, at 10:57 PM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>>
>> Can I ask where you are running your CDH? Is it on premise, or have you created a cluster for yourself in AWS?
>
> Our cluster is on premise in our data center.
>
>> Also, I have really never seen s3a used before; it was used long ago, when writing S3 files took a long time, but I think that you are reading it.
>>
>> Any ideas why you are not migrating to Spark 2.1? Besides speed, there are lots of APIs which are new, and the existing ones are being deprecated. Therefore there is a very high chance that you are already working on code which is being deprecated by the Spark community right now.
>
> We use CDH and upgrade with whatever Spark version they include, which is 1.6.0. We are waiting for the move to Spark 2.0/2.1.
>
>> And besides that, would you not want to work on a platform which is at least 10 times faster?
>
> What would that be?
>
>> Regards,
>> Gourav Sengupta
>>
>> On Thu, Feb 23, 2017 at 6:23 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>> We are trying to use Spark 1.6 within CDH 5.7.1 to retrieve a 1.3GB Parquet file from AWS S3. We can read the schema and show some data when the file is loaded into a DataFrame, but when we try to do some operations, such as count, we get the error below.
>>
>> com.cloudera.com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain
>>     at com.cloudera.com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
>>     at com.cloudera.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3779)
>>     at com.cloudera.com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1107)
>>     at com.cloudera.com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:1070)
>>     at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:239)
>>     at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2711)
>>     at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:97)
>>     at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2748)
>>     at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2730)
>>     at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:385)
>>     at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
>>     at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:385)
>>     at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:162)
>>     at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:145)
>>     at org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.<init>(SqlNewHadoopRDD.scala:180)
>>     at org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:126)
>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>     at org.apache.spark.scheduler.Task.run(Task.scala:89)
>>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:229)
>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>     at java.lang.Thread.run(Thread.java:745)
>>
>> Can anyone help?
>>
>> Cheers,
>> Ben
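For anyone hitting the same trace: the failure happens when S3AFileSystem initializes and the credentials provider chain finds no keys. One minimal Spark 1.6 sketch that sets the standard fs.s3a.* properties explicitly on the Hadoop configuration; the environment variable names and the bucket/path are placeholders:

    import org.apache.spark.sql.SQLContext

    // Assumes a spark-shell session where `sc` (the SparkContext) already exists.
    // S3A looks up its keys in the Hadoop configuration; the trace above shows
    // the provider chain finding none, so set them explicitly here (or put
    // spark.hadoop.fs.s3a.access.key / .secret.key in spark-defaults.conf).
    sc.hadoopConfiguration.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
    sc.hadoopConfiguration.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

    val sqlContext = new SQLContext(sc)
    val df = sqlContext.read.parquet("s3a://my-bucket/path/to/file.parquet")  // placeholder path
    df.count()  // the operation that previously triggered the credentials error

Setting the keys before the first S3A access matters, since the FileSystem object is cached once initialized.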