Ok, thanks a lot for the heads up.

Sent from my iPhone
> On Feb 25, 2017, at 10:58 AM, Steve Loughran <ste...@hortonworks.com> wrote:
>
>
>> On 24 Feb 2017, at 07:47, Femi Anthony <femib...@gmail.com> wrote:
>>
>> Have you tried reading using s3n, which is a slightly older protocol? I'm
>> not sure how compatible s3a is with older versions of Spark.
>
> I would absolutely not use s3n with a 1.3 GB file.
>
> There is a WONTFIX JIRA on how it will read to the end of a file when you
> close a stream, and as seek() closes a stream, every seek will read to the
> end of the file. And as readFully(position, bytes) does a seek at either
> end, every time the Parquet code tries to read a bit of data: 1.3 GB of
> download:
> https://issues.apache.org/jira/browse/HADOOP-12376
>
> That is not going to be fixed, ever, because it can only be done by
> upgrading the libraries, and that will simply move new bugs in, lead to
> different bug reports, etc, etc. All for a piece of code which has been
> supplanted in the hadoop-2.7.x JARs by s3a, ready for use, and in the
> forthcoming hadoop-2.8+ code, significantly faster for IO (especially
> ORC/Parquet), multi-GB uploads, and even the basic metadata operations
> used when setting up queries.
>
> For Hadoop 2.7+, use s3a. Any issues with s3n will be closed as "use s3a".
>
>
>>
>> Femi
>>
>>> On Fri, Feb 24, 2017 at 2:18 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>> Hi Gourav,
>>>
>>> My answers are below.
>>>
>>> Cheers,
>>> Ben
>>>
>>>
>>>> On Feb 23, 2017, at 10:57 PM, Gourav Sengupta
>>>> <gourav.sengu...@gmail.com> wrote:
>>>>
>>>> Can I ask where are you running your CDH? Is it on premise or have you
>>>> created a cluster for yourself in AWS?
>>>
>>> Our cluster is on premise in our data center.
>>>
> you need to set up your s3a credentials in core-site, spark-defaults, or
> rely on spark-submit picking up the submitter's AWS env vars and
> propagating them.
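[To make the credential setup Steve describes concrete, here is a minimal sketch. The property names `fs.s3a.access.key` and `fs.s3a.secret.key` are the standard Hadoop s3a credential keys (Spark forwards any `spark.hadoop.*` property into the Hadoop configuration); the key values below are placeholders, not real credentials.]

```properties
# spark-defaults.conf -- sketch only; substitute your own credentials.
spark.hadoop.fs.s3a.access.key   YOUR_ACCESS_KEY_ID
spark.hadoop.fs.s3a.secret.key   YOUR_SECRET_ACCESS_KEY
```

[Alternatively, the same two properties can live in core-site.xml, or you can export the standard `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` environment variables before running spark-submit and let it propagate them, per Steve's note above.]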
>
>
>>>> Also I have really never seen s3a used before; that was used way long
>>>> before, when writing s3 files took a long time, but I think that you
>>>> are reading it.
>>>>
>>>> Any ideas why you are not migrating to Spark 2.1? Besides speed, there
>>>> are lots of APIs which are new and the existing ones are being
>>>> deprecated. Therefore there is a very high chance that you are already
>>>> working on code which is being deprecated by the Spark community right
>>>> now.
>>>
>>> We use CDH and upgrade with whatever Spark version they include, which
>>> is 1.6.0. We are waiting for the move to Spark 2.0/2.1.
>
> this is in the hadoop codebase, not the spark release. it will be the same
> irrespective of the Spark version.
>
>>>>
>>>> And besides, would you not want to work on a platform which is at
>>>> least 10 times faster?
>>>
>>> What would that be?
>>>
>>>> Regards,
>>>> Gourav Sengupta
>>>>
>>>>> On Thu, Feb 23, 2017 at 6:23 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>> We are trying to use Spark 1.6 within CDH 5.7.1 to retrieve a 1.3GB
>>>>> Parquet file from AWS S3. We can read the schema and show some data
>>>>> when the file is loaded into a DataFrame, but when we try to do some
>>>>> operations, such as count, we get this error below.
>>>>>
>>>>> com.cloudera.com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain
>>>>>     at com.cloudera.com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
>>>>>     at com.cloudera.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3779)
>>>>>     at com.cloudera.com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1107)
>>>>>     at com.cloudera.com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:1070)
>>>>>     at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:239)
>>>>>     at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2711)
>>>>>     at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:97)
>>>>>     at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2748)
>>>>>     at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2730)
>>>>>     at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:385)
>>>>>     at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
>>>>>     at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:385)
>>>>>     at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:162)
>>>>>     at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:145)
>>>>>     at org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.<init>(SqlNewHadoopRDD.scala:180)
>>>>>     at org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:126)
>>>>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>>>>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>>>>>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>>>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>>>>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>>>>>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>>>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>>>>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>>>>>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>>>>>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>>>>     at org.apache.spark.scheduler.Task.run(Task.scala:89)
>>>>>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:229)
>>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>>>     at java.lang.Thread.run(Thread.java:745)
>>>>>
>
>
> This stack trace implies that it's an executor failing to authenticate with
> AWS and so read the bucket data. What may be happening is that code running
> in your client is being authenticated, but the work done in the executors
> to evaluate the RDD/dataframe isn't.
>
>
> 1. try cranking up the logging in org.apache.hadoop.fs.s3a and
> com.cloudera.com.amazonaws, though all the auth code there deliberately
> avoids printing out credentials, so isn't that great for debugging things.
> 2. make sure that the fs.s3a secret and auth keys are getting down.
>
>
> For troubleshooting S3A, start with
> https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/index.md
>
> and/or
> https://docs.hortonworks.com/HDPDocuments/HDCloudAWS/HDCloudAWS-1.8.0/bk_hdcloud-aws/content/s3-trouble/index.html
>
>
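[Steve's step 1, "crank up the logging", can be sketched as a log4j 1.x fragment; the exact file to edit depends on your CDH logging setup (e.g. Spark's conf/log4j.properties), so treat this as an illustrative assumption rather than a CDH-specific recipe.]

```properties
# log4j.properties -- sketch: raise the s3a and AWS SDK loggers to DEBUG.
# Note the Cloudera-shaded SDK package name from the stack trace above.
log4j.logger.org.apache.hadoop.fs.s3a=DEBUG
log4j.logger.com.cloudera.com.amazonaws=DEBUG
```

[Remember to apply this on the executors too, not just the driver, since the trace shows the failure happening inside executor tasks.]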