On 24 Feb 2017, at 07:47, Femi Anthony <femib...@gmail.com> wrote:

Have you tried reading using s3n, which is a slightly older protocol? I'm not
sure how compatible s3a is with older versions of Spark.

I would absolutely not use s3n with a 1.3 GB file.

There is a WONTFIX JIRA on how s3n reads to the end of a file when you close
a stream, and as seek() closes the stream, every seek reads to the end of the
file. And as readFully(position, bytes) does a seek at either end, every time the
Parquet code tries to read a bit of data you get 1.3 GB of download:
https://issues.apache.org/jira/browse/HADOOP-12376

That is not going to be fixed, ever, because it can only be done by upgrading
the libraries, and that will simply move new bugs in, lead to different
bug reports, etc, etc. All for a piece of code which has been supplanted in the
hadoop-2.7.x JARs by s3a, ready for use, and which in the forthcoming hadoop-2.8+
code is significantly faster for IO (especially ORC/Parquet), multi-GB uploads,
and even the basic metadata operations used when setting up queries.

For Hadoop 2.7+, use s3a. Any issues with s3n will be closed as "use s3a".




Femi

On Fri, Feb 24, 2017 at 2:18 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
Hi Gourav,

My answers are below.

Cheers,
Ben


On Feb 23, 2017, at 10:57 PM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:

Can I ask where you are running your CDH? Is it on premise, or have you created
a cluster for yourself in AWS? Our cluster is on premise, in our data center.


You need to set up your s3a credentials in core-site.xml, spark-defaults.conf, or rely
on spark-submit picking up the submitter's AWS env vars and propagating them.
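
Something like this, for example (a rough, untested sketch; the "spark.hadoop." prefix just copies the property into the Hadoop configuration the job uses, and the env var names are the standard AWS ones):

    import org.apache.spark.{SparkConf, SparkContext}

    // sketch: pass the s3a credentials via SparkConf so the executors pick them up too
    val conf = new SparkConf()
      .setAppName("s3a-parquet-read")
      .set("spark.hadoop.fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
      .set("spark.hadoop.fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
    val sc = new SparkContext(conf)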


Also, I have really never seen s3a used before; that was used long ago, when
writing S3 files took a long time, but I think that you are reading the file.

Any ideas why you are not migrating to Spark 2.1? Besides speed, there are lots
of APIs which are new, and the existing ones are being deprecated, so there is a
very high chance that you are already working on code which is being deprecated
by the Spark community right now. We use CDH and upgrade with whatever Spark
version they include, which is 1.6.0. We are waiting for the move to Spark
2.0/2.1.

This is in the Hadoop codebase, not the Spark release; it will be the same
irrespective of the Spark version.


And besides that, would you not want to work on a platform which is at least 10
times faster? What would that be?

Regards,
Gourav Sengupta

On Thu, Feb 23, 2017 at 6:23 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
We are trying to use Spark 1.6 within CDH 5.7.1 to retrieve a 1.3GB Parquet 
file from AWS S3. We can read the schema and show some data when the file is 
loaded into a DataFrame, but when we try to do some operations, such as count, 
we get this error below.
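
For reference, this is roughly what we run (a minimal sketch; the s3a path is a placeholder for our real bucket):

    val df = sqlContext.read.parquet("s3a://our-bucket/path/to/data.parquet")
    df.printSchema()   // works: only reads the footer metadata on the driver
    df.show(5)         // works
    df.count()         // fails on the executors with the exception below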

com.cloudera.com.amazonaws.AmazonClientException: Unable to load AWS 
credentials from any provider in the chain
        at 
com.cloudera.com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
        at 
com.cloudera.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3779)
        at 
com.cloudera.com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1107)
        at 
com.cloudera.com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:1070)
        at 
org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:239)
        at 
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2711)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:97)
        at 
org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2748)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2730)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:385)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
        at 
parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:385)
        at 
parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:162)
        at 
parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:145)
        at 
org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.<init>(SqlNewHadoopRDD.scala:180)
        at 
org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:126)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
        at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
        at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
        at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
        at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
        at org.apache.spark.scheduler.Task.run(Task.scala:89)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:229)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)



This stack trace implies that it's an executor failing to authenticate with AWS,
and so it cannot read the bucket data. What may be happening is that code running
in your client is being authenticated, but the work the executors do to read the
RDD/DataFrame isn't.


1. Try cranking up the logging in org.apache.hadoop.fs.s3a and
com.cloudera.com.amazonaws, though all the auth code there deliberately avoids
printing out credentials, so it isn't that great for debugging things.
2. Make sure that the fs.s3a secret and access keys are getting down to the
executors (a quick check is sketched below).
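
A quick way to check from the driver, without printing the secrets themselves (a rough sketch only; it just proves the keys reached the driver-side Hadoop configuration):

    val hc = sc.hadoopConfiguration
    Seq("fs.s3a.access.key", "fs.s3a.secret.key").foreach { k =>
      // report whether the key is present, never its value
      println(s"$k set: ${hc.get(k) != null}")
    }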


For troubleshooting S3A, start with
https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/index.md

and/or
https://docs.hortonworks.com/HDPDocuments/HDCloudAWS/HDCloudAWS-1.8.0/bk_hdcloud-aws/content/s3-trouble/index.html


