Ok, thanks a lot for the heads up.

Sent from my iPhone
> On Feb 25, 2017, at 10:58 AM, Steve Loughran <ste...@hortonworks.com> wrote:
>
>
>> On 24 Feb 2017, at 07:47, Femi Anthony <femib...@gmail.com> wrote:
>>
>> Have you tried reading using s3n, which is a slightly older protocol? I'm
>> not sure how compatible s3a is with older versions of Spark.
>
> I would absolutely not use s3n with a 1.3 GB file.
>
> There is a WONTFIX JIRA on how it will read to the end of a file when you
> close a stream, and as seek() closes a stream, every seek will read to the
> end of the file. And as readFully(position, bytes) does a seek at either
> end, every time the Parquet code tries to read a bit of data: 1.3 GB of
> download:
> https://issues.apache.org/jira/browse/HADOOP-12376
>
> That is not going to be fixed, ever, because it can only be done by
> upgrading the libraries, and that will simply move new bugs in, lead to
> different bug reports, etc, etc. All for a piece of code which has been
> supplanted in the hadoop-2.7.x JARs by s3a, ready for use, and in the
> forthcoming hadoop-2.8+ code, significantly faster for IO (especially
> ORC/Parquet), multi-GB uploads, and even the basic metadata operations
> used when setting up queries.
>
> For Hadoop 2.7+, use s3a. Any issues with s3n will be closed as "use s3a".
>
>
>>
>> Femi
>>
>>> On Fri, Feb 24, 2017 at 2:18 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>> Hi Gourav,
>>>
>>> My answers are below.
>>>
>>> Cheers,
>>> Ben
>>>
>>>
>>>> On Feb 23, 2017, at 10:57 PM, Gourav Sengupta
>>>> <gourav.sengu...@gmail.com> wrote:
>>>>
>>>> Can I ask where are you running your CDH? Is it on premise or have you
>>>> created a cluster for yourself in AWS?
>>>
>>> Our cluster is on premise in our data center.
>>>
> you need to set up your s3a credentials in core-site, spark-defaults, or
> rely on spark-submit picking up the submitter's AWS env vars and
> propagating them.
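[To make the credential setup Steve describes concrete, here is a minimal sketch. The property names `fs.s3a.access.key` and `fs.s3a.secret.key` are the standard Hadoop s3a credential keys (Spark forwards any `spark.hadoop.*` property into the Hadoop configuration); the key values below are placeholders, not real credentials.]

```properties
# spark-defaults.conf -- sketch only; substitute your own credentials.
spark.hadoop.fs.s3a.access.key   YOUR_ACCESS_KEY_ID
spark.hadoop.fs.s3a.secret.key   YOUR_SECRET_ACCESS_KEY
```

[Alternatively, the same two properties can live in core-site.xml, or you can export the standard `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` environment variables before running spark-submit and let it propagate them, per Steve's note above.]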
>
>
>>>> Also I have really never seen s3a used before; that was used way long
>>>> before, when writing s3 files took a long time, but I think that you
>>>> are reading it.
>>>>
>>>> Any ideas why you are not migrating to Spark 2.1? Besides speed, there
>>>> are lots of APIs which are new and the existing ones are being
>>>> deprecated. Therefore there is a very high chance that you are already
>>>> working on code which is being deprecated by the Spark community right
>>>> now.
>>>
>>> We use CDH and upgrade with whatever Spark version they include, which
>>> is 1.6.0. We are waiting for the move to Spark 2.0/2.1.
>
> this is in the hadoop codebase, not the spark release. it will be the same
> irrespective of the Spark version.
>
>>>>
>>>> And besides, would you not want to work on a platform which is at
>>>> least 10 times faster?
>>>
>>> What would that be?
>>>
>>>> Regards,
>>>> Gourav Sengupta
>>>>
>>>>> On Thu, Feb 23, 2017 at 6:23 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>> We are trying to use Spark 1.6 within CDH 5.7.1 to retrieve a 1.3GB
>>>>> Parquet file from AWS S3. We can read the schema and show some data
>>>>> when the file is loaded into a DataFrame, but when we try to do some
>>>>> operations, such as count, we get this error below.
>>>>>
>>>>> com.cloudera.com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain
>>>>>     at com.cloudera.com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
>>>>>     at com.cloudera.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3779)
>>>>>     at com.cloudera.com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1107)
>>>>>     at com.cloudera.com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:1070)
>>>>>     at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:239)
>>>>>     at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2711)
>>>>>     at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:97)
>>>>>     at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2748)
>>>>>     at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2730)
>>>>>     at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:385)
>>>>>     at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
>>>>>     at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:385)
>>>>>     at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:162)
>>>>>     at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:145)
>>>>>     at org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.<init>(SqlNewHadoopRDD.scala:180)
>>>>>     at org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:126)
>>>>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>>>>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>>>>>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>>>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>>>>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>>>>>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>>>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>>>>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>>>>>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>>>>>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>>>>     at org.apache.spark.scheduler.Task.run(Task.scala:89)
>>>>>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:229)
>>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>>>     at java.lang.Thread.run(Thread.java:745)
>>>>>
>
>
> This stack trace implies that it's an executor failing to authenticate with
> AWS and so read the bucket data. What may be happening is that code running
> in your client is being authenticated, but the work done in the executors
> to evaluate the RDD/dataframe isn't.
>
>
> 1. try cranking up the logging in org.apache.hadoop.fs.s3a and
> com.cloudera.com.amazonaws, though all the auth code there deliberately
> avoids printing out credentials, so isn't that great for debugging things.
> 2. make sure that the fs.s3a secret and auth keys are getting down.
>
>
> For troubleshooting S3A, start with
> https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/index.md
>
> and/or
> https://docs.hortonworks.com/HDPDocuments/HDCloudAWS/HDCloudAWS-1.8.0/bk_hdcloud-aws/content/s3-trouble/index.html
>
>
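[Steve's step 1, "crank up the logging", can be sketched as a log4j 1.x fragment; the exact file to edit depends on your CDH logging setup (e.g. Spark's conf/log4j.properties), so treat this as an illustrative assumption rather than a CDH-specific recipe.]

```properties
# log4j.properties -- sketch: raise the s3a and AWS SDK loggers to DEBUG.
# Note the Cloudera-shaded SDK package name from the stack trace above.
log4j.logger.org.apache.hadoop.fs.s3a=DEBUG
log4j.logger.com.cloudera.com.amazonaws=DEBUG
```

[Remember to apply this on the executors too, not just the driver, since the trace shows the failure happening inside executor tasks.]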