Have you tried reading using s3n, which is a slightly older protocol? I'm
not sure how compatible s3a is with older versions of Spark.
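The error below ("Unable to load AWS credentials from any provider in the
chain") suggests the provider chain found no credentials at all, so setting
them explicitly may also help. A sketch of both approaches, assuming Spark
1.6 in spark-shell (with the default `sc` and `sqlContext`); the access
keys and the bucket/path are placeholders:

```scala
// Runs inside spark-shell, where sc (SparkContext) and sqlContext are predefined.
val hadoopConf = sc.hadoopConfiguration

// Option 1: stay on s3a, but supply credentials explicitly so the
// provider chain does not have to discover them.
hadoopConf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")   // placeholder
hadoopConf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")   // placeholder

// Option 2: fall back to the older s3n connector, which reads
// different configuration keys.
hadoopConf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")       // placeholder
hadoopConf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")   // placeholder

// Then read with the matching scheme, e.g. s3n:// for option 2.
val df = sqlContext.read.parquet("s3n://your-bucket/path/to/file.parquet")
df.count()
```

If the keys are set, the count should get past the credentials lookup that
fails in the trace below.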


Femi

On Fri, Feb 24, 2017 at 2:18 AM, Benjamin Kim <bbuil...@gmail.com> wrote:

> Hi Gourav,
>
> My answers are below.
>
> Cheers,
> Ben
>
>
> On Feb 23, 2017, at 10:57 PM, Gourav Sengupta <gourav.sengu...@gmail.com>
> wrote:
>
> Can I ask where you are running your CDH? Is it on premise, or have you
> created a cluster for yourself in AWS? Our cluster is on premise in our
> data center.
>
> Also, I have really never seen s3a used before; it was used long ago,
> when writing S3 files took a long time, but I think that you are only
> reading.
>
> Any ideas why you are not migrating to Spark 2.1? Besides speed, there are
> lots of new APIs, and the existing ones are being deprecated. Therefore
> there is a very high chance that you are already working on code which is
> being deprecated by the Spark community right now. We use CDH and
> upgrade with whatever Spark version it includes, which is 1.6.0. We are
> waiting for the move to Spark 2.0/2.1.
>
> And besides that, would you not want to work on a platform which is at
> least 10 times faster? What would that be?
>
> Regards,
> Gourav Sengupta
>
> On Thu, Feb 23, 2017 at 6:23 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>
>> We are trying to use Spark 1.6 within CDH 5.7.1 to retrieve a 1.3GB
>> Parquet file from AWS S3. We can read the schema and show some data when
>> the file is loaded into a DataFrame, but when we try to do some operations,
>> such as count, we get this error below.
>>
>> com.cloudera.com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain
>>         at com.cloudera.com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
>>         at com.cloudera.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3779)
>>         at com.cloudera.com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1107)
>>         at com.cloudera.com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:1070)
>>         at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:239)
>>         at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2711)
>>         at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:97)
>>         at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2748)
>>         at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2730)
>>         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:385)
>>         at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
>>         at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:385)
>>         at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:162)
>>         at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:145)
>>         at org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.<init>(SqlNewHadoopRDD.scala:180)
>>         at org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:126)
>>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>>         at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>>         at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>         at org.apache.spark.scheduler.Task.run(Task.scala:89)
>>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:229)
>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>         at java.lang.Thread.run(Thread.java:745)
>>
>> Can anyone help?
>>
>> Cheers,
>> Ben
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>
>


-- 
http://www.femibyte.com/twiki5/bin/view/Tech/
http://www.nextmatrix.com
"Great spirits have always encountered violent opposition from mediocre
minds." - Albert Einstein.