Re: Get S3 Parquet File

2017-02-27 Thread Femi Anthony
Ok, thanks a lot for the heads up. Sent from my iPhone

Re: Get S3 Parquet File

2017-02-25 Thread Steve Loughran
On 24 Feb 2017, at 07:47, Femi Anthony wrote: Have you tried reading using s3n, which is a slightly older protocol? I'm not sure how compatible s3a is with older versions of Spark. I would absolutely not use s3n with a 1.2 GB file. There is a

Re: Get S3 Parquet File

2017-02-24 Thread Benjamin Kim
Gourav, I’ll start experimenting with Spark 2.1 to see if this works. Cheers, Ben

Re: Get S3 Parquet File

2017-02-24 Thread Gourav Sengupta
Hi Benjamin, First of all, fetching data from S3 while running code on an on-premise system is a very bad idea. You might want to first copy the data into local HDFS before running your code. Of course, this depends on the volume of data and the internet speed that you have. The platform which makes
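A minimal sketch of the copy-to-HDFS-first approach described above, assuming a Spark setup with the s3a connector on the classpath and an existing `sc`/`sqlContext`; the bucket and HDFS paths are placeholders, not taken from the thread (`hadoop distcp` is another common way to do the copy):

```scala
// Sketch only: pull the remote Parquet data down once, then work locally.
// "some-bucket" and the HDFS path are illustrative placeholders.
val remote = sqlContext.read.parquet("s3a://some-bucket/path/to/file.parquet")

// Write a local copy into on-premise HDFS...
remote.write.parquet("hdfs:///tmp/local-copy.parquet")

// ...and run subsequent jobs against the local copy instead of S3.
val local = sqlContext.read.parquet("hdfs:///tmp/local-copy.parquet")
```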

Re: Get S3 Parquet File

2017-02-23 Thread Femi Anthony
Have you tried reading using s3n, which is a slightly older protocol? I'm not sure how compatible s3a is with older versions of Spark. Femi
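For what it's worth, trying the older s3n connector is mostly a URI-scheme change plus different credential property names; a hedged sketch, with placeholder paths and keys, assuming `sc` and `sqlContext` already exist (note the later reply in the thread advises against s3n for a file this size):

```scala
// Sketch: s3n uses different Hadoop credential property names than s3a.
// accessKey/secretKey and the bucket path are placeholders.
val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3n.awsAccessKeyId", accessKey)
hadoopConf.set("fs.s3n.awsSecretAccessKey", secretKey)

// Same read as before, only the scheme changes from s3a:// to s3n://.
val df = sqlContext.read.parquet("s3n://some-bucket/path/to/file.parquet")
```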

Re: Get S3 Parquet File

2017-02-23 Thread Benjamin Kim
Hi Gourav, My answers are below. Cheers, Ben

Re: Get S3 Parquet File

2017-02-23 Thread Gourav Sengupta
Can I ask where you are running your CDH? Is it on premise, or have you created a cluster for yourself in AWS? Also, I have really never seen s3a used before; it was used long ago, when writing S3 files took a long time, but I think that you are reading it. Any ideas why you are not

Re: Get S3 Parquet File

2017-02-23 Thread Benjamin Kim
Aakash, Here is a code snippet for the keys. val accessKey = "---" val secretKey = "---" val hadoopConf = sc.hadoopConfiguration hadoopConf.set("fs.s3a.access.key", accessKey) hadoopConf.set("fs.s3a.secret.key", secretKey) hadoopConf.set("spark.hadoop.fs.s3a.access.key", accessKey)
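The snippet above is cut off mid-line in the archive; a cleaned-up sketch of the same configuration, with straight quotes and a presumed matching secret-key line for the truncated part (the "---" key values stay as placeholders):

```scala
// Sketch of the configuration quoted above; "---" placeholders stand in
// for the real keys, which should not be hard-coded in production code.
val accessKey = "---"
val secretKey = "---"

val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3a.access.key", accessKey)
hadoopConf.set("fs.s3a.secret.key", secretKey)

// The "spark.hadoop."-prefixed variants route the same settings through
// the Spark configuration; the truncated line presumably continues with
// the corresponding secret-key setting, added here as an assumption.
hadoopConf.set("spark.hadoop.fs.s3a.access.key", accessKey)
hadoopConf.set("spark.hadoop.fs.s3a.secret.key", secretKey)
```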

Re: Get S3 Parquet File

2017-02-23 Thread Aakash Basu
Hey, Please recheck the access key and secret key being used to fetch the Parquet file. It seems to be a credentials error: either a mismatch or a loading problem. If it is a loading problem, first use the keys directly in code and see if the issue resolves; then they can be hidden and read from input params. Thanks, Aakash. On

Get S3 Parquet File

2017-02-23 Thread Benjamin Kim
We are trying to use Spark 1.6 within CDH 5.7.1 to retrieve a 1.3GB Parquet file from AWS S3. We can read the schema and show some data when the file is loaded into a DataFrame, but when we try to do some operations, such as count, we get this error below.
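A minimal sketch of the kind of job being described, assuming Spark 1.6 with the hadoop-aws and AWS SDK jars on the classpath; the bucket and path are placeholders, not taken from the thread:

```scala
// Sketch only: Spark 1.6-era API (SQLContext rather than SparkSession).
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("s3a-parquet-read"))
val sqlContext = new SQLContext(sc)

// Loading the file and inspecting the schema is lazy, so it can succeed...
val df = sqlContext.read.parquet("s3a://some-bucket/path/to/file.parquet")
df.printSchema()
df.show(5)

// ...whereas an action such as count() forces a full read of the 1.3 GB
// file, which is reportedly where the error appears.
println(df.count())
```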