Re: Saving Parquet files to S3

2016-06-10 Thread Bijay Kumar Pathak
Hi Ankur, I also tried setting a property to write Parquet files of 256 MB. I am using PySpark; below is how I set the property, but it's not working for me. How did you set the property? spark_context._jsc.hadoopConfiguration().setInt("dfs.blocksize", 268435456)
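For reference, a minimal sketch of the configuration described above; the additional parquet.block.size row-group property is an assumption that is often set together with dfs.blocksize, not something confirmed in this thread:

    from pyspark import SparkContext

    sc = SparkContext(appName="parquet-block-size-sketch")

    block_size = 256 * 1024 * 1024  # 256 MB

    # Hadoop-level block size, as in the message above.
    sc._jsc.hadoopConfiguration().setInt("dfs.blocksize", block_size)
    # Parquet row-group size; commonly set alongside dfs.blocksize (assumption).
    sc._jsc.hadoopConfiguration().setInt("parquet.block.size", block_size)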

Re: Error joining dataframes

2016-05-17 Thread Bijay Kumar Pathak
Hi, Try this one: df_join = df1.join(df2, 'Id', "fullouter") Thanks, Bijay On Tue, May 17, 2016 at 9:39 AM, ram kumar wrote: > Hi, > I tried to join two dataframes: > df_join = df1.join(df2, ((df1("Id") === df2("Id")), "fullouter")
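A self-contained sketch of the suggested full outer join on a shared Id column; the sample data is made up purely for illustration:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="fullouter-join-sketch")
    sqlContext = SQLContext(sc)

    # Hypothetical sample data, only to illustrate the suggested join.
    df1 = sqlContext.createDataFrame([(1, "a"), (2, "b")], ["Id", "left_val"])
    df2 = sqlContext.createDataFrame([(2, "x"), (3, "y")], ["Id", "right_val"])

    # Passing the column name (rather than the Scala-style df1("Id") === df2("Id"))
    # keeps a single Id column in the joined result.
    df_join = df1.join(df2, "Id", "fullouter")
    df_join.show()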

Disable parquet metadata summary in

2016-05-05 Thread Bijay Kumar Pathak
Hi, How can we disable writing _common_metadata while saving a DataFrame in Parquet format in PySpark? I tried to set the property using the command below, but it didn't help. sparkContext._jsc.hadoopConfiguration().set("parquet.enable.summary-metadata", "false") Thanks, Bijay
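A sketch of two common ways this property is passed; the spark.hadoop.* prefix variant is an assumption about the setup, not something confirmed in the thread:

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("disable-parquet-summary-sketch")
            # Any spark.hadoop.* key is copied into the Hadoop configuration.
            .set("spark.hadoop.parquet.enable.summary-metadata", "false"))
    sc = SparkContext(conf=conf)

    # Equivalent to the call in the message above, set directly on the JVM
    # Hadoop configuration before any Parquet write happens.
    sc._jsc.hadoopConfiguration().set("parquet.enable.summary-metadata", "false")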

Re: SqlContext parquet read OutOfMemoryError: Requested array size exceeds VM limit error

2016-05-04 Thread Bijay Kumar Pathak
64-bit JVM with less than 32G heap, you might want to enable -XX:+UseCompressedOops [1]. And if your dataframe is somehow generating more than 2^31-1 arrays, you might have to rethink your options. [1] https://spark.apache.org/docs/latest/tuning.html On Wed,
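A sketch of passing the suggested JVM flag through Spark configuration; the flag only applies to 64-bit JVMs with heaps under roughly 32 GB:

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("compressed-oops-sketch")
            # Executor JVMs launch after this, so the flag takes effect there.
            .set("spark.executor.extraJavaOptions", "-XX:+UseCompressedOops"))
    sc = SparkContext(conf=conf)

    # For the driver JVM the flag must be passed at launch, e.g. with
    # spark-submit --driver-java-options "-XX:+UseCompressedOops".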

SqlContext parquet read OutOfMemoryError: Requested array size exceeds VM limit error

2016-05-04 Thread Bijay Kumar Pathak
Hi, I am reading a Parquet file of around 50+ GB which has 4013 partitions and 240 columns. Below is my configuration: driver: 20 GB memory with 4 cores; executors: 45 executors with 15 GB memory and 4 cores each. I tried to read the data using both the DataFrame reader and the hive context to read the data
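A sketch of the two read paths described above, assuming the data lives at a hypothetical path and is also registered as a Hive table:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext, HiveContext

    sc = SparkContext(appName="parquet-read-sketch")

    # Path 1: DataFrame reader over the raw Parquet files (path is hypothetical).
    sqlContext = SQLContext(sc)
    df = sqlContext.read.parquet("s3://bucket/path/to/parquet")

    # Path 2: HiveContext over a Hive table backed by the same data
    # (table name is hypothetical).
    hiveContext = HiveContext(sc)
    df_hive = hiveContext.sql("SELECT * FROM parquet_table")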

Re: Performance with Insert overwrite into Hive Table.

2016-05-04 Thread Bijay Kumar Pathak
Thanks Ted. This looks like the issue, since I am running it on EMR and the Hive version is 1.0.0. Thanks, Bijay On Wed, May 4, 2016 at 10:29 AM, Ted Yu <yuzhih...@gmail.com> wrote: > Looks like you were hitting HIVE-11940 > On Wed, May 4, 2016 at 10:02 AM, Bijay Kuma

Performance with Insert overwrite into Hive Table.

2016-05-04 Thread Bijay Kumar Pathak
Hello, I am writing a DataFrame of around 60+ GB into a partitioned Hive table using hiveContext in Parquet format. The Spark insert overwrite job completes in a reasonable amount of time, around 20 minutes. But the job is taking a huge amount of time, more than 2 hours, to copy data from the .hive-staging
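A minimal sketch of the write path described above; the input path, table name, and partition column are hypothetical, and the partition column is assumed to be the last column of the source DataFrame:

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="insert-overwrite-sketch")
    hiveContext = HiveContext(sc)

    # Allow dynamic partitioning for the overwrite.
    hiveContext.setConf("hive.exec.dynamic.partition", "true")
    hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

    df = hiveContext.read.parquet("s3://bucket/input/")  # hypothetical input
    df.registerTempTable("staging_df")

    # Dynamic partitioning requires the partition column to come last in the SELECT.
    hiveContext.sql("""
        INSERT OVERWRITE TABLE target_table PARTITION (event_date)
        SELECT * FROM staging_df
    """)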

Re: Dataframe saves for a large set but throws OOM for a small dataset

2016-04-30 Thread Bijay Kumar Pathak
Hi, I was facing the same issue on Spark 1.6. My data size was around 100 GB and I was writing into a partitioned Hive table. I was able to solve this issue by starting from 6 GB of memory and going up to 15 GB of memory per executor, with an overhead of 2 GB, and partitioning the DataFrame before doing
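A sketch of the configuration described above; the memory values come from the message, while the input path, repartition count, and table name are assumptions (the target is assumed to be an existing partitioned Hive table with dynamic partitioning enabled, as in the earlier sketch):

    from pyspark import SparkConf, SparkContext
    from pyspark.sql import HiveContext

    conf = (SparkConf()
            .setAppName("oom-tuning-sketch")
            .set("spark.executor.memory", "15g")
            .set("spark.yarn.executor.memoryOverhead", "2048"))  # MB, on YARN
    sc = SparkContext(conf=conf)
    hiveContext = HiveContext(sc)

    df = hiveContext.read.parquet("s3://bucket/input/")  # hypothetical input

    # Repartitioning before the write keeps the per-task memory footprint bounded.
    df.repartition(400).write.insertInto("target_table", overwrite=True)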

Re: Spark SQL insert overwrite table not showing all the partition.

2016-04-22 Thread Bijay Kumar Pathak
existing data in the table or partition, unless IF NOT EXISTS is provided for a partition (as of Hive 0.9.0, https://issues.apache.org/jira/browse/HIVE-2612). > Thanks. > Zhan Zhang > On Apr 21, 2016, at 3:20 PM, Bijay Kumar Pathak

Spark SQL insert overwrite table not showing all the partition.

2016-04-21 Thread Bijay Kumar Pathak
Hi, I have a job which writes to a Hive table with dynamic partitions. Inside the job, I am writing into the table two times, but I am only seeing the partitions from the last write, although I can see in the Spark UI that it is processing data for both partitions. Below is the query I am using to write
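A minimal sketch of the pattern described, assuming two writes that target different values of a hypothetical partition column:

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="two-writes-sketch")
    hiveContext = HiveContext(sc)
    hiveContext.setConf("hive.exec.dynamic.partition", "true")
    hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

    # Two hypothetical source DataFrames, each carrying a different partition value.
    df_a = hiveContext.createDataFrame([(1, "2016-04-20")], ["id", "event_date"])
    df_b = hiveContext.createDataFrame([(2, "2016-04-21")], ["id", "event_date"])

    for name, df in [("src_a", df_a), ("src_b", df_b)]:
        df.registerTempTable(name)
        # With dynamic partitions each overwrite is expected to replace only the
        # partitions present in its source; the thread reports seeing only the
        # partitions from the last write.
        hiveContext.sql(
            "INSERT OVERWRITE TABLE target_table PARTITION (event_date) "
            "SELECT id, event_date FROM {}".format(name))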

Reading conf file in Pyspark in cluster mode

2016-04-16 Thread Bijay Kumar Pathak
Hello, I have Spark jobs packaged in a zip and deployed using cluster mode on AWS EMR. The job has to read a conf file packaged with the zip under the resources directory. I can read the conf file in client mode but not in cluster mode. How do I read the conf file packaged in the zip while
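One possible approach (not taken from the thread) is to read the file from inside the zip via pkgutil, which works with zipimport in both client and cluster mode; the package and file names below are hypothetical:

    import pkgutil

    def load_conf():
        # resources/app.conf is assumed to live inside the zipped "myjob" package.
        raw = pkgutil.get_data("myjob", "resources/app.conf")
        return raw.decode("utf-8")

    # An alternative is to ship the file separately with
    # spark-submit --files app.conf and read it via pyspark.SparkFiles.get("app.conf").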

Re: Connection closed Exception.

2016-04-11 Thread Bijay Kumar Pathak
wrote: > Try increasing the memory allocated for this job. > On Sun, Apr 10, 2016 at 9:12 PM -0700, "Bijay Kumar Pathak" <bkpat...@mtu.edu> wrote: > Hi,

Connection closed Exception.

2016-04-10 Thread Bijay Kumar Pathak
Hi, I am running Spark 1.6 on EMR. I have a workflow which does the following things: 1. Read the two flat files, create the data frames, and join them. 2. Read a particular partition from the Hive table and join the dataframe from 1 with it. 3. Finally, insert overwrite into the Hive table
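A sketch of the three-step workflow described above; the paths, table names, column names, and join keys are all hypothetical:

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="emr-workflow-sketch")
    hiveContext = HiveContext(sc)

    # 1. Read the two flat files (assumed comma-delimited) and join the DataFrames.
    rdd1 = sc.textFile("s3://bucket/flat1.txt").map(lambda line: line.split(","))
    df1 = hiveContext.createDataFrame(rdd1, ["id", "val1"])
    rdd2 = sc.textFile("s3://bucket/flat2.txt").map(lambda line: line.split(","))
    df2 = hiveContext.createDataFrame(rdd2, ["id", "val2"])
    joined = df1.join(df2, "id")

    # 2. Read one partition from the Hive table and join it with the result of 1.
    part = hiveContext.sql(
        "SELECT id, event_date FROM source_table WHERE event_date = '2016-04-10'")
    enriched = joined.join(part, "id")

    # 3. Insert overwrite into the target Hive table (partition column last).
    enriched.registerTempTable("enriched")
    hiveContext.setConf("hive.exec.dynamic.partition", "true")
    hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
    hiveContext.sql(
        "INSERT OVERWRITE TABLE target_table PARTITION (event_date) "
        "SELECT id, val1, val2, event_date FROM enriched")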