Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

2016-06-13 Thread Bijay Pathak
>>> sqlContext.sql("set hive.enforce.bucketing = true; ") >>> sqlContext.sql("set hive.enforce.sorting = true; ") >>> sqlContext.sql(" CREATE EXTERNAL TABLE IF NOT EXISTS users (userId STRING, userRecord STRING) PARTITIONED BY (idPartitioner STRING, dtPartitioner STRING) sto
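As a hedged illustration of the pattern being discussed, here is a minimal pyspark sketch of a dynamic-partition insert into such a table. The STORED AS ORC clause, the LOCATION, and the source table users_staging are assumptions for the sketch, not details from the thread.

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="partitioned-insert-sketch")
sqlContext = HiveContext(sc)

# Enable dynamic partitioning so one insert can create the ~2000 partition directories.
sqlContext.sql("set hive.exec.dynamic.partition = true")
sqlContext.sql("set hive.exec.dynamic.partition.mode = nonstrict")
sqlContext.sql("set hive.enforce.bucketing = true")
sqlContext.sql("set hive.enforce.sorting = true")

# Table definition follows the quoted snippet; storage format and location are assumed.
sqlContext.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS users (userId STRING, userRecord STRING)
    PARTITIONED BY (idPartitioner STRING, dtPartitioner STRING)
    STORED AS ORC
    LOCATION 's3://some-bucket/users'
""")

# Dynamic-partition insert: the partition columns come last in the SELECT.
# users_staging is a hypothetical source table.
sqlContext.sql("""
    INSERT OVERWRITE TABLE users PARTITION (idPartitioner, dtPartitioner)
    SELECT userId, userRecord, idPartitioner, dtPartitioner FROM users_staging
""")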

Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

2016-06-10 Thread Bijay Pathak
Hello, Looks like you are hitting this: https://issues.apache.org/jira/browse/HIVE-11940. Thanks, Bijay On Thu, Jun 9, 2016 at 9:25 PM, Mich Talebzadeh wrote: > can you provide a code snippet of how you are populating the target table > from the temp table. > > > HTH

Re: Dataframe saves for a large set but throws OOM for a small dataset

2016-04-30 Thread Bijay Pathak
Sorry for the confusion, this was supposed to be the answer for another thread. Bijay On Sat, Apr 30, 2016 at 2:37 PM, Bijay Kumar Pathak wrote: > Hi, > > I was facing the same issue on Spark 1.6. My data size was around 100 GB > and I was writing into a partitioned Hive table. > > I

Fwd: Connection closed Exception.

2016-04-10 Thread Bijay Pathak
Hi, I am running Spark 1.6 on EMR. I have a workflow which does the following things: 1. Read the 2 flat files, create the data frames, and join them. 2. Read the particular partition from the Hive table and join the dataframe from step 1 with it. 3. Finally, insert overwrite into the Hive table

Connection closed Exception.

2016-04-10 Thread Bijay Pathak
Hi, I am running Spark 1.6 on EMR. I have a workflow which does the following things: 1. Read the 2 flat files, create the data frames, and join them. 2. Read the particular partition from the Hive table and join the dataframe from step 1 with it. 3. Finally, insert overwrite into the Hive table
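For reference, a rough pyspark sketch of that three-step workflow (Spark 1.6-era API). All paths, table names, columns, join keys, and the use of the spark-csv package below are placeholders and assumptions, not details from the thread.

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="emr-workflow-sketch")
sqlContext = HiveContext(sc)

# 1. Read the two flat files, build data frames, and join them
#    (spark-csv package assumed for parsing the flat files).
df1 = sqlContext.read.format("com.databricks.spark.csv") \
    .option("header", "true").load("s3://some-bucket/input/file1.csv")   # columns: id, name
df2 = sqlContext.read.format("com.databricks.spark.csv") \
    .option("header", "true").load("s3://some-bucket/input/file2.csv")   # columns: id, score
joined = df1.join(df2, "id")

# 2. Read one partition from the Hive table and join it with the result of step 1.
existing = sqlContext.sql("SELECT id, name AS old_name FROM target_table WHERE dt = '2016-04-10'")
merged = joined.join(existing, "id", "left_outer")

# 3. Insert overwrite into the Hive table partition.
merged.registerTempTable("merged_tmp")
sqlContext.sql("""
    INSERT OVERWRITE TABLE target_table PARTITION (dt = '2016-04-10')
    SELECT id, name FROM merged_tmp
""")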

Re: Converting a string of format of 'dd/MM/yyyy' in Spark sql

2016-03-24 Thread Bijay Pathak
Hi, I have written a UDF for doing the same on a pyspark DataFrame, since some of my dates are before the Unix epoch of 1/1/1970. I have more than 250 columns and am applying the custom date_format UDF to more than 50 columns. I am getting OOM errors and poor performance because of the UDF. What's your
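For context, a sketch of the kind of per-column date UDF being described (column and table names are made up). A Python UDF like this handles pre-1970 dates, but it pushes every value through Python serialization, which is consistent with the memory pressure and slowdown mentioned above.

from datetime import datetime
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

sc = SparkContext(appName="date-udf-sketch")
sqlContext = SQLContext(sc)

# Tiny stand-in DataFrame with two of the (hypothetical) ~50 date columns.
df = sqlContext.createDataFrame([Row(birth_dt="15/06/1965", start_dt="01/02/1990")])

def reformat_date(s):
    # 'dd/MM/yyyy' string -> ISO 'yyyy-MM-dd'; works for dates before 1970.
    if s is None:
        return None
    return datetime.strptime(s, "%d/%m/%Y").date().isoformat()

reformat_date_udf = udf(reformat_date, StringType())

# Apply the UDF column by column.
for c in ["birth_dt", "start_dt"]:
    df = df.withColumn(c, reformat_date_udf(df[c]))

df.show()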

Re: multiple tables for join

2016-03-24 Thread Bijay Pathak
Hi, Can you elaborate on the issues you are facing? I am doing a similar kind of join, so I may be able to provide you with some suggestions and pointers. Thanks, Bijay On Thu, Mar 24, 2016 at 5:12 AM, pseudo oduesp wrote: > hi, i spent two months of my time to

Re: adding rows to a DataFrame

2016-03-11 Thread Bijay Pathak
Here is another way you can achieve that (in Python): base_df.withColumn("column_name", "column_expression_for_new_column") # To add a new row, create a data frame containing the new row and do the unionAll() base_df.unionAll(new_df) # Another approach: convert to RDD, add the required fields, and convert
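A small runnable pyspark sketch of the unionAll approach (Spark 1.x API; in Spark 2.x the method is union()). The column names are made up, and both frames need matching schemas.

from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext(appName="add-rows-sketch")
sqlContext = SQLContext(sc)

base_df = sqlContext.createDataFrame([Row(id=1, name="a"), Row(id=2, name="b")])

# Build a one-row DataFrame with the same schema and append it.
new_df = sqlContext.createDataFrame([Row(id=3, name="c")])
result = base_df.unionAll(new_df)
result.show()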

Getting all field value as Null while reading Hive Table with Partition

2016-01-20 Thread Bijay Pathak
Hello, I am getting all field values as NULL while reading a Hive table with partitions in Spark 1.5.0 running on CDH 5.5.1 with YARN (Dynamic Allocation). Below is the command I used in spark-shell: import org.apache.spark.sql.hive.HiveContext val hiveContext = new HiveContext(sc) val
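As a point of comparison, a pyspark equivalent of the quoted spark-shell commands; the table and partition names are placeholders. If only this read returns NULLs, comparing the table's schema/serde in the Hive metastore against the underlying data files is one thing to check.

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="read-partitioned-hive-sketch")
hiveContext = HiveContext(sc)

# Read one partition of a (hypothetical) partitioned Hive table through the metastore.
df = hiveContext.sql("SELECT * FROM some_partitioned_table WHERE part_dt = '2016-01-20'")
df.printSchema()
df.show(10)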

Base metrics for Spark Benchmarking.

2015-04-16 Thread Bijay Pathak
Hello, We wanted to tune Spark running on a YARN cluster. The Spark History Server UI shows lots of parameters like: - GC time - Task Duration - Shuffle R/W - Shuffle Spill (Memory/Disk) - Serialization Time (Task/Result) - Scheduler Delay Among the above metrics, which are

Re: why Shuffle Write is not zero when everything is cached and there is enough memory?

2015-03-31 Thread Bijay Pathak
The Spark Sort-Based Shuffle (default from 1.1) keeps the data from each map task in memory until it can't fit, after which it is sorted and spilled to disk. You can reduce the shuffle write to disk by increasing spark.shuffle.memoryFraction (default 0.2). By writing the shuffle output
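A minimal sketch of raising that setting (the values are illustrative; spark.shuffle.memoryFraction applies to the pre-1.6 / legacy memory manager):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("shuffle-tuning-sketch")
        .set("spark.shuffle.memoryFraction", "0.4")    # default 0.2
        .set("spark.storage.memoryFraction", "0.4"))   # shrink the cache share to compensate (default 0.6)

sc = SparkContext(conf=conf)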

Re: why Shuffle Write is not zero when everything is cached and there is enough memory?

2015-03-31 Thread Bijay Pathak
from older map tasks to memory? On Tue, Mar 31, 2015 at 1:19 PM, Bijay Pathak bijay.pat...@cloudwick.com wrote: The Spark Sort-Based Shuffle (default from 1.1) keeps the data from each map task in memory until it can't fit, after which it is sorted and spilled to disk. You can reduce

Shuffle Spill Memory and Shuffle Spill Disk

2015-03-23 Thread Bijay Pathak
Hello, I am running TeraSort https://github.com/ehiggs/spark-terasort on 100GB of data. The final metrics I am getting on Shuffle Spill are: Shuffle Spill (Memory): 122.5 GB Shuffle Spill (Disk): 3.4 GB What's the difference and relation between these two metrics? Do these mean 122.5 GB was