SparkSQL API to insert DataFrame into a static partition?

2015-12-01 Thread Isabelle Phan
Hello, Is there any API to insert data into a single partition of a table? Let's say I have a table with 2 columns (col_a, col_b) and a partition by date. After doing some computation for a specific date, I have a DataFrame with 2 columns (col_a, col_b) which I would like to insert into a specifi

Re: SparkSQL API to insert DataFrame into a static partition?

2015-12-01 Thread Jeff Zhang
I don't think there's an API for that, but I think it would be reasonable and helpful for ETL. As a workaround, you can first register your DataFrame as a temp table and use SQL to insert into the static partition.
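A minimal sketch of this workaround, assuming a Spark 1.5-era HiveContext and a Hive table named "target" partitioned by date (the table name, column names, and partition value here are illustrative, not taken from the thread):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)

    // df holds the computed (col_a, col_b) rows for a single date
    df.registerTempTable("tmp_results")

    // Static partition insert: the partition value is fixed in the SQL itself,
    // so no dynamic-partition settings are needed.
    hiveContext.sql(
      """INSERT INTO TABLE target PARTITION (date = '2015-12-01')
        |SELECT col_a, col_b FROM tmp_results""".stripMargin)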

Re: SparkSQL API to insert DataFrame into a static partition?

2015-12-01 Thread Fengdong Yu
Hi, you can try this: if your table is under location "/test/table/" on HDFS and has partitions "/test/table/dt=2012" and "/test/table/dt=2013", then: df.write.mode(SaveMode.Append).partitionBy("date").save("/test/table")
> On Dec 2, 2015, at 10:50 AM, Isabelle Phan wrote:
> df.write.partitionBy("date").i
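Note that the column passed to partitionBy has to exist in the DataFrame and becomes the directory prefix, so the "date" column and the "dt=" directories above would need to agree. A small sketch of the path-based write, with illustrative names and paths:

    import org.apache.spark.sql.SaveMode

    // Appends new files under /test/table/date=<value>/ without touching
    // the existing partition directories.
    df.write
      .mode(SaveMode.Append)
      .partitionBy("date")
      .save("/test/table")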

Re: SparkSQL API to insert DataFrame into a static partition?

2015-12-02 Thread Rishi Mishra
As long as all your data is being inserted by Spark, and hence uses the same hash partitioner, what Fengdong mentioned should work.

Re: SparkSQL API to insert DataFrame into a static partition?

2015-12-02 Thread Michael Armbrust
You might also coalesce to 1 (or some small number) before writing, to avoid creating a lot of files in that partition if you know that there is not a ton of data.
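A sketch of the same write with this suggestion applied, assuming the partition's data is small enough that one (or a few) output files are enough:

    // coalesce(1) shrinks the number of tasks that write output, so the
    // partition directory ends up with a single file instead of one per task.
    df.coalesce(1)
      .write
      .mode(SaveMode.Append)
      .partitionBy("date")
      .save("/test/table")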

Re: SparkSQL API to insert DataFrame into a static partition?

2015-12-04 Thread Isabelle Phan
Thanks all for your replies! I tested both approaches: registering the temp table then executing SQL vs. saving to the HDFS file path directly. The problem with the second approach is that I am inserting data into a Hive table, so if I create a new partition with this method, Hive metadata is not updated
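For reference, the usual way around the missing-metastore-entry problem when partition directories are written straight to HDFS is to register the partition with Hive explicitly; a hedged sketch, with an illustrative table name:

    // Tell the metastore about the partition that was written directly to HDFS.
    hiveContext.sql(
      "ALTER TABLE target ADD IF NOT EXISTS PARTITION (date = '2015-12-01')")

    // Or rescan the table location and pick up any partitions it doesn't know about.
    hiveContext.sql("MSCK REPAIR TABLE target")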

Re: SparkSQL API to insert DataFrame into a static partition?

2015-12-05 Thread Michael Armbrust
> Follow-up question in this case: what is the cost of registering a temp table? Is there a limit to the number of temp tables that can be registered by a SparkContext?

It is pretty cheap. Just an entry in an in-memory hashtable mapping to a query plan (similar to a view).
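Since registration is just a name-to-plan mapping, creating and dropping temp tables is inexpensive; a tiny illustrative sketch:

    // Cheap: only adds a catalog entry pointing at df's logical plan.
    df.registerTempTable("tmp_results")

    // Removes the entry when it is no longer needed.
    hiveContext.dropTempTable("tmp_results")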