Re: Saving dataframes with partitionBy: append partitions, overwrite within each

2018-08-02 Thread Nirav Patel
I tried the following to explicitly specify partition columns in the SQL statement, and also tried different cases (upper and lower) for the partition columns: insert overwrite table $tableName PARTITION(P1, P2) select A, B, C, P1, P2 from updateTable. Still getting: Caused by:
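For reference, a minimal sketch of the dynamic-partition INSERT OVERWRITE approach attempted above, assuming a Hive-enabled session; the table name, columns, and sample data here are hypothetical. Hive also needs nonstrict dynamic partition mode for a statement like this:

    import org.apache.spark.sql.SparkSession

    // Assumes a Hive-enabled session; table, columns, and data are hypothetical.
    val spark = SparkSession.builder()
      .appName("dynamic-partition-overwrite")
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._

    // Hive rejects fully dynamic partition specs unless nonstrict mode is on.
    spark.sql("SET hive.exec.dynamic.partition=true")
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

    // Expose the updates as a temp view; partition columns go last in the SELECT.
    val updatesDf = Seq(("a1", "b1", "c1", "k1", "k2")).toDF("A", "B", "C", "P1", "P2")
    updatesDf.createOrReplaceTempView("updateTable")

    // For a Hive table, dynamic partitioning replaces only the partitions
    // actually present in updateTable, leaving the others in place.
    spark.sql("""
      INSERT OVERWRITE TABLE target_table PARTITION (P1, P2)
      SELECT A, B, C, P1, P2 FROM updateTable
    """)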

Re: Saving dataframes with partitionBy: append partitions, overwrite within each

2018-08-02 Thread Nirav Patel
Thanks Koert. I'll check that out when we can update to 2.3. Meanwhile, I am trying a Hive SQL (INSERT OVERWRITE) statement to overwrite multiple partitions (without losing existing ones). It's giving me issues around partition columns. dataFrame.createOrReplaceTempView("updateTable")

Re: Saving dataframes with partitionBy: append partitions, overwrite within each

2018-08-01 Thread Koert Kuipers
This works for dataframes with Spark 2.3 by changing a global setting, and it will be configurable per write in 2.4. See: https://issues.apache.org/jira/browse/SPARK-20236 https://issues.apache.org/jira/browse/SPARK-24860 On Wed, Aug 1, 2018 at 3:11 PM, Nirav Patel wrote: > Hi Peay, > > Have you
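A short sketch of what that looks like, assuming a SparkSession named spark and a dataframe df partitioned by day:

    // Spark 2.3+ (SPARK-20236): session-wide dynamic partition overwrite.
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

    // Only the day= partitions present in df are replaced; the rest are kept.
    df.write
      .mode("overwrite")
      .partitionBy("day")
      .parquet("dataset.parquet")

    // Spark 2.4+ (SPARK-24860): the same behavior per write, no global setting.
    df.write
      .mode("overwrite")
      .option("partitionOverwriteMode", "dynamic")
      .partitionBy("day")
      .parquet("dataset.parquet")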

Re: Saving dataframes with partitionBy: append partitions, overwrite within each

2018-08-01 Thread Nirav Patel
Hi Peay, Have you found a better solution yet? I am having the same issue. The following says it works with Spark 2.1 onward, but only when you use sqlContext and not the DataFrame API: https://medium.com/@anuvrat/writing-into-dynamic-partitions-using-spark-2e2b818a007a Thanks, Nirav On Mon, Oct 2, 2017 at 4:37

Re: Saving dataframes with partitionBy: append partitions, overwrite within each

2017-10-02 Thread Pavel Knoblokh
If your processing task inherently processes input data by month, you may want to "manually" partition the output data by month as well as by day, that is, to save it under a path that includes the given month, e.g. "dataset.parquet/month=01". Then you will be able to use the overwrite mode with
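A sketch of this, assuming the job has already narrowed its input to a single month (names and paths are illustrative): each monthly run overwrites only its own month= directory, never its siblings.

    // monthDf holds exactly one month of data (hypothetical).
    monthDf.write
      .mode("overwrite")                    // replaces only this month's directory
      .partitionBy("day")                   // day=... subdirectories within the month
      .parquet("dataset.parquet/month=01")  // "manual" month partition in the path

Since the path uses the key=value convention, partition discovery should still pick up month as a column when reading dataset.parquet as a whole.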

Re: Saving dataframes with partitionBy: append partitions, overwrite within each

2017-09-29 Thread Vadim Semenov
As an alternative: checkpoint the dataframe, collect the days, delete the corresponding directories using the Hadoop FileUtils, and then write the dataframe. On Fri, Sep 29, 2017 at 10:31 AM, peay wrote: > Hello, > > I am trying to use
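A sketch of this approach, assuming a SparkSession spark and an input dataframe df (outputPath and the day column are illustrative); checkpointing requires spark.sparkContext.setCheckpointDir to have been called:

    import org.apache.hadoop.fs.{FileSystem, Path}
    import spark.implicits._

    val outputPath = "dataset.parquet"  // assumed output location

    // Materialize first, so collecting the days doesn't recompute the plan
    // after the directories below have been deleted.
    val checkpointed = df.checkpoint()
    val days = checkpointed.select("day").distinct().as[String].collect()

    // Delete the matching partition directories before appending.
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    days.foreach(day => fs.delete(new Path(s"$outputPath/day=$day"), true))

    // Append leaves all other day= directories untouched.
    checkpointed.write.mode("append").partitionBy("day").parquet(outputPath)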

Saving dataframes with partitionBy: append partitions, overwrite within each

2017-09-29 Thread peay
Hello, I am trying to use data_frame.write.partitionBy("day").save("dataset.parquet") to write a dataset while splitting by day. I would like to run a Spark job to process, e.g., a month: dataset.parquet/day=2017-01-01/... ... and then run another Spark job to add another month using the same
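For context, a sketch of the two-job workflow described above and where the built-in save modes fall short (the dataframes are illustrative):

    // Job 1: write January.
    janDf.write.partitionBy("day").mode("overwrite").save("dataset.parquet")

    // Job 2: add February. "append" creates the new day= directories, but if a
    // day is ever re-run, its old files remain next to the new ones (duplicates).
    // "overwrite" would instead delete January entirely before writing.
    febDf.write.partitionBy("day").mode("append").save("dataset.parquet")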