Hello,
Is there any API to insert data into a single partition of a table?
Let's say I have a table with 2 columns (col_a, col_b), partitioned by date.
After doing some computation for a specific date, I have a DataFrame with 2
columns (col_a, col_b) which I would like to insert into a specific
partition.
I don't think there's an API for that, but it sounds reasonable and would be
helpful for ETL.
As a workaround, you can first register your DataFrame as a temp table, then
use SQL to insert into the static partition.
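A minimal sketch of that workaround, assuming a HiveContext and a target Hive
table named my_table partitioned by date (both names are illustrative):

// Expose the DataFrame to SQL, then insert into one static partition.
df.registerTempTable("tmp_result")
sqlContext.sql(
  """INSERT INTO TABLE my_table PARTITION (date = '2015-12-01')
    |SELECT col_a, col_b FROM tmp_result""".stripMargin)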
On Wed, Dec 2, 2015 at 10:50 AM, Isabelle Phan wrote:
> Hello,
>
> Is there any API to insert data into a single partition of a table?
Hi,
you can try this:
if your table is under location "/test/table/" on HDFS
and has partitions:
"/test/table/dt=2012"
"/test/table/dt=2013"

df.write.mode(SaveMode.Append).partitionBy("dt").save("/test/table")
> On Dec 2, 2015, at 10:50 AM, Isabelle Phan wrote:
>
> df.write.partitionBy("date").insertInto(...)
As long as all your data is being inserted by Spark, and hence uses the same
hash partitioner, what Fengdong mentioned should work.
On Wed, Dec 2, 2015 at 9:32 AM, Fengdong Yu wrote:
> Hi,
> you can try this:
> if your table is under location "/test/table/" on HDFS
> and has partitions:
>
> "/test/table/dt=2012"
> "/test/table/dt=2013"
>
> df.write.mode(SaveMode.Append).partitionBy("dt").save("/test/table")
You might also coalesce to 1 (or some small number) before writing, to avoid
creating a lot of files in that partition, if you know there is not a ton of
data.
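For example, building on Fengdong's snippet (the partition count of 4 is
illustrative):

import org.apache.spark.sql.SaveMode

// Collapse the DataFrame to a few partitions so the write produces a few
// files per date directory instead of one per task.
df.coalesce(4)
  .write.mode(SaveMode.Append)
  .partitionBy("dt")
  .save("/test/table")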
On Wed, Dec 2, 2015 at 12:59 AM, Rishi Mishra wrote:
> As long as all your data is being inserted by Spark, and hence uses the same
> hash partitioner, what Fengdong mentioned should work.
Thanks all for your reply!
I tested both approaches: registering the temp table then executing SQL vs.
saving to the HDFS file path directly. The problem with the second approach is
that I am inserting data into a Hive table, so if I create a new partition
with this method, the Hive metadata is not updated.
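For reference, the metastore can be told about such a directory explicitly,
assuming a HiveContext (table name and partition value are illustrative):

// Register the new partition directory with the Hive metastore.
sqlContext.sql(
  "ALTER TABLE my_table ADD IF NOT EXISTS PARTITION (date = '2015-12-01')")
// Or discover every partition directory on disk in one pass.
sqlContext.sql("MSCK REPAIR TABLE my_table")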
>
> Follow-up question in this case: what is the cost of registering a temp
> table? Is there a limit to the number of temp tables that can be registered
> by a Spark context?
>
It is pretty cheap: just an entry in an in-memory hashtable mapping the name
to a query plan (similar to a view).
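In other words (a minimal sketch; the table name is illustrative):

// Registering only records a name -> logical plan entry in the catalog;
// nothing is computed or copied.
df.registerTempTable("t1")
// Resolving the name later just looks that plan up again.
val same = sqlContext.table("t1")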