If you’re using the DataFrame API, you can achieve that by simply using (or not
using) the "partitionBy" method on the DataFrameWriter:

val originalDf = ...

val df1 = originalDf...
val df2 = originalDf...

df1.write.partitionBy("col1").save(...)

df2.write.save(...)
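
For completeness, a minimal end-to-end sketch of that approach, put together
under some assumptions: a SparkContext named sc, a hypothetical source table,
placeholder filter expressions, and saveAsTable in place of save(path) since
the targets here are Hive tables:

import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

// Build and cache the expensive chain once, on a single HiveContext.
val originalDf = hiveContext.table("source_table").cache()  // hypothetical source

// Two cheap derived DataFrames (placeholder transformations).
val df1 = originalDf.filter("col2 = 'a'")
val df2 = originalDf.filter("col2 = 'b'")

// The partitioning behaviour travels with each DataFrameWriter, so neither
// future has to mutate shared session config with a SET command.
val f1 = Future { df1.write.partitionBy("col1").saveAsTable("t1_partitioned") }
val f2 = Future { df2.write.saveAsTable("t2_unpartitioned") }

Await.result(Future.sequence(Seq(f1, f2)), Duration.Inf)

Because both writers derive from the same cached originalDf, the long chain of
transformations is computed only once, and there is no shared mutable flag for
the two threads to race on.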

From: Amir Gershman <am...@fb.com>
Date: Tuesday, May 24, 2016 at 7:01 AM
To: "user@spark.apache.org" <user@spark.apache.org>
Subject: Using HiveContext.set in multiple threads

Hi,

I have a DataFrame that I compute from a long chain of transformations.
I cache it, and then perform two additional transformations on it.
I use two Futures - each Future inserts the content of one of the two resulting
DataFrames into a different Hive table.
One Future must SET hive.exec.dynamic.partition=true and the other must set it
to false.
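
Roughly, the pattern is the following (the table names, partition column, and
SELECT source are placeholders):

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

// Both futures share one HiveContext, so the SETs race with the INSERTs.
val f1 = Future {
  hiveContext.sql("SET hive.exec.dynamic.partition=true")
  hiveContext.sql("INSERT INTO TABLE t1 PARTITION (dt) SELECT * FROM cached_df_view")
}
val f2 = Future {
  hiveContext.sql("SET hive.exec.dynamic.partition=false")
  hiveContext.sql("INSERT INTO TABLE t2 PARTITION (dt='2016-01-01') SELECT * FROM cached_df_view")
}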

How can I run both INSERT commands in parallel, but guarantee each runs with 
its own settings?

If I don't use the same HiveContext, then the initial long chain of
transformations that I cache is not reusable between HiveContexts. If I use
the same HiveContext, race conditions between threads may cause one INSERT to
execute with the wrong config.
