Hi folks,

I have a Spark job that reads a CSV file into a DataFrame. I register that DataFrame as a temp table, then write it to a Hive external table (stored as Parquet) with a command like this:

hiveContext.sql("INSERT INTO TABLE t PARTITION(statPart='string_value', dynPart) SELECT * FROM tempTable");

With this approach, each CSV line should produce exactly one Parquet record, so the total line count of the CSV files must equal the row count of the resulting Parquet dataset.

I launch 20 of these jobs in parallel (to take advantage of idle resources). Sometimes the Parquet count comes out randomly slightly larger than the CSV count (the difference usually involves one dynamic partition and one CSV file that had already been integrated). If I run the same jobs sequentially, one after the other, the counts always match.

Does anyone have an idea what causes this count difference? To me it seems clear that the parallel execution triggers the issue, and I strongly suspect it happens when data is moved from the staging directory (hive.exec.stagingdir prefix) to the final Hive table location on HDFS.

Thanks in advance.
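For context, here is a minimal sketch of what one of these jobs could look like. It is not the poster's actual code: the paths, table names, and the spark-csv reader are assumptions, and it presumes a Spark 1.x HiveContext with dynamic partitioning enabled. It also shows one workaround idea consistent with the suspicion above: giving each parallel job its own staging directory so the final move into the table location cannot collide across jobs.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Sketch of one of the 20 parallel CSV-to-Parquet jobs.
// Paths, table names, and options are hypothetical.
object CsvToHiveParquet {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("csv-to-hive"))
    val hiveContext = new HiveContext(sc)

    // Dynamic partition insert needs nonstrict mode when only
    // dynamic columns vary.
    hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

    // Workaround idea (untested): give each job a unique staging
    // directory so concurrent jobs never move files from the same
    // staging location into the final table directory.
    hiveContext.setConf(
      "hive.exec.stagingdir",
      s".hive-staging-${java.util.UUID.randomUUID()}")

    // Read the CSV into a DataFrame (spark-csv package syntax).
    val df = hiveContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("/data/input/file.csv") // hypothetical path

    // Register it as a temp table, then insert into the
    // statically + dynamically partitioned Hive table.
    df.registerTempTable("tempTable")
    hiveContext.sql(
      "INSERT INTO TABLE t PARTITION(statPart='string_value', dynPart) " +
        "SELECT * FROM tempTable")

    sc.stop()
  }
}
```

Running this requires a Spark cluster with Hive support, so treat it purely as an illustration of the workflow described, not a verified fix.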
I have a spark job reading a csv file into a dataframe. I register that dataframe as a tempTable then I’m writing that dataframe/tempTable to hive external table (using parquet format for storage) I’m using this kind of command : hiveContext.sql(*"INSERT INTO TABLE t PARTITION(statPart='string_value', dynPart) SELECT * FROM tempTable"*); Through this integration, for each csv line I will get a parquet line/record. So if I count the csv files lines total number it must equals the count of the parquet dataset produced. I launch in parallel 20 of these jobs (to take advantage of idle resources). Sometimes I get parquet count randomly slightly bigger than csv count (mainly the difference concern one dynamic partition and one csv file that has been integrated) but if I launch these job sequentially one after the other I never get the problem of the different count. Does anyone have any idea about the cause of this problem (different count). For me it is obvious that the parallel execution is causing the issue and strongly believe that it happens when moving data from hive.exec.stagingdir.prefix dir to the hive final table location on hdfs Thanks in advance.