Hi folks,

I have a Spark job that reads a CSV file into a DataFrame. I register that DataFrame as a temp table, then write it to a Hive external table (stored as Parquet) with a command like this:

hiveContext.sql("INSERT INTO TABLE t PARTITION(statPart='string_value', dynPart) SELECT * FROM tempTable");

With this approach, each CSV line should produce exactly one Parquet record, so the total line count of the CSV files must equal the row count of the resulting Parquet dataset.

I launch 20 of these jobs in parallel (to take advantage of idle resources). Sometimes the Parquet count comes out randomly slightly larger than the CSV count (the difference usually involves one dynamic partition and one CSV file that had already been integrated). If I run the same jobs sequentially, one after the other, the counts always match.

Does anyone have an idea what causes this count difference? To me it seems clear that the parallel execution triggers the issue, and I strongly suspect it happens when data is moved from the staging directory (hive.exec.stagingdir prefix) to the final Hive table location on HDFS.

Thanks in advance.
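For context, here is a minimal sketch of what one of these jobs could look like. It is not the poster's actual code: the paths, table names, and the spark-csv reader are assumptions, and it presumes a Spark 1.x HiveContext with dynamic partitioning enabled. It also shows one workaround idea consistent with the suspicion above: giving each parallel job its own staging directory so the final move into the table location cannot collide across jobs.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Sketch of one of the 20 parallel CSV-to-Parquet jobs.
// Paths, table names, and options are hypothetical.
object CsvToHiveParquet {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("csv-to-hive"))
    val hiveContext = new HiveContext(sc)

    // Dynamic partition insert needs nonstrict mode when only
    // dynamic columns vary.
    hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

    // Workaround idea (untested): give each job a unique staging
    // directory so concurrent jobs never move files from the same
    // staging location into the final table directory.
    hiveContext.setConf(
      "hive.exec.stagingdir",
      s".hive-staging-${java.util.UUID.randomUUID()}")

    // Read the CSV into a DataFrame (spark-csv package syntax).
    val df = hiveContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("/data/input/file.csv") // hypothetical path

    // Register it as a temp table, then insert into the
    // statically + dynamically partitioned Hive table.
    df.registerTempTable("tempTable")
    hiveContext.sql(
      "INSERT INTO TABLE t PARTITION(statPart='string_value', dynPart) " +
        "SELECT * FROM tempTable")

    sc.stop()
  }
}
```

Running this requires a Spark cluster with Hive support, so treat it purely as an illustration of the workflow described, not a verified fix.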
I have a spark job reading a csv file into a dataframe. I register that dataframe as a tempTable then I’m writing that dataframe/tempTable to hive external table (using parquet format for storage) I’m using this kind of command : hiveContext.sql(*"INSERT INTO TABLE t PARTITION(statPart='string_value', dynPart) SELECT * FROM tempTable"*); Through this integration, for each csv line I will get a parquet line/record. So if I count the csv files lines total number it must equals the count of the parquet dataset produced. I launch in parallel 20 of these jobs (to take advantage of idle resources). Sometimes I get parquet count randomly slightly bigger than csv count (mainly the difference concern one dynamic partition and one csv file that has been integrated) but if I launch these job sequentially one after the other I never get the problem of the different count. Does anyone have any idea about the cause of this problem (different count). For me it is obvious that the parallel execution is causing the issue and strongly believe that it happens when moving data from hive.exec.stagingdir.prefix dir to the hive final table location on hdfs Thanks in advance.