Hi,

I have a table in Hive with the below schema:
emp_id:int
emp_name:string

I have created a data frame from the above Hive table:

df = sql_context.sql('SELECT * FROM employee ORDER by emp_id')
df.show()

After the above code is run, I see that the data is sorted properly on emp_id.

After this I am trying to write the data to an Oracle table with the below code:
df.write.jdbc(url=url, table='target_table', properties=properties,
              mode='overwrite')

When I look at the Oracle table, I see that the ordering is not preserved and
the data is populated in random order.

As per my understanding, this is happening because multiple executor
processes run at the same time, one per data partition; the sort applied
through the query holds within each partition, but when multiple processes
write data to Oracle concurrently, the ordering of the resulting table is
distorted (see the illustration below).
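
For illustration, the parallelism behind the write can be seen with something
like this (using the same df as above; the partition count shown is
hypothetical and depends on cluster and shuffle settings):

# Each partition is inserted by its own JDBC task, so even if every
# partition is internally sorted, inserts from different tasks can
# interleave in the Oracle table.
print(df.rdd.getNumPartitions())  # e.g. 200 with default shuffle partitions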

I further tried repartitioning the data down to just one partition (which is
not an ideal solution), and after writing the data to Oracle the sorting
worked properly (roughly as sketched below).
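
This is roughly what the single-partition workaround looked like (a sketch
only, with the same url and properties as above; sortWithinPartitions is
added here just to make the ordering of the single partition explicit):

# One partition means one writer task, so rows are inserted in emp_id order,
# but it also removes all write parallelism.
df_single = df.repartition(1).sortWithinPartitions('emp_id')
df_single.write.jdbc(url=url, table='target_table',
                     properties=properties, mode='overwrite')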

Is there any way to write sorted data to an RDBMS from Spark?

Thanks,
Abhijeet
