Hello Everyone,

My goal is to use Spark SQL to load a huge amount of data from Oracle into HDFS.

*Table in Oracle:*
1) No primary key.
2) 404 columns.
3) 200,800,000 rows.

*Spark SQL:*
In Spark SQL I want to read the data into n partitions in parallel, which
requires supplying 'partitionColumn', 'lowerBound', 'upperBound', and
'numPartitions' based on a column in the Oracle table. The table has no
column that satisfies this need; every candidate is highly skewed. As a
result, when numPartitions is set to 104, 102 tasks finish within a minute,
one task finishes in about 20 minutes, and the last one takes forever.
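For reference, this is roughly how I am reading today (a minimal sketch; the
connection URL, credentials, table name, partition column, and output path
are placeholders, not the real values):

import org.apache.spark.sql.SparkSession

object OracleToHdfsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("OracleToHdfs").getOrCreate()

    // Partitioned JDBC read: 104 range partitions over the (skewed) column.
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")  // placeholder URL
      .option("dbtable", "MY_SCHEMA.MY_BIG_TABLE")            // placeholder table
      .option("user", "db_user")                              // placeholder credentials
      .option("password", "db_password")
      .option("driver", "oracle.jdbc.OracleDriver")
      .option("partitionColumn", "SOME_NUMERIC_COL")          // the skewed column
      .option("lowerBound", "0")
      .option("upperBound", "200800000")
      .option("numPartitions", "104")
      .load()

    // Land the data in HDFS as Parquet (path is a placeholder).
    df.write.parquet("hdfs:///data/my_big_table")

    spark.stop()
  }
}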

Is there anything I can do to distribute the data evenly across partitions?
Can we supply a synthetic boundary query to drive the pull, the way Sqoop
allows with '--boundary-query "SELECT CAST(0 AS NUMBER) AS MIN_MOD_VAL,
CAST(12 AS NUMBER) AS MAX_MOD_VAL FROM DUAL"'?
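Roughly, what I have in mind is to manufacture an evenly distributed
partition key inside the dbtable subquery, for example with Oracle's
ORA_HASH, and split on that instead of a natural column. A sketch of the
idea, reusing the SparkSession from the snippet above (all names are
placeholders and I have not verified this against our table):

// Subquery adds a synthetic, evenly distributed bucket column
// (values 0..103) so the JDBC reader has something sensible to split on.
val bucketedQuery =
  """(SELECT t.*, ORA_HASH(t.ROWID, 103) AS PART_KEY
    |   FROM MY_SCHEMA.MY_BIG_TABLE t) q""".stripMargin

val evenDf = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")
  .option("dbtable", bucketedQuery)
  .option("user", "db_user")
  .option("password", "db_password")
  .option("driver", "oracle.jdbc.OracleDriver")
  .option("partitionColumn", "PART_KEY")   // synthetic bucket column
  .option("lowerBound", "0")
  .option("upperBound", "104")             // 104 buckets: 0..103
  .option("numPartitions", "104")
  .load()

Each of the 104 partitions would then fetch exactly one hash bucket, so the
split no longer depends on any natural column, but I don't know whether
pushing ORA_HASH down like this is advisable for a table this wide.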

Any pointers are appreciated.

Thanks for your time.

~ Ajay
