Dear Spark users,

 

I am trying to figure out whether Spark is a good tool for my use case.

I'm trying to ETL a subset of a customers/orders database from Oracle to
JSON, roughly 3-5% of the overall customers table.

 

I tried to use the Spark JDBC data source, but it ends up fetching the
entire customers and orders tables to a single executor. I read about the
partitionColumn, lowerBound and upperBound options. Could they be used to
distribute the load across a set of executors while also filtering out, at
the source, customers that are not part of my subset?
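
Here is a rough sketch of what I had in mind for the JDBC route (the URL,
credentials, bounds, and the table and column names such as CUSTOMER_ID and
CUSTOMER_SEGMENT are all placeholders I made up for the example):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("oracle-customers-export")
  .getOrCreate()

// Pushing the subset filter into the "dbtable" subquery should make Oracle
// return only the 3-5% of customers, while partitionColumn/lowerBound/
// upperBound/numPartitions should split that read across executors by
// CUSTOMER_ID range.
val customers = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")   // placeholder
  .option("driver", "oracle.jdbc.OracleDriver")
  .option("user", "etl_user")                              // placeholder
  .option("password", sys.env("ORACLE_PWD"))
  .option("dbtable",
    "(SELECT * FROM CUSTOMERS WHERE CUSTOMER_SEGMENT = 'EXPORT') c")
  .option("partitionColumn", "CUSTOMER_ID")
  .option("lowerBound", "1")
  .option("upperBound", "10000000")
  .option("numPartitions", "16")
  .load()

customers.write.json("/data/export/customers")

Is this the intended way to combine the partitioning options with filtering
at the source?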

 

Or would it be better to parallelize the subset of customer IDs to export
and have a map operation that queries the Oracle database to turn each
customer ID into a JSON object containing customer and order details?
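
Roughly like this (again, the connection details, queries and column names
are placeholders, and I'm assuming the list of customer IDs fits in driver
memory and that one JDBC connection per partition is acceptable):

import java.sql.DriverManager

// Placeholder: however the 3-5% subset of customer IDs is obtained.
val customerIds: Seq[Long] = Seq(101L, 102L, 103L)

val jsonRdd = spark.sparkContext
  .parallelize(customerIds, numSlices = 16)
  .mapPartitions { ids =>
    // One Oracle connection per partition, reused for every ID in it.
    val conn = DriverManager.getConnection(
      "jdbc:oracle:thin:@//dbhost:1521/ORCL", "etl_user", sys.env("ORACLE_PWD"))
    val stmt = conn.prepareStatement(
      "SELECT ORDER_ID, AMOUNT FROM ORDERS WHERE CUSTOMER_ID = ?")
    val results = ids.map { id =>
      stmt.setLong(1, id)
      val rs = stmt.executeQuery()
      val orders = scala.collection.mutable.ListBuffer[String]()
      while (rs.next()) {
        orders += s"""{"orderId":${rs.getLong("ORDER_ID")},"amount":${rs.getBigDecimal("AMOUNT")}}"""
      }
      rs.close()
      // In a real job I'd build the JSON with a proper library instead of
      // string interpolation; this is just to show the shape.
      s"""{"customerId":$id,"orders":${orders.mkString("[", ",", "]")}}"""
    }.toList            // materialize before closing the connection
    stmt.close()
    conn.close()
    results.iterator
  }

jsonRdd.saveAsTextFile("/data/export/customers_json")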

 

Or is Spark not suitable for this kind of process?

 

I'm just asking for guidance so I don't lose too much time going in the
wrong direction. Thanks for your help!

 

Best,

 

Patrick

 
