Re: parallel processing with JDBC

Ashok Kumar Sun, 14 Aug 2016 13:44:48 -0700

Thank you very much, sir.
I forgot to mention that two of these Oracle tables are range partitioned. In 
that case, what would be the optimum number of partitions, if you can share?
Warmest


    On Sunday, 14 August 2016, 21:37, Mich Talebzadeh 
<mich.talebza...@gmail.com> wrote:
 

 If you have primary keys on these tables, then you can parallelise the process 
of reading the data.
Be careful not to set the number of partitions too high. There is a balance 
between the number of partitions supplied to JDBC and the load on the network 
and the source DB.
Assuming that your underlying table has a primary key ID, this will open 
20 parallel connections to the Oracle DB:
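 // HiveContext is assumed here to be an existing HiveContext instance.
 // "maxID" is a placeholder: substitute the actual maximum ID, as a string.
 val d = HiveContext.read.format("jdbc").options(
   Map("url" -> _ORACLEserver,
   "dbtable" -> "(SELECT <COL1>, <COL2>, .... FROM <TABLE>)",
   "partitionColumn" -> "ID",   // column used to split the read
   "lowerBound" -> "1",
   "upperBound" -> "maxID",
   "numPartitions" -> "20",     // number of parallel JDBC connections
   "user" -> _username,
   "password" -> _password)).load
assuming your upper bound on ID is maxID. ID must also be among the selected 
columns.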

This will open multiple connections to the RDBMS, each fetching a subset of the 
data that you want.
You need to test it to find the optimum numPartitions and to make sure you 
don't overload any component.
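
To make "each getting a subset" concrete, here is a rough sketch of how the ID range is cut into slices. It is plain Scala, needs no Spark, and only approximates what Spark does internally with partitionColumn, lowerBound, upperBound and numPartitions. The object name JdbcStrideSketch, the function partitionPredicates and the bounds 1 to 100 are made up for illustration.

```scala
// Sketch only: approximates the per-connection WHERE clauses that Spark
// derives from partitionColumn / lowerBound / upperBound / numPartitions.
// JdbcStrideSketch and partitionPredicates are hypothetical names.
object JdbcStrideSketch {
  def partitionPredicates(column: String, lowerBound: Long, upperBound: Long,
                          numPartitions: Int): Seq[String] = {
    require(numPartitions >= 2, "this sketch covers the parallel case only")
    require(lowerBound < upperBound, "lowerBound must be below upperBound")
    // Width of each slice of the ID range (integer division).
    val stride = upperBound / numPartitions - lowerBound / numPartitions
    (0 until numPartitions).map { i =>
      val from = lowerBound + i * stride
      val to = from + stride
      if (i == 0) s"$column < $to OR $column IS NULL"   // first slice is open below
      else if (i == numPartitions - 1) s"$column >= $from" // last slice is open above
      else s"$column >= $from AND $column < $to"
    }
  }

  def main(args: Array[String]): Unit =
    // Illustrative bounds only; use your real minimum and maximum ID.
    partitionPredicates("ID", 1L, 100L, 4).foreach(println)
}
```

Because the first and last slices are open-ended, the bounds do not filter any rows out; they only decide where the cuts fall. If the IDs are unevenly spread, or the bounds are far from the real minimum and maximum, some connections will carry much more data than others.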
HTH

Dr Mich Talebzadeh LinkedIn  
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 http://talebzadehmich.wordpress.com
Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from relying 
on this email's technical content is explicitly disclaimed. The author will in 
no case be liable for any monetary damages arising from such loss, damage or 
destruction.
On 14 August 2016 at 21:15, Ashok Kumar <ashok34...@yahoo.com.invalid> wrote:

Hi,
There are 4 tables ranging from 10 million to 100 million rows, and they all 
have primary keys.
The network is fine, but our Oracle is RAC and we can only connect to a 
designated Oracle node (where we have a DQ account only).
We have a limited time window of a few hours to get the required data out.
Thanks 

    On Sunday, 14 August 2016, 21:07, Mich Talebzadeh 
<mich.talebza...@gmail.com> wrote:
 

 How big are your tables, and are there any problems with the network between 
your Spark nodes and your Oracle DB?
HTH
On 14 August 2016 at 20:50, Ashok Kumar <ashok34...@yahoo.com.invalid> wrote:

Hi Gurus,
I have a few large tables in an RDBMS (ours is Oracle). We want to access these 
tables through Spark JDBC.
What is the quickest way of getting the data into a Spark DataFrame, say with 
multiple connections from Spark?
Thanking you
