Re: jdbcRDD for data ingestion from RDBMS

2016-10-18 Thread Mich Talebzadeh
Hi, If we are talking about billions of records and depending on your network and RDBMs with parallel connections, from my experience it works OK for Dimension tables of moderate size, in that you can have parallel connections to RDBMS (assuming the RDBMS has a primary key/unique column) to

Re: jdbcRDD for data ingestion from RDBMS

2016-10-18 Thread Teng Qiu
Hi Ninad, i believe the purpose of jdbcRDD is to use RDBMS as an addtional data source during the data processing, main goal of spark is still analyzing data from HDFS-like file system. to use spark as a data integration tool to transfer billions of records from RDBMS to HDFS etc. could work, but

Fwd: jdbcRDD for data ingestion from RDBMS

2016-10-17 Thread Ninad Shringarpure
Hi Team, One of my client teams is trying to see if they can use Spark to source data from RDBMS instead of Sqoop. Data would be substantially large in the order of billions of records. I am not sure reading the documentations whether jdbcRDD by design is going to be able to scale well for this