Hi,
If we are talking about billions of records, it depends on your network
and RDBMS. From my experience it works OK for dimension tables of
moderate size, in that you can open parallel connections to the RDBMS
(assuming the table has a primary key/unique column) to
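The parallel-connection approach mentioned above relies on splitting the key range of that primary key/unique column into one slice per connection. Below is a minimal sketch, in plain Scala, of the stride arithmetic JdbcRDD uses to derive per-partition bounds from `lowerBound`, `upperBound`, and `numPartitions`; the object and method names here are hypothetical, not Spark API.

```scala
// Sketch of how a numeric key range is split into per-partition
// (start, end) bounds, JdbcRDD-style. Hypothetical helper names.
object JdbcBounds {
  // Returns inclusive (start, end) key bounds for each partition,
  // dividing [lowerBound, upperBound] as evenly as possible.
  // BigInt avoids overflow when upperBound - lowerBound is huge.
  def partitionBounds(lowerBound: Long, upperBound: Long,
                      numPartitions: Int): Seq[(Long, Long)] = {
    val length = BigInt(1) + upperBound - lowerBound
    (0 until numPartitions).map { i =>
      val start = lowerBound + (BigInt(i) * length / numPartitions)
      val end   = lowerBound + (BigInt(i + 1) * length / numPartitions) - 1
      (start.toLong, end.toLong)
    }
  }
}
```

Each partition then runs its own bounded query over its own JDBC connection, which is where the parallelism comes from; a skewed or sparse key column makes the slices uneven, which is one reason this works better for moderate-size dimension tables than for arbitrary billion-row tables.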
Hi Ninad, I believe the purpose of JdbcRDD is to use an RDBMS as an
additional data source during data processing; the main goal of Spark is
still analyzing data from HDFS-like file systems.
Using Spark as a data integration tool to transfer billions of records
from an RDBMS to HDFS etc. could work, but
Hi Team,
One of my client teams is trying to see if they can use Spark to source
data from an RDBMS instead of Sqoop. The data would be substantially
large, on the order of billions of records.
Reading the documentation, I am not sure whether JdbcRDD is, by design,
going to be able to scale well for this.
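For context on what the documentation describes: JdbcRDD expects the query you pass it to contain two `?` placeholders, which it binds to each partition's lower and upper key bound, so every partition reads only its own slice of the table. The sketch below illustrates that contract by substituting bounds into the query text directly; the real JdbcRDD binds them through a JDBC PreparedStatement, and the helper name and table here are made up for illustration.

```scala
// Sketch: JdbcRDD requires a query with two '?' placeholders that it
// binds to a partition's lower/upper key bound. This hypothetical
// helper shows the effective SQL one parallel connection would run.
object BoundQuery {
  def forPartition(template: String, start: Long, end: Long): String =
    template.replaceFirst("\\?", start.toString)
            .replaceFirst("\\?", end.toString)
}
```

Whether this scales to billions of rows then comes down to how evenly the key range divides across partitions and how many concurrent connections the source RDBMS can tolerate, which is where a comparison with Sqoop (which parallelizes on a split column in much the same way) becomes an operational question rather than an API one.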