> val d = HiveContext.read.format("jdbc").options(
...
>> The sqoop job takes 7 hours to load 15 days of data, even while setting
>>the direct load option to 6. Hive is using MR framework.

In general, JDBC implementations tend to react rather badly to large
extracts like this - the throttling usually happens on the operational
database end rather than being a problem on the MR side.
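
That said, if you are stuck on the JDBC pipe, the one knob that helps is
partitioning the read across several connections. A rough sketch - the
URL, table, column and bounds below are all made up:

val d = sqlContext.read.format("jdbc").options(Map(
  "url"             -> "jdbc:mysql://db-host:3306/salesdb", // hypothetical
  "dbtable"         -> "transactions",
  "partitionColumn" -> "txn_id",    // needs a roughly uniform numeric column
  "lowerBound"      -> "0",
  "upperBound"      -> "100000000",
  "numPartitions"   -> "32"         // 32 concurrent JDBC connections
)).load()

Even with 32-way parallelism, the operational database will usually
throttle you long before the MR side runs out of headroom.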


Sqoop is good enough for a one-shot import, but frequent imports are
better done with the database's own dump protocols, which are generally
not throttled the same way.
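
The dump route is essentially a bulk export on the database side followed
by a copy into HDFS - for MySQL, something along these lines (the paths,
schema and table names below are all invented):

import scala.sys.process._

// mysqldump --tab writes one .sql (schema) and one .txt (TSV data) file
// per table on the database host; SELECT ... INTO OUTFILE is the same idea.
val dumpDir = "/var/lib/mysql-files"   // the server's secure_file_priv dir
val rc = Seq("mysqldump", "--tab=" + dumpDir,
             "--single-transaction", "salesdb", "transactions").!

// Ship the TSV into HDFS where a Hive external table can read it directly.
if (rc == 0)
  Seq("hdfs", "dfs", "-put", "-f",
      s"$dumpDir/transactions.txt", "/warehouse/transactions/").!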

Pinterest recently put out a document on how they do this

https://engineering.pinterest.com/blog/tracker-ingesting-mysql-data-scale-part-1

+
https://engineering.pinterest.com/blog/tracker-ingesting-mysql-data-scale-part-2

More interesting continuous ingestion reads directly off the replication
protocol's write-ahead logs - there's a rough sketch of the idea after
these links.

https://github.com/Flipkart/MySQL-replication-listener/tree/master/examples/mysql2hdfs

+
https://github.com/flipkart-incubator/storm-mysql
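
To see what that looks like in code, the mysql-binlog-connector-java
library speaks the replication protocol directly; a minimal sketch (the
host and credentials are placeholders, and the server has to be running
row-based binlogs):

import com.github.shyiko.mysql.binlog.BinaryLogClient
import com.github.shyiko.mysql.binlog.event.{Event, EventData, WriteRowsEventData}

object BinlogTail {
  def main(args: Array[String]): Unit = {
    // Connects as a replication slave and streams binlog events as they
    // are written - no dump, no throttled bulk SELECT.
    val client = new BinaryLogClient("db-host", 3306, "repl_user", "secret")
    client.registerEventListener(new BinaryLogClient.EventListener {
      override def onEvent(event: Event): Unit = {
        val data: EventData = event.getData()
        data match {
          case rows: WriteRowsEventData =>
            // A real pipeline would append these to HDFS or a queue here.
            println(s"insert: ${rows.getRows}")
          case _ => // rotates, heartbeats, schema changes, etc.
        }
      }
    })
    client.connect() // blocks, tailing the binlog
  }
}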


But all of these tend to be optimized for a single database engine, while
the JDBC pipe is uniformly slow across all of them.

Cheers,
Gopal

