> val d = HiveContext.read.format("jdbc").options( ...
>
>> The sqoop job takes 7 hours to load 15 days of data, even while setting
>> the direct load option to 6. Hive is using MR framework.
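For what it's worth, the read in the quoted snippet is a single-partition JDBC pull, so one task drags the whole table through one connection. Before switching tools entirely, partitioning the read sometimes helps; a sketch against Spark 1.x, where the URL, table name, and the numeric partition column are all placeholders to substitute:

```scala
// Hypothetical connection details -- replace url/dbtable/partitionColumn
// with your own. The partition options tell Spark to split the extract
// into parallel range scans instead of one giant query.
val d = HiveContext.read.format("jdbc").options(Map(
  "url"             -> "jdbc:mysql://dbhost:3306/mydb",
  "dbtable"         -> "events",
  "partitionColumn" -> "event_id",   // must be a numeric column
  "lowerBound"      -> "0",
  "upperBound"      -> "100000000",
  "numPartitions"   -> "32"          // 32 concurrent range scans
)).load()
```

Even partitioned, though, the bottleneck for an extract this size is usually the database side, which is the point below.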
In general, JDBC implementations tend to react rather badly to large extracts like this - the throttling usually happens on the operational database end rather than being a problem on the MR side. Sqoop is good enough for a one-shot import, but frequent imports are best done with the database's own dump protocols, which are generally not throttled in the same way. Pinterest recently put out a write-up on how they do this:

https://engineering.pinterest.com/blog/tracker-ingesting-mysql-data-scale-part-1
https://engineering.pinterest.com/blog/tracker-ingesting-mysql-data-scale-part-2

More interesting, continuous ingestion reads directly off the replication protocol's write-ahead logs:

https://github.com/Flipkart/MySQL-replication-listener/tree/master/examples/mysql2hdfs
https://github.com/flipkart-incubator/storm-mysql

But all of these tend to be optimized for a particular database engine, while the JDBC pipe tends to work slowly for all engines.

Cheers,
Gopal