sqoop splitting

Brotanek, Jan Wed, 22 Feb 2017 06:51:52 -0800

Hello, I am transferring incremental data from Oracle table using sqoop 1.4.2. 
direct using following command:


sqoop import --libjars /usr/hdp/2.4.0.0-169/sqoop/lib/ojdbc6.jar --direct 
--connect jdbc:oracle:thin:@//xxx/xxx --username xxx --password xxx --table 
A.A1 --as-textfile --optionally-enclosed-by '"' --check-column DAY  
--incremental append --last-value "DATE'2016-09-30'" --split-by DAY 
--warehouse-dir /tmp/benchmarks/ -m4 -z;



17/01/21 14:36:20 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT 
MIN(DAY), MAX(DAY) FROM A.A1 WHERE ( DAY > DATE'2016-09-30' AND DAY <= 
TO_TIMESTAMP('2098-01-21 13:09:16.0', 'YYYY-MM-DD HH24:MI:SS.FF') )

In Oracle database it looks like this:
... WHERE ( DAY > DATE'2016-09-30' AND DAY <= TO_TIMESTAMP('2098-01-21 
13:09:16.0', 'YYYY-MM-DD HH24:MI:SS.FF') ) AND ( DAY >= 
TO_TIMESTAMP('2016-09-30 00:00:01.0', 'YYYY-MM-DD HH24:MI:SS.FF') ) AND ( DAY < 
TO_TIMESTAMP('2037-01-27 14:32:19.75', 'YYYY-MM-DD HH24:MI:SS.FF') )

As far as I understand, sqoop determine max and min value, then it divides 
interval by number of mappers, then starts to select from 1st interval ... 4th 
interval and retrieves data to HDFS

A1 Table has some dummy year 2098 records which are used by other systems and I 
cannot remove them from data.
This will split table to 4 mappers where 3 mappers have few data and 4th one is 
overloaded, resulting in having 4files from which only one contains most of 
table's data, which causes problems in transforming data to ORC tables.

Desired output would be having 4files split evenly. Is there any approach I can 
achieve it?

Thanks
Jan

sqoop splitting

Reply via email to