Suresh Venkatesan created SQOOP-3240:
----------------------------------------
Summary: Hive/Hadoop table Row count mismatch(more records in
hadoop/hive than oracle) when importing data from oracle using sqoop import
direct mode
Key: SQOOP-3240
URL: https://issues.apache.org/jira/browse/SQOOP-3240
Project: Sqoop
Issue Type: Bug
Components: connectors/oracle
Affects Versions: 1.4.6
Environment: Sqoop 1.4.6-cdh5.8.2
Oracle 11i
oracle table Sqoop import with direct mode
Reporter: Suresh Venkatesan
Priority: Minor
Hive/Hadoop row count was not matching on sqoop import of oracle table with
direct mode, it's perfectly matching without direct mode.
See the below examples.
1. Row Count is varying based on fetch-size 10000 and direct mode.
2. Row count is not consistent, sometimes it is 630869 and sometimes 639924
etc.,
3. We are pulling data with direct mode for 1 year with monthly partition, data
mismatch happening for couple of months not for all months.
4. Hive/Hadoop table Count looks good, If we import the data from oracle
without direct mode.
Actual row count in oracle table for year 2015 and month 12 is : 609924, we are
getting more records on sqoop import with direct mode : 630869. hive table
count is matching with oracle count on sqoop import *without direct* mode :
630869
*1. sqoop import without direct mode, Row Count : 630869*
sqoop import -Dmapredue.job.queuename=queue
-Doraoop.import.partitions=oracle_table_partition_201512 --connect
jdbc:oracle:thin:@//xxxxx.yyyyy.com:2408/zzzz --username user_id -P --table
oracle_db.oracle_table --split-by split_column --hive-import --hive-table
hive_db.hive_table --where "TO_DATE(oracle_timestamp_column) BETWEEN
TO_DATE('12/01/2015','mm/dd/yyyy') AND TO_DATE('12/31/2015','mm/dd/yyyy')"
--fields-terminated-by '\001' --lines-terminated-by '\n' --as-textfile -m 4
*2. sqoop import direct mode, Row Count : 630869*
sqoop import -Dmapredue.job.queuename=queue
-Doraoop.import.partitions=oracle_table_partition_201512 --connect
jdbc:oracle:thin:@//xxxxx.yyyyy.com:2408/zzzz --username user_id -P --table
oracle_db.oracle_table --split-by split_column --hive-import --hive-table
hive_db.hive_table --where "TO_DATE(oracle_timestamp_column) BETWEEN
TO_DATE('12/01/2015','mm/dd/yyyy') AND TO_DATE('12/31/2015','mm/dd/yyyy')"
--fields-terminated-by '\001' --lines-terminated-by '\n' --as-textfile -m 4
--direct
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)