RE: Data mismatch when importing data from Oracle to Hive through Sqoop without an error
Hi Venkat,

Almost all columns have some value, except these three.

Regards,
Ajit

-----Original Message-----
From: Venkat Ranganathan [mailto:vranganat...@hortonworks.com]
Sent: Wednesday, March 06, 2013 9:36 PM
To: user@hive.apache.org
Cc: u...@sqoop.apache.org
Subject: Re: Data mismatch when importing data from Oracle to Hive through Sqoop without an error

Hi Ajit,

Do you know whether the rest of the columns are also null in the rows where these three non-null columns are null?

Venkat

On Wed, Mar 6, 2013 at 12:35 AM, Ajit Kumar Shreevastava ajit.shreevast...@hcl.com wrote:

Hi Abhijeet,

Thanks for your response. If "values that don't fit in double are getting inserted as NULL" were the case, then the counts would not mismatch in both cases. Here the inserted null values are extra rows, over and above the values already present in both the Oracle table and the Hive table. Correct me if my interpretation is wrong.

Thanks and Regards,
Ajit Kumar Shreevastava

From: abhijeet gaikwad [mailto:abygaikwa...@gmail.com]
Sent: Wednesday, March 06, 2013 1:46 PM
To: user@hive.apache.org
Cc: u...@sqoop.apache.org
Subject: Re: Data mismatch when importing data from Oracle to Hive through Sqoop without an error

Sqoop maps numeric and decimal types (RDBMS) to double (Hive). I think the values that don't fit in double must be getting inserted as NULL. You can see this warning in your logs.

Thanks,
Abhijeet

On Wed, Mar 6, 2013 at 1:32 PM, Ajit Kumar Shreevastava ajit.shreevast...@hcl.com wrote:

Hi all,

I have noticed one interesting thing in the result set below.
I have fired one query in both the Oracle and Hive shells and found the following result sets:

SQL> select count(1) from bttn
  2  where bttn_id is null or data_inst_id is null or scr_id is null;

  COUNT(1)
----------
         0

hive> select count(1) from bttn
    > where bttn_id is null or data_inst_id is null or scr_id is null;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201303051835_0020, Tracking URL = http://NHCLT-PC44-2:50030/jobdetails.jsp?jobid=job_201303051835_0020
Kill Command = /home/hadoop/hadoop-1.0.3/bin/hadoop job -kill job_201303051835_0020
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2013-03-06 13:22:56,908 Stage-1 map = 0%, reduce = 0%
2013-03-06 13:23:05,928 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 5.2 sec
2013-03-06 13:23:06,931 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 5.2 sec
2013-03-06 13:23:07,934 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 5.2 sec
2013-03-06 13:23:08,938 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 5.2 sec
2013-03-06 13:23:09,941 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 5.2 sec
2013-03-06 13:23:10,944 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 5.2 sec
2013-03-06 13:23:11,947 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 5.2 sec
2013-03-06 13:23:12,956 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 5.2 sec
2013-03-06 13:23:13,959 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 5.2 sec
2013-03-06 13:23:14,962 Stage-1 map = 100%, reduce = 33%, Cumulative CPU 5.2 sec
2013-03-06 13:23:15,965 Stage-1 map = 100%, reduce = 33%, Cumulative CPU 5.2 sec
2013-03-06 13:23:16,969 Stage-1 map = 100%, reduce = 33%, Cumulative CPU 5.2 sec
2013-03-06 13:23:17,974 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 6.95 sec
2013-03-06 13:23:18,977 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 6.95 sec
2013-03-06 13:23:19,981 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 6.95 sec
2013-03-06 13:23:20,985 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 6.95 sec
2013-03-06 13:23:21,988 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 6.95 sec
2013-03-06 13:23:22,995 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 6.95 sec
2013-03-06 13:23:23,998 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 6.95 sec
MapReduce Total cumulative CPU time: 6 seconds 950 msec
Ended Job = job_201303051835_0020
MapReduce Jobs Launched:
Job 0: Map: 1  Reduce: 1  Cumulative CPU: 6.95 sec  HDFS Read: 184270926  HDFS Write: 4  SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 950 msec
OK
986
Time taken: 35.983 seconds
hive>

And 739169 - 738183 = 986. Can anyone tell me why this happened, given that BTTN_ID, DATA_INST_ID, and SCR_ID carry NOT NULL constraints on the BTTN table and also form its composite primary key? Also, how can I prevent this unnecessary data generation in the Hive table?

Regards,
Ajit Kumar Shreevastava
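A minimal, self-contained sketch of one way a row count can drift on a delimited-text import without any error being raised: if a free-text column contains a record delimiter, a line-oriented count sees more rows than were written. The file name and values below are invented for illustration; \x01 is Hive's default field separator.

```python
# Write two logical records in Hive's default text layout: fields joined
# by ^A (\x01), records terminated by newlines. The second record's text
# column itself contains a newline.
import os
import tempfile

records = [
    ["1", "first button label"],
    ["2", "label with an embedded\nnewline"],
]

path = os.path.join(tempfile.gettempdir(), "bttn_demo.txt")
with open(path, "w") as f:
    for rec in records:
        f.write("\x01".join(rec) + "\n")

# We wrote 2 records, but a line-oriented reader counts 3 rows.
with open(path) as f:
    physical_lines = sum(1 for _ in f)

print("records written:", len(records))
print("rows a line-based reader sees:", physical_lines)
```

The extra physical line has no valid key fields, which is consistent with rows that have no error at import time but look wrong when counted or queried later.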
From: Ajit Kumar Shreevastava
Sent: Wednesday, March 06, 2013 12:40 PM
To: 'u...@sqoop.apache.org'
Cc: user@hive.apache.org
Subject: Data mismatch when importing data from Oracle to Hive through Sqoop without an error

Hi,

I have a table BTTN in an Oracle database having 738183 records:

SQL> select count(1) from bttn;

  COUNT(1)
----------
    738183

Now I want to import
Re: Data mismatch when importing data from Oracle to Hive through Sqoop without an error
Hi Ajit,

I've seen a similar issue many times. Does your table have textual data? If so, can it happen that your textual data contains Hive delimiters, such as new-line characters? If it does, Sqoop might create two lines for one single row in the table, which will consequently be seen as two rows in Hive. As Hive will implicitly convert any invalid values into NULL, it would also explain your NULL values. Sqoop offers the arguments --hive-drop-import-delims and --hive-delims-replacement to deal with this problem. More information can be found in the Sqoop user guide [1].

Jarcec

Links:
1: http://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_importing_data_into_hive

On Wed, Mar 06, 2013 at 10:50:26PM +0530, abhijeet gaikwad wrote:

If you look at the Hadoop job counters in your logs, the numbers are correct up to that point, so the plain import into HDFS is working fine. The only reasonable explanation is that there is an issue inserting the data into Hive. Have a look at the file on HDFS: it should contain the correct number of records with correct data. But if you do select * ... from the Hive console, it will give you NULL values for invalid data, i.e. data that cannot be converted to the respective data type of the Hive table's column. This is because Hive imposes data type constraints on the data in files on HDFS.

Those extra records are a mystery for me too. I don't think the actual file on HDFS has extra records. Sqoop is creating the Hive table incorrectly, which *may* have caused this issue. Create the table manually, with the correct data types where equivalents are available or as string otherwise, and try loading the data. If this works fine, we know where the issue is!

Thanks,
Abhijeet
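The delimiter explanation above can be sketched in a few lines of illustrative Python. This mimics, under stated assumptions, how a line-oriented reader and a lenient numeric parser behave; it is not Sqoop's or Hive's actual code, and the field values are invented.

```python
# A row whose text column contains a newline is written as two physical
# lines; a line-oriented reader then sees two rows, and the second row's
# first field fails numeric conversion, surfacing as NULL (None here).
FIELD_SEP = "\x01"  # Hive's default field delimiter (^A)

def parse_rows(raw):
    """Parse raw delimited text, keeping the first (numeric) field per row."""
    rows = []
    for line in raw.split("\n"):
        if not line:
            continue
        first_field = line.split(FIELD_SEP)[0]
        rows.append(int(first_field) if first_field.isdigit() else None)
    return rows

# One logical record, but its label carries an embedded newline.
raw = "42" + FIELD_SEP + "label with an embedded\nnewline"
print(parse_rows(raw))   # the single record shows up as two rows, one NULL

# Roughly what --hive-delims-replacement ' ' does: scrub the delimiters
# out of the field values before the file is written.
clean = raw.replace("\n", " ").replace("\r", " ")
print(parse_rows(clean))
```

This reproduces both symptoms in the thread at once: an inflated row count and extra NULLs in columns that are NOT NULL in Oracle.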