RE: Data mismatch when importing data from Oracle to Hive through Sqoop without an error

2013-03-08 Thread Ajit Kumar Shreevastava
Hi Venkat,

Almost all columns have some value, except these three.

Regards,
Ajit

-Original Message-
From: Venkat Ranganathan [mailto:vranganat...@hortonworks.com] 
Sent: Wednesday, March 06, 2013 9:36 PM
To: user@hive.apache.org
Cc: u...@sqoop.apache.org
Subject: Re: Data mismatch when importing data from Oracle to Hive through 
Sqoop without an error

Hi Ajit

Do you know whether the rest of the columns are also null in the rows where
the three NOT NULL columns are null?

Venkat
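
A quick way to check is to pull a few of the offending rows in Hive; a
hypothetical query along these lines (LIMIT chosen arbitrarily):

  -- inspect whole rows in which any of the three key columns is NULL
  select * from bttn
  where bttn_id is null or data_inst_id is null or scr_id is null
  limit 10;

If every column in those rows comes back NULL, that suggests whole rows are
being fabricated during the import rather than individual values failing
conversion.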

On Wed, Mar 6, 2013 at 12:35 AM, Ajit Kumar Shreevastava 
ajit.shreevast...@hcl.com wrote:
 Hi Abhijeet,



 Thanks for your response.

 If it were simply the case that values which don't fit in a double get
 inserted as NULL, the row counts would still match on both sides.

 Here the rows holding NULL values are extra rows, over and above the rows
 already present in both the Oracle table and the Hive table.



 Correct me if my interpretation is wrong.



 Thanks and Regards,

 Ajit Kumar Shreevastava



 From: abhijeet gaikwad [mailto:abygaikwa...@gmail.com]
 Sent: Wednesday, March 06, 2013 1:46 PM
 To: user@hive.apache.org
 Cc: u...@sqoop.apache.org
 Subject: Re: Data mismatch when importing data from Oracle to Hive 
 through Sqoop without an error



 Sqoop maps numeric and decimal types (RDBMS) to double (Hive). I think 
 the values that don't fit in double must be getting inserted as NULL.
 You can see this warning in your logs.

 Thanks,
 Abhijeet
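
 One way to avoid the lossy NUMBER-to-double mapping is to override it per
 column with Sqoop's --map-column-hive option; a sketch, with the connection
 string, credentials, and target types as assumptions:

   # placeholder connection details; -P prompts for the password
   sqoop import \
     --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
     --username SCOTT -P \
     --table BTTN \
     --hive-import \
     --map-column-hive BTTN_ID=BIGINT,DATA_INST_ID=BIGINT,SCR_ID=BIGINT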

 On Wed, Mar 6, 2013 at 1:32 PM, Ajit Kumar Shreevastava 
 ajit.shreevast...@hcl.com wrote:

 Hi all,

 I have noticed one interesting thing in the result sets below.

 I have fired the same query in both the Oracle and Hive shells and got the
 following results:



 SQL> select count(1) from bttn

   2  where bttn_id is null or data_inst_id is null or scr_id is null;



   COUNT(1)

 ----------

  0

 hive> select count(1) from bttn

  where bttn_id is null or data_inst_id is null or scr_id is null;

 Total MapReduce jobs = 1

 Launching Job 1 out of 1

 Number of reduce tasks determined at compile time: 1

 In order to change the average load for a reducer (in bytes):

   set hive.exec.reducers.bytes.per.reducer=<number>

 In order to limit the maximum number of reducers:

   set hive.exec.reducers.max=<number>

 In order to set a constant number of reducers:

   set mapred.reduce.tasks=<number>

 Starting Job = job_201303051835_0020, Tracking URL =
 http://NHCLT-PC44-2:50030/jobdetails.jsp?jobid=job_201303051835_0020

 Kill Command = /home/hadoop/hadoop-1.0.3/bin/hadoop job  -kill
 job_201303051835_0020

 Hadoop job information for Stage-1: number of mappers: 1; number of
 reducers: 1

 2013-03-06 13:22:56,908 Stage-1 map = 0%,  reduce = 0%

 2013-03-06 13:23:05,928 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.2 sec

 2013-03-06 13:23:06,931 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.2 sec

 2013-03-06 13:23:07,934 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.2 sec

 2013-03-06 13:23:08,938 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.2 sec

 2013-03-06 13:23:09,941 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.2 sec

 2013-03-06 13:23:10,944 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.2 sec

 2013-03-06 13:23:11,947 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.2 sec

 2013-03-06 13:23:12,956 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.2 sec

 2013-03-06 13:23:13,959 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.2 sec

 2013-03-06 13:23:14,962 Stage-1 map = 100%,  reduce = 33%, Cumulative CPU 5.2 sec

 2013-03-06 13:23:15,965 Stage-1 map = 100%,  reduce = 33%, Cumulative CPU 5.2 sec

 2013-03-06 13:23:16,969 Stage-1 map = 100%,  reduce = 33%, Cumulative CPU 5.2 sec

 2013-03-06 13:23:17,974 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 6.95 sec

 2013-03-06 13:23:18,977 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 6.95 sec

 2013-03-06 13:23:19,981 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 6.95 sec

 2013-03-06 13:23:20,985 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 6.95 sec

 2013-03-06 13:23:21,988 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 6.95 sec

 2013-03-06 13:23:22,995 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 6.95 sec

 2013-03-06 13:23:23,998 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 6.95 sec

 MapReduce Total cumulative CPU time: 6 seconds 950 msec

 Ended Job = job_201303051835_0020

 MapReduce Jobs Launched:

 Job 0: Map: 1  Reduce: 1   Cumulative CPU: 6.95 sec   HDFS Read: 184270926  HDFS Write: 4 SUCCESS

 Total MapReduce CPU Time Spent: 6 seconds 950 msec

 OK

 986

 Time taken: 35.983 seconds

 hive>



 and 739169 - 738183 = 986, i.e. the Hive row count minus the Oracle row count
 exactly matches the number of NULL-key rows.



 Can anyone tell me why this happened, given that BTTN_ID, DATA_INST_ID, and
 SCR_ID carry NOT NULL constraints in the BTTN table and together form its
 composite primary key?

 Also, how can I prevent this unnecessary data from being generated in the
 Hive table?



 Regards

 Ajit Kumar Shreevastava



 From

Re: Data mismatch when importing data from Oracle to Hive through Sqoop without an error

2013-03-06 Thread Venkat Ranganathan
Hi Ajit

Do you know whether the rest of the columns are also null in the rows where
the three NOT NULL columns are null?

Venkat

On Wed, Mar 6, 2013 at 12:35 AM, Ajit Kumar Shreevastava
ajit.shreevast...@hcl.com wrote:
 [snip]

 From: Ajit Kumar Shreevastava
 Sent: Wednesday, March 06, 2013 12:40 PM
 To: 'u...@sqoop.apache.org'
 Cc: user@hive.apache.org
 Subject: Data mismatch when importing data from Oracle to Hive through Sqoop
 without an error



 Hi,



 I have a table BTTN in an Oracle database with 738183 records:



 SQL> select count(1) from bttn;



   COUNT(1)

 ----------

 738183



 Now I want to import

Re: Data mismatch when importing data from Oracle to Hive through Sqoop without an error

2013-03-06 Thread Jarek Jarcec Cecho
Hi Ajit,
I've seen a similar issue many times. Does your table contain textual data? If
so, can it happen that the text contains Hive delimiters such as new-line
characters? If it does, Sqoop might emit two lines for a single row of the
table, which Hive will consequently see as two rows. And since Hive implicitly
converts any invalid values to NULL, that would also explain your NULL values.

Sqoop offers the arguments --hive-drop-import-delims and --hive-delims-replacement
to deal with this problem. More information can be found in the Sqoop user guide
[1].
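
A sketch of such an import (the connection string and credentials are
placeholders; --hive-delims-replacement ' ' could be used instead to replace
the delimiters with a space rather than drop them):

  # placeholder connection details; -P prompts for the password
  sqoop import \
    --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
    --username SCOTT -P \
    --table BTTN \
    --hive-import \
    --hive-drop-import-delims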

Jarcec

Links:
1: 
http://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_importing_data_into_hive

On Wed, Mar 06, 2013 at 10:50:26PM +0530, abhijeet gaikwad wrote:
 If you look at the Hadoop job counters in your logs, the numbers are correct
 up to that point, so the plain import into HDFS is working fine. The only
 reasonable explanation is that there is an issue inserting the data into Hive.
 
 Have a look at the file on HDFS; it should contain the correct number of
 records with correct data. But if you do select * ... from the Hive console,
 it will give you NULL values for invalid data, i.e. data that cannot be
 converted to the declared data type of the Hive column. This is because Hive
 imposes its data type constraints on the files in HDFS only when reading them.
 Those extra records are a mystery to me too; I don't think the actual file on
 HDFS has extra records.
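
 For example, a record count over the raw files should match the Oracle count;
 a sketch assuming Sqoop's default file layout (the HDFS path is an assumption):

   # count the records Sqoop actually wrote to HDFS
   hadoop fs -cat /user/hadoop/BTTN/part-m-* | wc -l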
 
 Sqoop is creating the Hive table incorrectly, which **may** have caused this
 issue. Create the table manually, with the correct data types where
 equivalents are available (or as STRING otherwise), and try loading the data.
 If this works fine, we know where the issue is!
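
 A minimal sketch of that experiment, assuming the three key columns, Hive's
 default ^A field delimiter from --hive-import, and a guessed HDFS path; the
 remaining BTTN columns would be added as STRING:

   -- hypothetical manual table covering the three key columns
   CREATE TABLE bttn_manual (
     bttn_id      BIGINT,
     data_inst_id BIGINT,
     scr_id       BIGINT
   )
   ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001';

   -- move the Sqoop-produced files into the table (path is an assumption)
   LOAD DATA INPATH '/user/hadoop/BTTN' INTO TABLE bttn_manual;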
 
 Thanks,
 Abhijeet
 On Wed, Mar 6, 2013 at 2:05 PM, Ajit Kumar Shreevastava 
 ajit.shreevast...@hcl.com wrote:
 
 
  [snip]