Is the data changing while you are running the Sqoop import? The --verbose option should put some interesting debug information in your logs, such as the query being used at execution and the different WHERE clauses being generated for splitting.
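For example, something like this (just a sketch; ${SYBASE_URL} stands in for your JDBC connect string, and the incremental arguments stay as you have them):

    sqoop import --verbose \
      --driver com.sybase.jdbc3.jdbc.SybDriver \
      --connect ${SYBASE_URL} \
      --query "select * from EMP where \$CONDITIONS and SAL > 201401200 and SAL <= 201401204" \
      --split-by DEPT \
      --target-dir ${TARGET_DIR}/${INC} \
      --username ${SYBASE_USERNAME} \
      --password ${SYBASE_PASSWORD}

With --verbose on, the logs should show the bounding-value query Sqoop runs on the split column and the range each mapper selected, which makes it easier to spot rows that fell outside every split.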
-Abe


On Mon, Jan 13, 2014 at 12:02 PM, yogesh kumar <[email protected]> wrote:

> Here is the code I have used:
>
>     sqoop import -libjars \
>       --driver com.sybase.jdbc3.jdbc.SybDriver \
>       --query "select * from EMP where \$CONDITIONS and SAL > 201401200 and SAL <= 201401204" \
>       --check-column Unique_value \
>       --incremental append \
>       --last-value 201401200 \
>       --split-by DEPT \
>       --fields-terminated-by ',' \
>       --target-dir ${TARGET_DIR}/${INC} \
>       --username ${SYBASE_USERNAME} \
>       --password ${SYBASE_PASSWORD}
>
> Now I have imported the newly inserted data from the RDBMS into HDFS, but
> when I run
>
>     select count(*), unique_value from EMP group by unique_value
>
> (both in the RDBMS and in Hive), I find huge data loss.
>
> 1) In the RDBMS:
>
>     Count(*)    Unique_value
>     1000        201401201
>     5000        201401202
>     10000       201401203
>
> 2) In Hive:
>
>     Count(*)    Unique_value
>     189         201401201
>     421         201401202
>     50          201401203
>
> If I run
>
>     select unique_value from EMP;
>
> the result is:
>
>     201401201
>     201401201
>     201401201
>     201401201
>     201401201
>     .
>     .
>     201401202
>     .
>     .
>
> and so on...
>
>
> On Tue, Jan 14, 2014 at 1:29 AM, yogesh kumar <[email protected]> wrote:
>
>> Hello Abe,
>>
>> Thanks a zillion for your response.
>>
>> Yes, the unique_value here is SAL.
>>
>> I do have the option of testing in a different directory, since it holds
>> only a small amount of data (one month).
>>
>> The interesting fact is that when I did the Sqoop pull on a monthly
>> basis, without incremental import, the data matched very well. For the
>> daily pull, however, I have to do an incremental import every day, and
>> that is when I find the data loss.
>>
>> One of the Sqoop folks suggested changing how \$CONDITIONS sits in the
>> query, i.e. instead of
>>
>>     ... from EMP where \$CONDITIONS and SAL > 201401200 and SAL <= 201401204
>>
>> writing
>>
>>     ... from EMP where SAL > 201401200 and SAL <= 201401204 and \$CONDITIONS
>>
>> Please suggest and help me out, as I have to deliver the output to my
>> client.
>>
>>
>> On Mon, Jan 13, 2014 at 11:13 PM, Abraham Elmahrek <[email protected]> wrote:
>>
>>> Yogesh,
>>>
>>> Is unique_value in this case SAL? I'm a bit confused about your query.
>>>
>>> Do you have the option of running this query on a separate database
>>> somewhere to find the issue? It would be interesting to see the initial
>>> state and then the state after running an incremental import. That would
>>> tell us how many results are imported after Sqoop has run, and we could
>>> validate each step. Also, please use the --verbose flag to get the most
>>> out of the logs.
>>>
>>> -Abe
>>>
>>>
>>> On Mon, Jan 13, 2014 at 5:15 AM, Sharath Punreddy <[email protected]> wrote:
>>>
>>>> Yogesh,
>>>>
>>>> Please try putting $CONDITIONS at the end of your WHERE clause.
>>>>
>>>> Check out the examples in the blog below:
>>>>
>>>> http://jugnu-life.blogspot.com/2012/03/sqoop-free-form-query-example.html?m=1
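>>>> In your query that would look roughly like this (a sketch only; every
>>>> other argument stays as it is):
>>>>
>>>>     --query "select * from EMP where SAL > 201401200 and SAL <= 201401204 and \$CONDITIONS"
>>>>
>>>> Sqoop substitutes a different range predicate on the --split-by column
>>>> for \$CONDITIONS in each mapper, so the token has to sit where it can
>>>> be replaced cleanly in the final WHERE clause.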
>>>> On Jan 13, 2014 7:04 AM, "yogesh kumar" <[email protected]> wrote:
>>>>
>>>>> Hello Jarcec,
>>>>>
>>>>> I got the issue, and I hope this is the cause: I got data loss by
>>>>> doing the incremental pull.
>>>>>
>>>>> I have cross-checked it and found the following. With
>>>>>
>>>>>     sqoop import -libjars \
>>>>>       --driver com.sybase.jdbc3.jdbc.SybDriver \
>>>>>       --query "select * from EMP where \$CONDITIONS and SAL > 201401200 and SAL <= 201401204" \
>>>>>       --check-column Unique_value \
>>>>>       --incremental append \
>>>>>       --last-value 201401200 \
>>>>>       --split-by DEPT \
>>>>>       --fields-terminated-by ',' \
>>>>>       --target-dir ${TARGET_DIR}/${INC} \
>>>>>       --username ${SYBASE_USERNAME} \
>>>>>       --password ${SYBASE_PASSWORD}
>>>>>
>>>>> I have now imported the newly inserted RDBMS data into HDFS, but when
>>>>> I run
>>>>>
>>>>>     select count(*), unique_value from EMP group by unique_value
>>>>>
>>>>> (both in the RDBMS and in Hive), I find huge data loss.
>>>>>
>>>>> 1) In the RDBMS:
>>>>>
>>>>>     Count(*)    Unique_value
>>>>>     1000        201401201
>>>>>     5000        201401202
>>>>>     10000       201401203
>>>>>
>>>>> 2) In Hive:
>>>>>
>>>>>     Count(*)    Unique_value
>>>>>     189         201401201
>>>>>     421         201401202
>>>>>     50          201401203
>>>>>
>>>>> If I run
>>>>>
>>>>>     select unique_value from emp;
>>>>>
>>>>> the result is:
>>>>>
>>>>>     201401201
>>>>>     201401201
>>>>>     201401201
>>>>>     201401201
>>>>>     201401201
>>>>>     .
>>>>>     .
>>>>>     201401202
>>>>>     .
>>>>>     .
>>>>>
>>>>> and so on...
>>>>>
>>>>> Please help and suggest why this is so.
>>>>>
>>>>> Many thanks in advance,
>>>>>
>>>>> Yogesh Kumar
>>>>>
>>>>> On Sun, Jan 12, 2014 at 11:08 PM, Jarek Jarcec Cecho <[email protected]> wrote:
>>>>>
>>>>>> Hi Yogesh,
>>>>>> I would start by verifying the imported data. If there are
>>>>>> duplicates, that suggests some misconfiguration of Sqoop; otherwise
>>>>>> you might have some inconsistency further down the pipeline.
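>>>>>> For example (a rough sketch; ${SYBASE_URL} is a placeholder for your
>>>>>> connect string), compare the per-value counts at the source with the
>>>>>> number of records that actually landed in HDFS:
>>>>>>
>>>>>>     # counts on the Sybase side
>>>>>>     sqoop eval --driver com.sybase.jdbc3.jdbc.SybDriver \
>>>>>>       --connect ${SYBASE_URL} \
>>>>>>       --username ${SYBASE_USERNAME} --password ${SYBASE_PASSWORD} \
>>>>>>       --query "select Unique_value, count(*) from EMP group by Unique_value"
>>>>>>
>>>>>>     # rows Sqoop wrote for this increment
>>>>>>     hadoop fs -cat ${TARGET_DIR}/${INC}/part-* | wc -l
>>>>>>
>>>>>> If the HDFS count already disagrees with the source, the import itself
>>>>>> is dropping or duplicating rows; if it matches, the problem is further
>>>>>> down, in the Hive processing.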
>>>>>> Jarcec
>>>>>>
>>>>>> On Sat, Jan 11, 2014 at 11:01:22PM +0530, yogesh kumar wrote:
>>>>>> > Hello All,
>>>>>> >
>>>>>> > I am working on a use case where I have to run a process on a daily
>>>>>> > basis which does the following:
>>>>>> >
>>>>>> > 1) Pull each day's newly inserted RDBMS data into HDFS.
>>>>>> > 2) Have an external table in Hive pointing to the HDFS directory
>>>>>> > where the data is pulled by Sqoop.
>>>>>> > 3) Perform some Hive queries (joins) and create a final internal
>>>>>> > table in Hive (say Hive_Table_Final).
>>>>>> >
>>>>>> > What I am doing: I am migrating a process from the RDBMS to Hadoop
>>>>>> > (the same process is currently executed as an RDBMS procedure and
>>>>>> > stored in a final table, say Rdbms_Table_Final).
>>>>>> >
>>>>>> > The issue I am facing: every time I do an incremental import and
>>>>>> > process the data, I find the values in the final Hive table
>>>>>> > multiplied by the number of incremental imports I have done. If I
>>>>>> > perform an incremental import once a day for 4 days, the data in
>>>>>> > the final Hive table (Hive_Table_Final) comes out multiplied by 4
>>>>>> > with respect to the final table in the RDBMS (Rdbms_Table_Final).
>>>>>> >
>>>>>> > In detail:
>>>>>> >
>>>>>> > 1) The first time, I pulled the data from the RDBMS by month (e.g.
>>>>>> > from 2013-12-01 to 2014-01-01) and processed it. I got perfect
>>>>>> > results: the data in the final Hive table (Hive_Table_Final)
>>>>>> > matched the processed data in the RDBMS (Rdbms_Table_Final).
>>>>>> >
>>>>>> > 2) Then I did an incremental import to bring the new data from the
>>>>>> > RDBMS to HDFS, using this command:
>>>>>> >
>>>>>> >     sqoop import -libjars \
>>>>>> >       --driver com.sybase.jdbc3.jdbc.SybDriver \
>>>>>> >       --query "select * from EMP where \$CONDITIONS and SAL > 50000 and SAL <= 80000" \
>>>>>> >       --check-column Unique_value \
>>>>>> >       --incremental append \
>>>>>> >       --last-value 201401200 \
>>>>>> >       --split-by DEPT \
>>>>>> >       --fields-terminated-by ',' \
>>>>>> >       --target-dir ${TARGET_DIR}/${INC} \
>>>>>> >       --username ${SYBASE_USERNAME} \
>>>>>> >       --password ${SYBASE_PASSWORD}
>>>>>> >
>>>>>> > Note: the field Unique_value is unique every time; it is like a
>>>>>> > primary key.
>>>>>> >
>>>>>> > So now I have pulled only the new records, the ones just inserted
>>>>>> > into the RDBMS tables, into HDFS. But after processing I get a
>>>>>> > major data mismatch in Hive_Table_Final.
>>>>>> >
>>>>>> > My main issue is with the Sqoop incremental import: however many
>>>>>> > times I do an incremental import, I find the data in my final table
>>>>>> > multiplied by that number of imports.
>>>>>> >
>>>>>> > Please suggest what I am doing wrong and what I am missing.
>>>>>> > Please help me out.
>>>>>> >
>>>>>> > Thanks & Regards,
>>>>>> > Yogesh Kumar
