Here is the command I used:
sqoop import -libjars
--driver com.sybase.jdbc3.jdbc.SybDriver \
--query "select * from EMP where \$CONDITIONS and SAL > 201401200 and SAL <= 201401204" \
--check-column Unique_value \
--incremental append \
--last-value 201401200 \
--split-by DEPT \
--fields-terminated-by ',' \
--target-dir ${TARGET_DIR}/${INC} \
--username ${SYBASE_USERNAME} \
--password ${SYBASE_PASSWORD}
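For context, Sqoop replaces \$CONDITIONS with a per-mapper range predicate on the --split-by column (DEPT here), so each mapper effectively runs something like the following. The DEPT bounds below are illustrative only; Sqoop derives the real ones from MIN/MAX of the split column:

```
-- what one of the (by default four) mappers might execute;
-- the range substituted for $CONDITIONS differs per mapper
SELECT *
FROM EMP
WHERE ( DEPT >= 1 AND DEPT < 25 )        -- substituted for $CONDITIONS
  AND SAL > 201401200 AND SAL <= 201401204;
```

Since all the predicates are combined with AND, the position of \$CONDITIONS inside the WHERE clause does not change which rows match; it just has to appear exactly once.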
Now I have imported the newly inserted RDBMS data into HDFS,
but when I run
select count(*), unique_value from EMP group by unique_value (both in
the RDBMS and in Hive)
I find a huge data loss.
1) in RDBMS
Count(*) Unique_value
1000 201401201
5000 201401202
10000 201401203
2) in HIVE
Count(*) Unique_value
189 201401201
421 201401202
50 201401203
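A quick check that may help narrow this down (assuming the Hive table is also called EMP) is to compare total rows against distinct keys on both sides:

```
-- run the same statement in Sybase and in Hive:
-- if total_rows differs, records never reached HDFS;
-- if total_rows matches but distinct_keys differs, the key values are
-- being altered on the way in (e.g. a ',' inside a field shifting
-- columns, since the import uses --fields-terminated-by ',')
SELECT COUNT(*)                      AS total_rows,
       COUNT(DISTINCT unique_value)  AS distinct_keys
FROM EMP;
```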
If I do
select unique_value from EMP;
Result :
201401201
201401201
201401201
201401201
201401201
.
.
201401202
.
.
and so on...
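The multiplication symptom described earlier in this thread is also consistent with how --incremental append works: each run only adds new part files under the target directory, so any overlap between runs (for example, a --last-value that is too low) shows up as duplicate rows when the external table reads every file. A minimal local sketch of the effect, with made-up sample keys:

```shell
# simulate two append runs writing part files into the same directory,
# where the second run re-imports a row the first run already wrote
mkdir -p /tmp/inc_demo
printf '201401201,Alice\n201401202,Bob\n'  > /tmp/inc_demo/part-m-00000
printf '201401202,Bob\n201401203,Carol\n'  > /tmp/inc_demo/part-m-00001

# count rows per key across all part files, as a Hive external table would
cat /tmp/inc_demo/part-* | cut -d',' -f1 | sort | uniq -c
```

The same per-key count can be run against the real import with `hadoop fs -cat ${TARGET_DIR}/${INC}/part-*` in place of the local `cat`.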
On Tue, Jan 14, 2014 at 1:29 AM, yogesh kumar <[email protected]> wrote:
> Hello Abe,
>
> Thanks a zillion for your response,
>
> Yes, the unique_value is SAL here.
>
> I have the option to test with a different directory, as it has a small
> amount of data (only 1 month of data).
>
> Yes, the interesting fact is that I did the Sqoop pull on a monthly basis,
> for which I did not do an incremental import, and the data matched very
> well; but for the daily pull I have to do an incremental import every day,
>
> and then I find some data loss...
>
> One of the Sqoop folks suggested that I use " \$CONDITIONS " within
> the query, as in: ...... from EMP where \$CONDITIONS and SAL > 201401200
> and SAL <= 201401204 \
>
> like
>
> EMP where SAL > 201401200 and SAL <= 201401204 and \$CONDITIONS
>
>
>
> Please suggest and help me out, as I have to deliver the output to my client.
>
> On Mon, Jan 13, 2014 at 11:13 PM, Abraham Elmahrek <[email protected]>wrote:
>
>> Yogesh,
>>
>> Is unique_value in this case SAL? I'm a bit confused about your query.
>>
>> Do you have the option of running this query on a separate database
>> somewhere to find the issue? I think it would be interesting to see the
>> initial state and then the state after running an incremental import. That
>> would tell us how many results are being imported after Sqoop has run, and
>> we can validate each step. Also, please use the --verbose flag to get the
>> most out of the logs.
>>
>> -Abe
>>
>>
>> On Mon, Jan 13, 2014 at 5:15 AM, Sharath Punreddy
>> <[email protected]>wrote:
>>
>>> Yogesh,
>>>
>>> Please try to put $CONDITIONS after your where clause.
>>>
>>> Check out the examples in the blog below.
>>>
>>>
>>> http://jugnu-life.blogspot.com/2012/03/sqoop-free-form-query-example.html?m=1
>>> On Jan 13, 2014 7:04 AM, "yogesh kumar" <[email protected]> wrote:
>>>
>>>> Hello Jarcec,
>>>>
>>>> I found the issue; I hope this is the cause. I am seeing data loss when
>>>> doing an incremental pull.
>>>>
>>>> I have cross-checked it and found that
>>>>
>>>> sqoop import -libjars
>>>> --driver com.sybase.jdbc3.jdbc.SybDriver \
>>>> --query "select * from EMP where \$CONDITIONS and SAL > 201401200 and SAL <= 201401204" \
>>>> --check-column Unique_value \
>>>> --incremental append \
>>>> --last-value 201401200 \
>>>> --split-by DEPT \
>>>> --fields-terminated-by ',' \
>>>> --target-dir ${TARGET_DIR}/${INC} \
>>>> --username ${SYBASE_USERNAME} \
>>>> --password ${SYBASE_PASSWORD}
>>>>
>>>>
>>>> Now I have imported the newly inserted RDBMS data into HDFS,
>>>>
>>>> but when I run
>>>>
>>>> select count(*), unique_value from EMP group by unique_value (both in
>>>> the RDBMS and in Hive)
>>>>
>>>> I find a huge data loss.
>>>>
>>>> 1) in RDBMS
>>>>
>>>> Count(*) Unique_value
>>>> 1000 201401201
>>>> 5000 201401202
>>>> 10000 201401203
>>>>
>>>>
>>>> 2) in HIVE
>>>>
>>>> Count(*) Unique_value
>>>> 189 201401201
>>>> 421 201401202
>>>> 50 201401203
>>>>
>>>>
>>>> If I do
>>>>
>>>> select unique_value from EMP;
>>>>
>>>> Result :
>>>> 201401201
>>>> 201401201
>>>> 201401201
>>>> 201401201
>>>> 201401201
>>>> .
>>>> .
>>>> 201401202
>>>> .
>>>> .
>>>> and so on...
>>>>
>>>>
>>>> Please help and suggest why this is so.
>>>>
>>>>
>>>> Many thanks in advance
>>>>
>>>> Yogesh kumar
>>>>
>>>> On Sun, Jan 12, 2014 at 11:08 PM, Jarek Jarcec Cecho <[email protected]
>>>> > wrote:
>>>>
>>>>> Hi Yogesh,
>>>>> I would start by verifying the imported data. If there are duplicates, then
>>>>> it suggests some misconfiguration of Sqoop; otherwise you might have
>>>>> some inconsistencies down the pipeline.
>>>>>
>>>>> Jarcec
>>>>>
>>>>> On Sat, Jan 11, 2014 at 11:01:22PM +0530, yogesh kumar wrote:
>>>>> > Hello All,
>>>>> >
>>>>> > I am working on a use case where I have to run a process on a daily
>>>>> > basis which will do these things:
>>>>> >
>>>>> > 1) Every day, pull the new data inserted into the RDBMS tables to HDFS
>>>>> > 2) Have an external table in Hive (pointing to the HDFS directory
>>>>> > where the data is pulled by Sqoop)
>>>>> > 3) Perform some Hive queries (joins) and create a final internal
>>>>> > table in Hive (say, Hive_Table_Final).
>>>>> >
>>>>> >
>>>>> > What I am doing..
>>>>> >
>>>>> > I am migrating a process from the RDBMS to Hadoop (the same process is
>>>>> > currently executed in an RDBMS procedure and stored in a final table,
>>>>> > say Rdbms_Table_Final).
>>>>> >
>>>>> > The issue I am facing is:
>>>>> >
>>>>> > Every time I do an incremental import, after processing I find the
>>>>> > values in the final Hive table multiplied by the number of times I
>>>>> > have done an incremental import. For example, after performing an
>>>>> > incremental import once a day for 4 days, the data in the final Hive
>>>>> > table (Hive_Table_Final) is multiplied by 4 with respect to the
>>>>> > final table in the RDBMS (Rdbms_Table_Final).
>>>>> >
>>>>> >
>>>>> > Like..
>>>>> >
>>>>> > 1) The first time, I pulled the data from the RDBMS based on months
>>>>> > (from 2013-12-01 to 2014-01-01) and processed it; I got perfect
>>>>> > results, with the data in the final Hive table (Hive_Table_Final)
>>>>> > matching the RDBMS processed data in (Rdbms_Table_Final).
>>>>> >
>>>>> > 2) I have done an incremental import to bring the new data from the
>>>>> > RDBMS to HDFS using this command:
>>>>> >
>>>>> >
>>>>> > sqoop import -libjars
>>>>> > --driver com.sybase.jdbc3.jdbc.SybDriver \
>>>>> > --query "select * from EMP where \$CONDITIONS and SAL > 50000 and SAL <= 80000" \
>>>>> > --check-column Unique_value \
>>>>> > --incremental append \
>>>>> > --last-value 201401200 \
>>>>> > --split-by DEPT \
>>>>> > --fields-terminated-by ',' \
>>>>> > --target-dir ${TARGET_DIR}/${INC} \
>>>>> > --username ${SYBASE_USERNAME} \
>>>>> > --password ${SYBASE_PASSWORD}
>>>>> >
>>>>> > "Note -- the field Unique_value is unique every time; it is
>>>>> > like a primary key."
>>>>> >
>>>>> >
>>>>> >
>>>>> > Now I have pulled to HDFS just the new records that were in the
>>>>> > RDBMS tables.
>>>>> >
>>>>> > Now I have a major data mismatch issue after the
>>>>> > processing (Hive_Table_Final).
>>>>> >
>>>>> > My major issue is with the Sqoop incremental import: however many
>>>>> > times I do an incremental import, I find the data in my final table
>>>>> > multiplied by that number of times.
>>>>> >
>>>>> >
>>>>> > Please suggest what I am doing wrong and what I am missing;
>>>>> > please help me out.
>>>>> >
>>>>> >
>>>>> > Thanks & Regards
>>>>> > Yogesh Kumar
>>>>>
>>>>
>>>>
>>
>