RE: Sqoop Incremental mode to get updated data in Hive

Radhe Radhe Fri, 19 Sep 2014 10:23:32 -0700

Hi Pratik,
Saying that means there is no straight way to get the updated rows in an 
increment mode where the table does not have a 'last-modified' column.
The ONLY way the increment mode will work is with the 'append' mode in case the 
table does NOT have 'last-modified' column.
You should specify append mode when importing a table where new rows are 
continually being added with increasing row id values. You specify the column 
containing the row's id with --check-column. Sqoop imports rows where the check 
column has a value greater than the one specified with --last-value.
Also Hive does not support the notion of Updating/Deleting 
rows.https://issues.apache.org/jira/browse/HIVE-5317
The append mode will not help in my case as I want the updated rows in the 
Target Hive tables via the Sqoop import increment mode.
I believe that there has to be some work around for this as this is one of the 
most basic scenarios for an increment import from Source to Target.
Thanks,-RR

From: [email protected]
Date: Fri, 19 Sep 2014 08:51:04 -0700
Subject: Re: Sqoop Incremental mode to get updated data in Hive
To: [email protected]

The docs say that "The merge tool is typically run after an incremental import 
with the date-last-modified mode (sqoop import --incremental lastmodified …)."
Also, I am not sure how the merge tool handles deleted records, looking at the 
code of merge tool, i don't think it deletes the corresponding records from 
target which were deleted in the source.
~Pratik
On Fri, Sep 19, 2014 at 8:43 AM, Radhe Radhe <[email protected]> 
wrote:

Hi Pratik,
The source RDBMs tables will not always necessarily contain a 'last-modified' 
column.
I see that 'sqoop-merge' is there:The merge tool allows you to combine two 
datasets where entries in one dataset should overwrite entries of an older 
dataset. For example, an incremental import run in last-modified mode will 
generate multiple datasets in HDFS where successively newer data appears in 
each dataset. The mergetool will "flatten" two datasets into one, taking the 
newest available records for each primary key.

Can 'sqoop-merge' be used to get the updated rows somehow? Will that be a 
viable solution?
Thanks,-RR
From: [email protected]
Date: Fri, 19 Sep 2014 08:18:23 -0700
Subject: Re: Sqoop Incremental mode to get updated data in Hive
To: [email protected]

What other columns do you have in your table which can be used?Either a last 
modified timestamp based column or a version id based column is needed for 
sqoop to know if something new came alongsince the last import. Even if we were 
to write our own version of sqoop, we would still need that information.
Another option could have been to monitor the query logs of the RDBMS and batch 
the logs and then issue imports. But sqoop does not monitor the query logs.
~Pratik
On Fri, Sep 19, 2014 at 6:41 AM, Radhe Radhe <[email protected]> 
wrote:

Hello Experts,
I need to extract data from RDBMS tables to Hive Tables on a timely 
basis(daily/weekly, etc.).
My tables have a 'Primary Key' but does NOT have a 'last-modified' column.
I plan to go as:FIRST RUN: Will use sqoop-import-all-tables command for 
importing all the tables at one go to Hive Tables.EACH SUBSEQUENT RUN: Use 
Sqoop incremental import mode to retrieve only rows newer than some 
previously-imported set of rows.
My question is how can I get the updated rows which got updated in between the 
FIRST RUN and NEXT SUBSEQUENT RUN.For e.g. Say in FIRST RUN: I fetched all 
tables. A CUSTOMER table that has 100 records with CustomerId as Primary Key(1 
to 100) is imported to Hive CUSTOMER table.
And now meanwhile some rows in the Source CUSTOMER table got updated.With my 
NEXT SUBSEQUENT RUN this will fetch rows > 100, thus skipping the updated rows 
which are  < 100. 
How can I get the updated rows on each subsequent run.
Referring to Sqoop documentation this strategy works only if we have a 
last-modified column(which in my case don't have):An alternate table update 
strategy supported by Sqoop is called lastmodified mode. You should use this 
when rows of the source table may be updated, and each such update will set the 
value of a last-modified column to the current timestamp. Rows where the check 
column holds a timestamp more recent than the timestamp specified with 
--last-value are imported.
Any suggestions on how to get the updated data on SUBSEQUENT RUNS using the 
Sqoop Incremental mode?
Thanks,-RR

RE: Sqoop Incremental mode to get updated data in Hive

Reply via email to