The docs say that "The merge tool is typically run after an incremental import with the date-last-modified mode (sqoop import --incremental lastmodified …)." Also, I am not sure how the merge tool handles deleted records, looking at the code of merge tool, i don't think it deletes the corresponding records from target which were deleted in the source.
~Pratik On Fri, Sep 19, 2014 at 8:43 AM, Radhe Radhe <[email protected]> wrote: > Hi Pratik, > > The source RDBMs tables will not always necessarily contain a > 'last-modified' column. > > I see that 'sqoop-merge' is there: > The merge tool allows you to combine two datasets where entries in one > dataset should overwrite entries of an older dataset. For example, an > incremental import run in last-modified mode will generate multiple > datasets in HDFS where successively newer data appears in each dataset. The > mergetool will "flatten" two datasets into one, taking the newest > available records for each primary key. > > Can 'sqoop-merge' be used to get the updated rows somehow? Will that be a > viable solution? > > Thanks, > -RR > ------------------------------ > From: [email protected] > Date: Fri, 19 Sep 2014 08:18:23 -0700 > Subject: Re: Sqoop Incremental mode to get updated data in Hive > To: [email protected] > > > What other columns do you have in your table which can be used? > Either a last modified timestamp based column or a version id based column > is needed for sqoop to know if something new came along > since the last import. Even if we were to write our own version of sqoop, > we would still need that information. > > Another option could have been to monitor the query logs of the RDBMS and > batch the logs and then issue imports. But sqoop does not monitor the query > logs. > > ~Pratik > > On Fri, Sep 19, 2014 at 6:41 AM, Radhe Radhe <[email protected] > > wrote: > > Hello Experts, > > I need to extract data from RDBMS tables to Hive Tables on a timely > basis(daily/weekly, etc.). > > My tables have a '*Primary Key*' but does NOT have a '*last-modified*' > column. > > I plan to go as: > *FIRST RUN:* Will use sqoop-import-all-tables command for importing all > the tables at one go to Hive Tables. > *EACH SUBSEQUENT RUN:* Use Sqoop incremental import mode to retrieve only > rows newer than some previously-imported set of rows. > > My question is how can I get the updated rows which got updated in between > the *FIRST RUN *and *NEXT SUBSEQUENT RUN*. > For e.g. > Say in *FIRST RUN: *I fetched all tables. > A CUSTOMER table that has 100 records with CustomerId as Primary Key(1 to > 100) is imported to Hive CUSTOMER table. > > And now meanwhile some rows in the Source CUSTOMER table got updated. > With my *NEXT SUBSEQUENT RUN *this will fetch rows > 100, thus skipping > the updated rows which are < 100. > > How can I get the updated rows on each subsequent run. > > Referring to* Sqoop documentation* this strategy works only if we have a > last-modified column(*which in my case don't have*): > An alternate table update strategy supported by Sqoop is called > lastmodified mode. > You should use this when rows of the source table may be updated, and each > such update will set the value of a last-modified column to the current > timestamp. > Rows where the check column holds a timestamp more recent than the > timestamp specified with --last-value are imported. > > Any suggestions on how to get the updated data on SUBSEQUENT RUNS using > the Sqoop Incremental mode? > > Thanks, > -RR > > >
