Thanks Joanne, so normally what's the best practice here if I were to use sqoop? For now, I feel like just doing a brand new re-import everytime makes more sense, but I will need to keep eyes on the performance side, what you think?
Shengjie On 8 August 2013 00:15, Joanne Chan <[email protected]> wrote: > 1. No > 2. Then, you are out of luck. > > Sqoop isn't magic, imagine how would sqoop know what data was updated > without any indicators? > > > On Wed, Aug 7, 2013 at 12:12 PM, shengjie min <[email protected]>wrote: > >> >> >> - Add an update_time column to the source table and do the incremental >> load by that update_time column. >> >> two questions to that then: >> >> 1. will sqoop job updates that field "update_time" automatically when it >> runs OR my application needs to write that field? >> 2. What if I don't wanna or I can't touch the schema on the source table? >> >> THanks, >> Shengjie >> >> On 8 Aug 2013, at 00:06, Joanne Chan <[email protected]> wrote: >> >> Add an update_time column to the source table and do the incremental load >> by that update_time column. >> >> >> On Wed, Aug 7, 2013 at 12:04 PM, shengjie min <[email protected]>wrote: >> >>> Hi guys, >>> >>> TO simplify my question, Let's say, I have a mysql table called >>> 'student', looks like this: >>> >>> +----+----------+-----+ >>> | id | name | sex | >>> +----+----------+-----+ >>> | 1 | Alice | 0 | >>> | 2 | Bob | 1 | >>> | 3 | Charles | 1 | >>> +----+----------+-----+ >>> >>> I want to import this table to HBase periodically which means I will run >>> this sqoop job periodically. There are two goals: >>> >>> A. every time there is a new record inserted to mysql table, e.g. (4, >>> David, 1), I hope my next sqoop import will catch it and put it in HBase. >>> B. if there is any updates have been made to mysql rows 1, 2, 3, I want >>> to have the updates in HBase too after next round sqoop import. >>> >>> I checked two types incremental updates sqoop has: Append mode seems >>> only satisfied goal A while Last-modified mode will require my mysql table >>> has a timestamp column for each row(which I don't in real life). I know if >>> I don't use incremental updates options at all, I can just get way with it >>> by running a fresh import every time, but if my mysql table is really huge >>> and fresh import might be a performance killer. >>> >>> Is there anyway I can just do incremental updates instead of having to >>> re-run the whole import to get NEW RECORDS + UPDATES ON OLD ROWS? >>> >>> >>> Shengjie >> >> >> >> >> -- >> -- JChan >> >> >> > > > -- > -- JChan > > -- All the best, Shengjie Min
