Hello all, I am working on setting up incremental imports on a table, using a Last Modified Date column as the check column.
In my testing, I have created an external table and imported sample data with a job created like this:

sqoop job --create import_testing_users -- import --connect "jdbc:sqlserver://11.111.11.11;database=Testing;username=blah;password=blahblah" --table Users --warehouse-dir /user/blah/Reports/testing --append --fields-terminated-by '\001' --incremental lastmodified --check-column Last_Modified

This seems to be picking up the updates properly. However, when I update the timestamps in the check column, I end up with new rows containing the new data alongside old rows that still contain the old data, so my SELECT statements return multiple rows per record. I'm sure I could select where last_modified = max(last_modified), but that's not ideal.

I've found this mentioned as a known problem, and the only solution I've found is to import the new values into a side directory and run sqoop merge on the data to flatten it. I'd like to automate this in a shell script, and was wondering if there is a better way than running a merge operation in the script. Is the merge still the best solution for this situation? Also, does importing into an external table have any impact on this operation?

Incidentally, I found it necessary to add the "--warehouse-dir /path/to/dir --append" arguments on this job, whereas the jobs I set up for a MySQL import (they are "--incremental append" jobs) did not need them. I was just wondering whether that is a driver difference or a job configuration difference?

Thanks,

Devin Suiter
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com
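P.S. For reference, here is a rough sketch of the kind of wrapper script I had in mind, assuming the incremental job is pointed at a side directory rather than appending into the main one. The paths, merge key, and codegen jar/class below are placeholders, not my real values:

#!/bin/bash
# Rough sketch only; directories, merge key, and codegen artifacts are placeholders.

BASE=/user/blah/Reports/testing/Users           # current flattened copy
DELTA=/user/blah/Reports/testing/Users_delta    # side directory the incremental job writes to
MERGED=/user/blah/Reports/testing/Users_merged  # temporary output of the merge

# 1. Run the saved incremental job; Sqoop remembers the last-value between runs.
sqoop job --exec import_testing_users

# 2. Merge old and new data, keeping the record from the new data
#    whenever a key exists in both datasets.
sqoop merge \
  --new-data "$DELTA" \
  --onto "$BASE" \
  --target-dir "$MERGED" \
  --jar-file Users.jar \
  --class-name Users \
  --merge-key user_id

# 3. Swap the merged output into place for the external table.
hadoop fs -rm -r "$BASE"
hadoop fs -mv "$MERGED" "$BASE"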
