pushpavanthar commented on issue #969: [HUDI-251] JDBC incremental load to HUDI 
DeltaStreamer
URL: https://github.com/apache/incubator-hudi/pull/969#issuecomment-559230742
 
 
   I would like to add two points to make this feature more generic:
   
   - [ ] We might need support for a combination of more than one incrementing 
column. Incrementing columns can be of the following types:
   1. Timestamp column
   2. Auto Incrementing column
   3. Timestamp + Auto Incrementing.
   Instead of the code figuring out the incremental pull strategy, it would be 
better if the user provided it through config for each table.
   When a timestamp is used as the incrementing column, more than one column 
can contribute to the strategy. For example, when a row is created, only 
`created_at` is set, and let's say `updated_at` is null by default. When 
the same row is updated, `updated_at` is assigned a timestamp. In such 
scenarios it's wise to consider both columns when forming the query. 
   
   - [ ] We need to sort rows by the incrementing columns mentioned above so 
that they can be fetched in chunks (you can make use of `defaultFetchSize` for 
MySQL). I understand this adds load on the database, but it lets us track the 
last pulled timestamp or auto-incrementing value and retry from that point for 
subsequent batches. This will be a saviour during failures. 
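
   The chunked, resumable pull described in the second point can be sketched 
as follows. This is only an illustration, not the DeltaStreamer implementation: 
it uses Python's built-in `sqlite3` in place of a JDBC/MySQL connection, and the 
`customers` table layout, the integer timestamps, and the helper names 
`make_demo_connection` and `fetch_in_chunks` are all assumptions made for the 
example. It also assumes the "Timestamp + Auto Incrementing" strategy, using 
`id` as a tie-breaker so that rows sharing a timestamp are not skipped at a 
chunk boundary.

```python
import sqlite3


def make_demo_connection():
    """Build a small in-memory table (hypothetical schema, integer timestamps)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers ("
        "id INTEGER PRIMARY KEY, name TEXT, created_at INTEGER, updated_at INTEGER)"
    )
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?, ?, ?)",
        [
            (1, "a", 100, None),  # never updated: created_at is used
            (2, "b", 100, 130),   # updated: updated_at is used
            (3, "c", 110, None),
            (4, "d", 110, None),  # same timestamp as id 3
            (5, "e", 120, 130),
        ],
    )
    return conn


def fetch_in_chunks(conn, last_seen, upper_bound, chunk_size):
    """Yield (rows, checkpoint) pairs, ordered by timestamp and then id.

    last_seen and checkpoint are (timestamp, id) pairs. Persisting the
    checkpoint after each chunk lets a failed run resume from that point.
    """
    while True:
        rows = conn.execute(
            """
            SELECT id, name, COALESCE(updated_at, created_at) AS ts
            FROM customers
            WHERE (COALESCE(updated_at, created_at) > :ts
                   OR (COALESCE(updated_at, created_at) = :ts AND id > :id))
              AND COALESCE(updated_at, created_at) < :upper
            ORDER BY ts ASC, id ASC
            LIMIT :n
            """,
            {"ts": last_seen[0], "id": last_seen[1],
             "upper": upper_bound, "n": chunk_size},
        ).fetchall()
        if not rows:
            return
        # The last row of the chunk becomes the new checkpoint.
        last_seen = (rows[-1][2], rows[-1][0])
        yield rows, last_seen
```

   Calling `fetch_in_chunks(conn, (0, 0), 200, 2)` walks the table two rows at 
a time. If a run fails after the first chunk, calling it again with the last 
saved checkpoint continues from the next unread row instead of starting over.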
   
   A sample MySQL query that uses the timestamp columns `created_at` and 
`updated_at` as incrementing columns might look like:
   `SELECT * FROM inventory.customers WHERE 
COALESCE(inventory.customers.updated_at, inventory.customers.created_at) > 
$last_recorded_time AND 
COALESCE(inventory.customers.updated_at, inventory.customers.created_at) < 
$current_time ORDER BY 
COALESCE(inventory.customers.updated_at, inventory.customers.created_at) ASC`
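
   To illustrate the idea of the user supplying the pull strategy per table, 
here is a minimal sketch of a config object that generates a query of the shape 
shown above. The names `IncrementalConfig` and `build_incremental_query`, and 
the `:last_ts` / `:last_id` / `:current_ts` bind parameters, are hypothetical 
and are not existing Hudi configs or APIs. A JDBC implementation would use `?` 
placeholders instead.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class IncrementalConfig:
    """Per-table pull strategy supplied by the user (hypothetical shape)."""
    table: str
    timestamp_columns: Tuple[str, ...] = ()    # e.g. ("updated_at", "created_at")
    incrementing_column: Optional[str] = None  # e.g. "id"


def build_incremental_query(cfg):
    """Return a parameterized SQL string for the configured strategy.

    Table and column names are interpolated directly, so they must come
    from trusted configuration. Values are left as bind parameters.
    """
    if not cfg.timestamp_columns and cfg.incrementing_column is None:
        raise ValueError("configure at least one incrementing column")

    inc = f"{cfg.table}.{cfg.incrementing_column}" if cfg.incrementing_column else None

    if cfg.timestamp_columns:
        cols = [f"{cfg.table}.{c}" for c in cfg.timestamp_columns]
        # Several timestamp columns collapse into one value via COALESCE.
        ts = cols[0] if len(cols) == 1 else "COALESCE(" + ", ".join(cols) + ")"
        lower = f"{ts} > :last_ts"
        order = [f"{ts} ASC"]
        if inc:
            # Timestamp + auto-incrementing: the id breaks timestamp ties.
            lower = f"({ts} > :last_ts OR ({ts} = :last_ts AND {inc} > :last_id))"
            order.append(f"{inc} ASC")
        where = f"{lower} AND {ts} < :current_ts"
    else:
        # Auto-incrementing column only.
        where = f"{inc} > :last_id"
        order = [f"{inc} ASC"]

    return f"SELECT * FROM {cfg.table} WHERE {where} ORDER BY {', '.join(order)}"
```

   With `IncrementalConfig("inventory.customers", ("updated_at", "created_at"))` 
this produces the same `COALESCE`-based query as the sample, with the 
`$`-prefixed values replaced by bind parameters. Setting only 
`incrementing_column` yields a plain `id > :last_id` pull.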

