vinothchandar commented on issue #969: [HUDI-251] JDBC incremental load to HUDI 
DeltaStreamer
URL: https://github.com/apache/incubator-hudi/pull/969#issuecomment-561213523
 
 
   @pushpavanthar Great suggestion.
   
   Let me see if we can structure this solution a bit more. Just supporting raw SQL 
as input for extracting the data, with the Hoodie checkpoint simply being a list 
of strings substituted into a template SQL, could provide a lot of flexibility.
   
   Taking the same example from above. 
   
   The user specifies the following SQL (we can blog about and document this well):
   
   ```
   hoodie.datasource.jdbc.sql=SELECT 
COALESCE(inventory.customers.updated_at,inventory.customers.created_at) as 
created_updated_at, inventory.customers.user_id as user_id, * FROM 
inventory.customers WHERE created_updated_at > ${1} AND created_updated_at < 
${1} AND user_id  > ${2}  ORDER BY created_updated_at ASC
   hoodie.datasource.jdbc.incremental.column.names=created_updated_at, user_id
   hoodie.datasource.jdbc.incremental.column.funcs=max, min
   hoodie.datasource.jdbc.bulkload.sql=<sql to load it once initially or we 
could use some all inclusive filters for column names like user_id > 0 etc >
   ```
   
   The Hoodie checkpoint is a list of string values, one for each of the 
incremental column names, e.g. `2019113048384, 1001` (a timestamp and a user_id). 
We simply replace `${1}` with 2019113048384 and `${2}` with the user_id, i.e. the 
second checkpoint value, execute the SQL, and then use the column funcs to derive 
the next checkpoint values from the fetched data set. I would prefer to keep this 
computation out of the database and in Spark (for the same reason of avoiding more 
load on the database).
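
To make the flow above concrete, here is a rough sketch in plain Python (not Spark, and not actual Hudi code). The helper names `render_sql` and `next_checkpoint` are made up for illustration, the `${n}` placeholders are assumed to be 1-based, and rows are modelled as simple dicts standing in for the fetched data set. Checkpoint values are pasted into the SQL verbatim, so quoting would still need to be handled by whoever writes the template.

```python
import re
from typing import Callable, Dict, List, Sequence

# Matches template placeholders such as ${1}, ${2}, ...
PLACEHOLDER = re.compile(r"\$\{(\d+)\}")

# Column funcs supported by this sketch (assumption: only max and min).
COLUMN_FUNCS: Dict[str, Callable] = {"max": max, "min": min}


def render_sql(template: str, checkpoint: Sequence[str]) -> str:
    """Replace each ${n} with the n-th (1-based) checkpoint value, verbatim."""
    def substitute(match: "re.Match") -> str:
        index = int(match.group(1))
        if not 1 <= index <= len(checkpoint):
            raise ValueError(f"no checkpoint value for placeholder ${{{index}}}")
        return checkpoint[index - 1]

    return PLACEHOLDER.sub(substitute, template)


def next_checkpoint(rows: List[dict], columns: Sequence[str],
                    funcs: Sequence[str], previous: Sequence[str]) -> List[str]:
    """Apply one column func per incremental column over the fetched rows.

    If nothing was fetched, the previous checkpoint is carried forward.
    """
    if not rows:
        return list(previous)
    return [str(COLUMN_FUNCS[func](row[col] for row in rows))
            for col, func in zip(columns, funcs)]


# Checkpoint stored as a comma-separated string, as in the example above.
checkpoint = [v.strip() for v in "2019113048384, 1001".split(",")]

sql = render_sql(
    "SELECT * FROM inventory.customers "
    "WHERE created_updated_at > ${1} AND user_id > ${2}",
    checkpoint,
)

# Stand-in for the rows returned by executing `sql` over JDBC.
fetched = [
    {"created_updated_at": 2019113048390, "user_id": 1005},
    {"created_updated_at": 2019113048400, "user_id": 1002},
]
new_checkpoint = next_checkpoint(
    fetched, ["created_updated_at", "user_id"], ["max", "min"], checkpoint
)
```

In the real source the second step would presumably be an aggregation over the Spark dataset rather than a Python loop, which keeps the extra work off the database as described above.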
   
   All this said, I want to get a basic version working and checked in first :) 
   @taherk77 where are we with this PR at the moment? Are you actively working on it? 
   
