[GitHub] [hudi] liujinhui1994 commented on pull request #2438: [HUDI-1447] DeltaStreamer kafka source supports consuming from specified timestamp
liujinhui1994 commented on pull request #2438: URL: https://github.com/apache/hudi/pull/2438#issuecomment-1073504409 @pratyakshsharma -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] liujinhui1994 commented on pull request #2438: [HUDI-1447] DeltaStreamer kafka source supports consuming from specified timestamp
liujinhui1994 commented on pull request #2438: URL: https://github.com/apache/hudi/pull/2438#issuecomment-1073504190

The purpose of introducing timestamps: when users want to start consuming from a particular position, DeltaStreamer previously only let them specify explicit checkpoint offsets. A Kafka topic may have 50+ partitions, and users have to configure the checkpoint string by hand. Introducing a timestamp simplifies this.

Regarding your example: I think you are right, and I agree with your idea. Partition 2 should not be populated with this value. At the time, the main goal of this PR was to reduce the complexity of user configuration and make consuming data as simple as possible. The partition 2 example makes sense for some businesses. It may conflict with your current scenario, and I feel we can improve this and make it better.
[GitHub] [hudi] liujinhui1994 commented on pull request #2438: [HUDI-1447] DeltaStreamer kafka source supports consuming from specified timestamp
liujinhui1994 commented on pull request #2438: URL: https://github.com/apache/hudi/pull/2438#issuecomment-881820983 @nsivabalan Thank you for your attention and your patient help!
[GitHub] [hudi] liujinhui1994 commented on pull request #2438: [HUDI-1447] DeltaStreamer kafka source supports consuming from specified timestamp
liujinhui1994 commented on pull request #2438: URL: https://github.com/apache/hudi/pull/2438#issuecomment-881278532 @nsivabalan I have completed the changes you requested. Please take a look. Thank you very much for your help!
[GitHub] [hudi] liujinhui1994 commented on pull request #2438: [HUDI-1447] DeltaStreamer kafka source supports consuming from specified timestamp
liujinhui1994 commented on pull request #2438: URL: https://github.com/apache/hudi/pull/2438#issuecomment-881203019

> Let's try to land this in by weekend. It's been hanging for quite some time.

OK. Sorry for the delay, I'll deal with it now.
[GitHub] [hudi] liujinhui1994 commented on pull request #2438: [HUDI-1447] DeltaStreamer kafka source supports consuming from specified timestamp
liujinhui1994 commented on pull request #2438: URL: https://github.com/apache/hudi/pull/2438#issuecomment-875975684 Working on it.
[GitHub] [hudi] liujinhui1994 commented on pull request #2438: [HUDI-1447] DeltaStreamer kafka source supports consuming from specified timestamp
liujinhui1994 commented on pull request #2438: URL: https://github.com/apache/hudi/pull/2438#issuecomment-863968623 DeltaSync should reset this configuration (...kafka.checkpoint.type), similar to how we reset checkpoints. To do that, we may need to store it in the metadata file; changing it only in memory carries a greater risk. I have submitted my latest implementation. Please help check whether it is feasible, @nsivabalan.
[GitHub] [hudi] liujinhui1994 commented on pull request #2438: [HUDI-1447] DeltaStreamer kafka source supports consuming from specified timestamp
liujinhui1994 commented on pull request #2438: URL: https://github.com/apache/hudi/pull/2438#issuecomment-852970149

I am currently facing a problem and would like to hear your opinion.

After we add this type, hoodie.deltastreamer.source.kafka.checkpoint.type=timestamp, does deltastreamer.checkpoint.key keep its current format? The format is still: topicName,0:123,1:456

If we keep that format, then when we specify, for example, --checkpoint 1622635064, we need to determine the relationship between commitMetadata.getMetadata(CHECKPOINT_KEY) and --checkpoint 1622635064 in org.apache.hudi.utilities.deltastreamer.DeltaSync#readFromSource. This seems to contradict the outcome of our discussion, which was not to add Kafka-dependent code to DeltaSync.

Do you have any suggestions for this? Thanks, @nsivabalan
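The offset checkpoint format mentioned in this comment, topicName,0:123,1:456, can be illustrated with a small parser. This is only a sketch in Python, not Hudi's KafkaOffsetGen code: the function name `parse_offset_checkpoint` is hypothetical, and it assumes the string is the topic name followed by comma-separated partition:offset pairs.

```python
from typing import Dict, Tuple


def parse_offset_checkpoint(checkpoint: str) -> Tuple[str, Dict[int, int]]:
    """Split 'topicName,0:123,1:456' into the topic and a partition -> offset map.

    Hypothetical helper for illustration; not part of Hudi.
    """
    topic, *pairs = checkpoint.split(",")
    offsets: Dict[int, int] = {}
    for pair in pairs:
        partition, offset = pair.split(":")
        offsets[int(partition)] = int(offset)
    return topic, offsets
```

Under this reading, a bare value such as 1622635064 has no partition:offset pairs at all, so the string by itself does not say whether it is a timestamp or a topic name. Some extra signal, such as the checkpoint type setting, is needed to tell the two apart.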
[GitHub] [hudi] liujinhui1994 commented on pull request #2438: [HUDI-1447] DeltaStreamer kafka source supports consuming from specified timestamp
liujinhui1994 commented on pull request #2438: URL: https://github.com/apache/hudi/pull/2438#issuecomment-845791067

> @liujinhui1994 : were you able to make progress on this? It would be nice to have this in before the next release.

Sorry, I was too busy with work before. I have just sorted out the whole idea of this PR and clarified the goal, and I will start soon.
[GitHub] [hudi] liujinhui1994 commented on pull request #2438: [HUDI-1447] DeltaStreamer kafka source supports consuming from specified timestamp
liujinhui1994 commented on pull request #2438: URL: https://github.com/apache/hudi/pull/2438#issuecomment-811856655

> Myself and Nishith discussed this. Here is our proposal.
> Let's rely on Deltastreamer.Config.checkpoint to pass in any type of checkpoint.
> We can add another config called "checkpoint.type" which could default to string for all default checkpoints. For the checkpoint of interest in this PR, we could set the value of this new config to "timestamp".
>
> With this, it's up to each source to parse and interpret the checkpoint value, and DeltaSync does not need to deal with different checkpointing formats.
>
> Having said this, DeltaSync readFromSource() should not have any changes in this diff.
> KafkaOffsetGen should have logic to parse different checkpoint values, based on two values (deltastreamer.config.checkpoint and checkpoint.type).
>
> With this, we also move source-specific checkpointing logic into the source-specific class and do not leak it to DeltaSync, which should be agnostic to different sources.
>
> @liujinhui1994 : Let me know what you think. Happy to chat more on this.

Great, I will modify this PR based on this.
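The proposal quoted in this comment could look roughly like the following. This is a hedged Python sketch, not the code from this PR: `resolve_start_offsets` and the `lookup` callable are illustrative assumptions, and the offset string is assumed to follow the topicName,0:123,1:456 form discussed in this thread. The `lookup` stands in for whatever the source uses to map a timestamp to per-partition offsets (for Kafka, something along the lines of the consumer's offsetsForTimes).

```python
from typing import Callable, Dict

# Hypothetical type: given a timestamp, return a partition -> offset map.
TimestampLookup = Callable[[int], Dict[int, int]]


def resolve_start_offsets(checkpoint: str, checkpoint_type: str,
                          lookup: TimestampLookup) -> Dict[int, int]:
    """Interpret the checkpoint inside the source, based on checkpoint_type."""
    if checkpoint_type == "timestamp":
        # The whole checkpoint is a single timestamp; the source turns it into offsets.
        return lookup(int(checkpoint))
    if checkpoint_type == "string":
        # Default form 'topicName,0:123,1:456'; the leading topic name is skipped.
        pairs = checkpoint.split(",")[1:]
        return {int(p): int(o) for p, o in (pair.split(":") for pair in pairs)}
    raise ValueError(f"unsupported checkpoint type: {checkpoint_type}")
```

Because the branching on the type lives entirely in this source-side function, the caller only passes the raw checkpoint string and the type through, which is the separation the proposal asks for.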
[GitHub] [hudi] liujinhui1994 commented on pull request #2438: [HUDI-1447] DeltaStreamer kafka source supports consuming from specified timestamp
liujinhui1994 commented on pull request #2438: URL: https://github.com/apache/hudi/pull/2438#issuecomment-799512747 No problem.
[GitHub] [hudi] liujinhui1994 commented on pull request #2438: [HUDI-1447] DeltaStreamer kafka source supports consuming from specified timestamp
liujinhui1994 commented on pull request #2438: URL: https://github.com/apache/hudi/pull/2438#issuecomment-787662595 The current implementation is mainly in KafkaOffsetGen @wangxianghu
[GitHub] [hudi] liujinhui1994 commented on pull request #2438: [HUDI-1447] DeltaStreamer kafka source supports consuming from specified timestamp
liujinhui1994 commented on pull request #2438: URL: https://github.com/apache/hudi/pull/2438#issuecomment-786452099 I will add the unit test; please review after that.
[GitHub] [hudi] liujinhui1994 commented on pull request #2438: [HUDI-1447] DeltaStreamer kafka source supports consuming from specified timestamp
liujinhui1994 commented on pull request #2438: URL: https://github.com/apache/hudi/pull/2438#issuecomment-782589121 I have verified this; please help review. @wangxianghu @yanghua @nsivabalan
[GitHub] [hudi] liujinhui1994 commented on pull request #2438: [HUDI-1447] DeltaStreamer kafka source supports consuming from specified timestamp
liujinhui1994 commented on pull request #2438: URL: https://github.com/apache/hudi/pull/2438#issuecomment-782588952 @yanghua @wangxianghu @nsivabalan I have verified this; please help review.