yihua opened a new pull request, #5269:
URL: https://github.com/apache/hudi/pull/5269

   ## What is the purpose of the pull request
   
   - In Deltastreamer, we re-instantiate the write client whenever the schema 
changes. The same write client is used by all async table services as well. 
This poses an issue: the re-instantiated write client is passed on to the 
async table service, but if that service is in the middle of a compaction it 
keeps using its local copy of the old write client, and hence may not be able 
to reach the timeline server and will run into connection issues. This patch 
fixes that. 
   - We now have a singleton instance of the embedded timeline service, which 
regular writers and all table services use. Async table services listen for 
write config changes and re-instantiate their write client before any new 
compaction execution. 
   - Multiple re-instantiations of the write client for the regular writer 
(due to schema changes) all use the same singleton embedded timeline server. 
   - Previously, the embedded timeline server was shut down when the write 
client was shut down. This patch fixes that, so that a single instantiation 
and teardown of the embedded timeline server spans the entire process, from 
start to stop. 
   - This also fixes a long-standing issue with Spark structured streaming. 
In that flow, we start a new write client during the first batch and close it 
at the end of the batch, but keep re-using the same writeClient instance for 
subsequent batches. The only core entity impacted was the embedded timeline 
server, since we were closing it when the write client was closed. So after 
the first batch, if the timeline server was enabled, the pipeline would fail 
because the timeline server had been shut down. This patch fixes that as 
well: the embedded timeline server is instantiated externally, so 
writeClient.close() no longer shuts it down, and a single timeline server 
instance lives through the entire pipeline. Previously we hard-coded 
DIRECT-style markers for Spark streaming; after this patch, we should be able 
to relax that. 
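
The lifecycle described above can be illustrated with a minimal sketch. It is not the code in this patch: `SharedTimelineServer`, `SketchWriteClient`, and every method on them are hypothetical stand-ins, assumed only to show a process-wide server that outlives any single write client.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the embedded timeline server (not Hudi's class).
final class SharedTimelineServer {
    private static SharedTimelineServer instance;
    private static final AtomicInteger STARTS = new AtomicInteger();
    private volatile boolean running;

    private SharedTimelineServer() {}

    // Starts the server on first use; later callers get the same instance.
    static synchronized SharedTimelineServer getOrStart() {
        if (instance == null || !instance.running) {
            instance = new SharedTimelineServer();
            instance.running = true;
            STARTS.incrementAndGet();
        }
        return instance;
    }

    // Meant to be called once, when the whole process shuts down.
    static synchronized void stop() {
        if (instance != null) {
            instance.running = false;
        }
    }

    boolean isRunning() { return running; }

    static int startCount() { return STARTS.get(); }
}

// Hypothetical write client that borrows the shared server instead of owning it.
final class SketchWriteClient implements AutoCloseable {
    private final SharedTimelineServer server;
    private final String schema;
    private boolean closed;

    SketchWriteClient(String schema) {
        this.schema = schema;
        this.server = SharedTimelineServer.getOrStart();
    }

    String schema() { return schema; }

    boolean isClosed() { return closed; }

    boolean canReachTimelineServer() { return server.isRunning(); }

    // Releases only client-side state; the shared server keeps running.
    @Override
    public void close() { closed = true; }
}

final class TimelineServerSketch {
    public static void main(String[] args) {
        SketchWriteClient first = new SketchWriteClient("schema_v1");
        first.close(); // e.g. end of the first streaming batch
        // A schema change leads to a second client on the same server.
        SketchWriteClient second = new SketchWriteClient("schema_v2");
        System.out.println(first.canReachTimelineServer());
        System.out.println(second.canReachTimelineServer());
        SharedTimelineServer.stop(); // process shutdown
    }
}
```

In this sketch, closing a client never touches the server, so a client that is closed after the first batch and then reused can still reach it, and a client created after a schema change attaches to the already running instance rather than starting a new one.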
   
   
   ## Brief change log
   
   - Fixed Deltastreamer and the Spark streaming sink to ensure the timeline 
server survives multiple instantiations of the write client by different 
writers. 
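
As a rough illustration of how an async table service could pick up a new write config without disturbing a compaction that is already running, here is a minimal sketch. All names (`SketchWriteConfig`, `SketchCompactionClient`, `SketchAsyncCompactService`, `onWriteConfigChange`, `runCompaction`) are hypothetical and are not taken from the patch; the sketch assumes a single compaction thread.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical, immutable write config carrying only a schema name.
final class SketchWriteConfig {
    final String schema;

    SketchWriteConfig(String schema) { this.schema = schema; }
}

// Hypothetical client owned by the table service for one config.
final class SketchCompactionClient {
    final SketchWriteConfig config;

    SketchCompactionClient(SketchWriteConfig config) { this.config = config; }

    String compact(String instant) {
        return instant + " compacted with " + config.schema;
    }
}

final class SketchAsyncCompactService {
    private final AtomicReference<SketchWriteConfig> latestConfig;
    private SketchCompactionClient client; // touched only by the compaction thread

    SketchAsyncCompactService(SketchWriteConfig initial) {
        this.latestConfig = new AtomicReference<>(initial);
        this.client = new SketchCompactionClient(initial);
    }

    // Called by the writer on a schema change; only records the new config.
    void onWriteConfigChange(SketchWriteConfig updated) {
        latestConfig.set(updated);
    }

    // Called by the compaction thread; swaps the client between executions,
    // never in the middle of one.
    String runCompaction(String instant) {
        SketchWriteConfig wanted = latestConfig.get();
        if (client.config != wanted) {
            client = new SketchCompactionClient(wanted);
        }
        return client.compact(instant);
    }
}

final class AsyncServiceSketch {
    public static void main(String[] args) {
        SketchAsyncCompactService service =
            new SketchAsyncCompactService(new SketchWriteConfig("schema_v1"));
        System.out.println(service.runCompaction("001"));
        service.onWriteConfigChange(new SketchWriteConfig("schema_v2"));
        System.out.println(service.runCompaction("002"));
    }
}
```

The design choice shown here is that the writer never replaces the service's client directly; it only publishes the latest config, and the service rebuilds its own client at the boundary between two compaction executions.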
   
   ## Verify this pull request
   
   This change added tests and can be verified as follows:
   
     - *Manually verified the change by running a job locally.*
     - For structured streaming, existing tests cover all flows. 
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.
   

