the-other-tim-brown commented on PR #13399: URL: https://github.com/apache/hudi/pull/13399#issuecomment-2957517297
> > In Spark, the executors will ask the timeline server to create markers for the files being created, which in turn will launch more Spark tasks if the Spark engine context is used.
>
> I checked the code in `MarkerDirState` for `HoodieEngineContext` usages: one is in the `MarkerDirState` constructor for marker sync, and another is for deleting all markers. Shouldn't both of these happen only once, because `MarkerDirState` itself is a singleton on the timeline server? And shouldn't these two avoid being called while creating markers for data files?

`MarkerDirState` is not a singleton; note that there is a [map](https://github.com/apache/hudi/blob/master/hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/handlers/MarkerHandler.java#L92) of these. We initialize those entries in `getMarkerDirState`, which is called from the path that creates the markers, which in turn runs when we create the data files.

In general, using Spark executors for basic tasks, like listing a directory, applying a simple function to the file statuses, and then deleting those files, is too much overhead. When you are running on a Spark cluster with other tasks in flight, you will wait for an executor to become available (FIFO scheduling by default) just to perform these basic operations.
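To make the non-singleton point concrete, here is a minimal sketch (illustrative class and method names, not the actual Hudi implementation) of a handler that keeps one lazily created state object per marker directory. The first marker request for a directory pays the initialization cost on the request path; later requests for the same directory reuse the cached state:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: one MarkerDirState per marker directory, mirroring
// the map kept in MarkerHandler. Names and behavior here are assumptions
// for illustration only.
public class MarkerHandlerSketch {
    // One state entry per marker directory; entries are created on demand.
    final Map<String, MarkerDirState> markerDirStates = new ConcurrentHashMap<>();

    static class MarkerDirState {
        static final AtomicInteger INIT_COUNT = new AtomicInteger();

        MarkerDirState(String dir) {
            // In the real code the constructor may sync existing markers;
            // that is the per-directory initialization cost paid on first use.
            INIT_COUNT.incrementAndGet();
        }
    }

    MarkerDirState getMarkerDirState(String markerDir) {
        // Lazily initialize: the first marker-create request for a directory
        // constructs its state; subsequent requests reuse it.
        return markerDirStates.computeIfAbsent(markerDir, MarkerDirState::new);
    }

    public static void main(String[] args) {
        MarkerHandlerSketch handler = new MarkerHandlerSketch();
        handler.getMarkerDirState("table/.temp/commit1");
        handler.getMarkerDirState("table/.temp/commit1"); // reused, no re-init
        handler.getMarkerDirState("table/.temp/commit2"); // new directory, new state
        System.out.println(MarkerDirState.INIT_COUNT.get()); // prints 2
    }
}
```

Because initialization happens per directory on the request path, any work the constructor delegates to the engine context is not a one-time cost, which is why avoiding Spark tasks there matters.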
