malinjawi commented on PR #11900: URL: https://github.com/apache/gluten/pull/11900#issuecomment-4244594187
> @malinjawi many thanks for taking time for testing.
>
> > Therefore, the two approaches are effectively performance-neutral for this test, so the JVM handoff path is not clearly better than native in this testing.
>
> If the performance is at the same level, should we just pass DVs from Java to C++? So we can save some energy from maintaining the C++ DV reader by ourselves. Or do you have other reasons to add the C++ DV reader?

Thanks @zhztheplayer. I looked at this more closely and tried to separate the architectural reasoning from the benchmark question.

Based on the results so far, I do not think we have a strong performance argument for keeping the C++ DV reader on its own merits: as it stands, the observed advantage is small and actually shrank as total query work grew. The more important point is the breakdown inside that DV slice:

- inline cases were almost entirely dominated by native `applyDeletionFilter(...)`
- for moderate stored DVs, native read/load was usually only about `5%` to `8%` of the native-current total
- only in denser stored cases did native load rise to about `17%` to `26%`
- even in those denser cases, zero-copy handoff still stayed close to parity with native current

So from an Amdahl's law perspective, even if we made the native reader perfect, the upside is usually limited, because the hotter path is still native apply/filter, not materialization/loading.

My takeaways are:

1. Right now we don't have a strong performance argument for keeping the C++ DV reader by itself.
2. I also don't think the data supports the current metadata/string-style handoff path, because that path gets clearly more expensive as payload size grows.
3. The dominant cost is still native DV consume/apply/filter in the scan path.
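To make the Amdahl's law point concrete: if native DV read/load is only a fraction `p` of the scan-time total, then even an infinitely fast reader caps the overall speedup at `1 / (1 - p)`. A minimal sketch (the share values below are the measured ranges quoted above; the class name is just for illustration):

```java
// Amdahl's-law upper bound: if a fraction p of total time is spent in the
// native DV read/load step, eliminating that step entirely yields at most
// a 1 / (1 - p) overall speedup.
public class DvAmdahl {
    static double maxSpeedup(double p) {
        return 1.0 / (1.0 - p);
    }

    public static void main(String[] args) {
        // Measured shares of native DV read/load from the benchmarks above:
        // moderate stored DVs (5%-8%) and denser stored cases (17%-26%).
        double[] shares = {0.05, 0.08, 0.17, 0.26};
        for (double p : shares) {
            System.out.printf("load share %.0f%% -> max speedup %.2fx%n",
                    p * 100, maxSpeedup(p));
        }
    }
}
```

Even in the densest measured case (26% load share), a perfect native reader buys at most about a 1.35x ceiling on the DV slice, which is why the apply/filter path dominates the optimization picture.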
I think this also helps separate the native path into more specific responsibilities:

- descriptor / protocol handling
- materialization / loading
- native consume / deserialization
- scan-time apply / filter
- caching / reuse

Based on the current measurements, the hottest and most valuable native parts are:

- native consume / deserialization
- scan-time apply / filter
- scan integration

The native reader/materialization side is easier to question, because it does not currently show a strong speed win. So my current view is:

- The lowest-maintenance design would be to let JVM / Delta utilities own DV materialization/loading and keep native focused on consume/apply/filter.
- The most self-contained approach would be to let native own the full scan path, since we can build on top of it as shown in the [Velox POC](https://github.com/malinjawi/velox/tree/wip-delta-dv-implementation) and [Gluten POC](https://github.com/malinjawi/incubator-gluten/tree/delta-dv-gluten-only-migration).

The main architectural reasons I still see for keeping the native DV reader are:

- it is the cleanest base for future native caching
- it gives native full control over DV file I/O, prefetch, and loading policy for future work
- it aligns better with the broader native Velox DV direction from previous efforts
- it keeps the scan path more self-contained and less dependent on JVM materialization at execution time
- it is potentially more reusable for broader Velox-native DV patches beyond Gluten

These are future architectural benefits, though, not benchmark wins demonstrated by this PR. So I think we could go either way with this implementation, and the design choice between the native reader and JVM handoff should now be made mostly on maintenance / integration / future tradeoffs.

What do you think would be the best approach, all things considered, @zhouyuan @zhztheplayer?
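The responsibility split above can be sketched as a pair of interfaces. This is purely illustrative: none of these names are real Gluten or Delta APIs, the "native" side is stubbed in Java rather than crossing JNI, and the DV payload is a plain array of deleted row positions instead of a serialized RoaringBitmap:

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the "JVM owns materialization, native owns
// consume/apply/filter" split discussed above.
public class DvSplitSketch {

    // JVM / Delta side: resolve the DV descriptor and hand native a
    // ready-to-consume buffer (direct, so it could cross JNI zero-copy).
    static ByteBuffer materialize(long[] deletedRows) {
        ByteBuffer buf = ByteBuffer.allocateDirect(deletedRows.length * Long.BYTES);
        for (long row : deletedRows) {
            buf.putLong(row);
        }
        buf.flip();
        return buf;
    }

    // "Native" side stub: deserialize the DV once, then apply it as a
    // deletion filter during the scan, returning the surviving row ids.
    static List<Long> applyDeletionFilter(ByteBuffer dv, long numRows) {
        Set<Long> deleted = new HashSet<>();
        while (dv.hasRemaining()) {
            deleted.add(dv.getLong());
        }
        List<Long> kept = new ArrayList<>();
        for (long row = 0; row < numRows; row++) {
            if (!deleted.contains(row)) {
                kept.add(row);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        ByteBuffer dv = materialize(new long[] {1, 3});
        System.out.println(applyDeletionFilter(dv, 5)); // rows 0, 2, 4 survive
    }
}
```

The key property of this shape is that the descriptor/protocol and materialization/loading steps stay entirely on the JVM side, so native only ever sees an opaque, already-loaded buffer at consume time.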
