malinjawi commented on PR #11900:
URL: https://github.com/apache/gluten/pull/11900#issuecomment-4244594187

   > malinjawi many thanks for taking time for testing.
   > 
   > > Therefore, the two approaches are effectively performance-neutral for 
this test, so JVM handoff path is not clearly better then native in these 
testing.
   > 
   > If the performance is at the same level, should we just pass DVs from Java 
to C++? So we can save some energy from maintaining the C++ DV reader by 
ourselves. Or do you have other reasons to add the C++ DV reader?
   
   Thanks @zhztheplayer.
   
   I looked at this more closely and tried to separate the architectural 
reasoning from the benchmark question.
   
   Based on the results so far, I do not think we have a strong performance 
argument for keeping the C++ DV reader by itself. As it stands now, the observed 
advantage is small and it actually shrank as total query work grew.
   
   
   The more important point is the breakdown inside that DV slice:
   
   - inline cases were almost entirely dominated by native 
`applyDeletionFilter(...)`
   - for moderate stored DVs, native read/load was usually only about `5%` to 
`8%` of the current native total
   - only in denser stored cases did native load rise to about `17%` to `26%`
   - even in those denser cases, zero-copy handoff still stayed close to parity 
with native current
   
   So from an Amdahl's law perspective, even if we made the native reader 
perfect, the upside is usually limited because the hotter path is still native 
apply/filter, not materialization/loading.
   
   
   
   So I think my takeaways are: 
   
   1. Right now we don't have a strong performance argument for keeping the C++ 
DV reader by itself.
   2. I also don't think the data supports the current metadata/string-style 
handoff path, because that path gets clearly more expensive as payload size 
grows.
   3. The dominant cost is still native DV consume/apply/filter in the scan 
path.
   
   I think this also helps separate the native path into more specific 
responsibilities:
   
   - descriptor / protocol handling
   - materialization / loading
   - native consume / deserialization
   - scan-time apply / filter
   - caching / reuse
   
   Based on the current measurements, the hottest and most valuable native 
parts are:
   
   - native consume / deserialization
   - scan-time apply / filter
   - scan integration
   
   The native reader/materialization side is easier to question, because it 
does not currently show a strong speed win.
   
   So my current view is:
   
   - The lowest-maintenance design would be to let JVM / Delta utilities own DV 
materialization/loading and keep native focused on consume/apply/filter.
   
   - The most self-contained approach would be to keep DV handling fully inside 
the native scan, since we can build on top of it as shown in the [Velox 
POC](https://github.com/malinjawi/velox/tree/wip-delta-dv-implementation) and 
the [Gluten 
POC](https://github.com/malinjawi/incubator-gluten/tree/delta-dv-gluten-only-migration).
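   For the first option, here is a minimal sketch of the split of responsibilities. All names and the bitmap wire format are hypothetical stand-ins, not Gluten's or Delta's actual protocol (Delta DVs really use a RoaringBitmap encoding):

```python
# Hypothetical sketch of the JVM handoff split: the "JVM" side materializes
# a deletion vector into a serialized bitmap, and the "native" side only
# consumes it and applies it as a row filter. The little-endian u64-word
# format below is illustrative only.
import struct

def jvm_materialize(deleted_rows, num_rows):
    """Stand-in for JVM/Delta-side DV loading: produce a serialized bitmap."""
    words = [0] * ((num_rows + 63) // 64)
    for r in deleted_rows:
        words[r // 64] |= 1 << (r % 64)
    return struct.pack(f"<{len(words)}Q", *words)

def native_apply(payload, num_rows):
    """Stand-in for the native consume/apply step: keep surviving row ids."""
    words = struct.unpack(f"<{len(payload) // 8}Q", payload)
    return [r for r in range(num_rows) if not (words[r // 64] >> (r % 64)) & 1]

payload = jvm_materialize({1, 3}, num_rows=6)
print(native_apply(payload, num_rows=6))  # rows 0, 2, 4, 5 survive
```

   The point of the sketch is only the boundary: everything above `native_apply` could stay in JVM / Delta utilities, and native would keep just the consume/apply/filter side.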
   
   
   The main architectural reasons I still see for keeping the native DV reader 
are:
   - it is the cleanest base for future native caching
   - it gives native full control over DV file I/O, prefetch, and loading 
policy for future work
   - it aligns better with the broader native Velox DV direction from previous 
efforts
   - it keeps the scan path more self-contained and less dependent on JVM 
materialization at execution time
   - it is potentially more reusable for broader Velox-native DV patches beyond 
Gluten
   
   However, these are future architectural benefits rather than benchmark wins 
demonstrated by this PR.
   
   
   So, I think we could go either way with this implementation, and the design 
choice between the native reader and the JVM handoff should now be made mostly 
on maintenance, integration, and future-work tradeoffs. What do you think would 
be the best approach, all things considered, @zhouyuan @zhztheplayer?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

