pchintar commented on PR #4591: URL: https://github.com/apache/datafusion-comet/pull/4591#issuecomment-4851412494
Hi @andygrove , thanks again for the detailed review! I've addressed your review comments in the latest update: * Reworked the planner logic to determine whether the native cache path should be used from the configured cache serializer together with the feature flag, eliminating the planning-time Spark job previously used to inspect cached batches. * Removed the unnecessary driver-side `collect().distinct()` from the serializer read path * Replaced the length-based identity projection shortcut with an explicit identity-index check so only true identity projections bypass projection, while reordered projections continue to project columns correctly. * Updated the test suite to remove the Spark 4.x-specific `classic.Dataset` dependency from the shared test code, keeping the suite cross-version compatible. * For the ownership concern, I took another look at the current implementation but didn't make any further changes. The current implementation already decodes only the requested columns through `convertCachedBatchToColumnarBatch`, and I couldn't identify a way to simplify that path further without changing its behavior. I also added the new `CometInMemoryCacheBenchmark` under `spark/src/test/scala/org/apache/spark/sql/benchmark/`, following the existing benchmark style. It compares repeated scans of a cached table with the native cache path enabled and disabled, together with a selective-filter variant to exercise the pruning path. On my local machine, I observed modest improvements for both workloads. When you have a chance, could you please take a look at the benchmark? I tried to follow the requested benchmark structure but if you'd prefer any further changes to the workload/setup, I would be happy to update it further. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
