parthchandra commented on code in PR #1398:
URL: https://github.com/apache/datafusion-comet/pull/1398#discussion_r1955340817
##########
common/src/main/scala/org/apache/comet/CometConf.scala:
##########
@@ -614,7 +614,7 @@ object CometConf extends ShimCometConf {
"Comet is not currently fully compatible with Spark for all datatypes.
" +
s"Set this config to true to allow them anyway. $COMPAT_GUIDE.")
.booleanConf
- .createWithDefault(true)
+ .createWithDefault(false)
Review Comment:
+1. I should have done this.
##########
docs/templates/compatibility-template.md:
##########
@@ -17,12 +17,43 @@
under the License.
-->
+<!--
+  TO MODIFY THIS CONTENT MAKE SURE THAT YOU MAKE YOUR CHANGES TO THE TEMPLATE FILE
+  (docs/templates/compatibility-template.md) AND NOT THE GENERATED FILE
+  (docs/source/user-guide/compatibility.md) OTHERWISE YOUR CHANGES MAY BE LOST
+-->
+
# Compatibility Guide
Comet aims to provide consistent results with the version of Apache Spark that is being used.
This guide offers information about areas of functionality where there are known differences.
+## Parquet Scans
+
+Comet currently has three distinct implementations of the Parquet scan operator. The configuration property
+`spark.comet.scan.impl` is used to select an implementation.
+
+| Implementation          | Description                                                                                                                                                                           |
+| ----------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `native_comet`          | This is the default implementation. It provides strong compatibility with Spark but does not support complex types.                                                                   |
+| `native_datafusion`     | This implementation delegates to DataFusion's `ParquetExec`.                                                                                                                          |
+| `native_iceberg_compat` | This implementation also delegates to DataFusion's `ParquetExec` but uses a hybrid approach of JVM and native code. This scan is designed to be integrated with Iceberg in the future. |
+
+The new (and currently experimental) `native_datafusion` and `native_iceberg_compat` scans are being added to
+provide the following benefits over the `native_comet` implementation:
+
+- Leverage the DataFusion community's ongoing improvements to `ParquetExec`
+- Provide support for reading complex types (structs, arrays, and maps)
+- Remove the use of reusable mutable buffers in Comet, which is complex to maintain
+
+These new implementations are not fully implemented. Some of the current limitations are:
+
+- Scanning Parquet files containing unsigned 8 or 16-bit integers can produce incorrect results. By default, Comet
Review Comment:
I agree. It can be argued that the results produced by Spark are not correct and Comet will actually produce the right result.
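
For context, switching between the scan implementations discussed in the template is a single configuration change. The sketch below is a hypothetical `spark-submit` invocation; only the property name `spark.comet.scan.impl` and the implementation names (`native_comet`, `native_datafusion`, `native_iceberg_compat`) come from the documentation in this diff, while the jar path, application details, and other flags are placeholders:

```shell
# Hypothetical invocation: opt in to the experimental DataFusion-based
# Parquet scan instead of the default native_comet implementation.
# Only spark.comet.scan.impl and its values are taken from the guide above;
# my-app.jar is a placeholder.
spark-submit \
  --conf spark.comet.scan.impl=native_datafusion \
  my-app.jar
```

The same property can be set at runtime (for example via `spark.conf.set` in a session) wherever Spark configuration is accepted.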
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]