coderfender commented on code in PR #2989: URL: https://github.com/apache/datafusion-comet/pull/2989#discussion_r2648750195
########## docs/source/user-guide/latest/compatibility.md: ########## @@ -20,70 +20,105 @@ under the License. # Compatibility Guide Comet aims to provide consistent results with the version of Apache Spark that is being used. +This guide documents known differences and limitations. -This guide offers information about areas of functionality where there are known differences. +--- ## Parquet Comet has the following limitations when reading Parquet files: -- Comet does not support reading decimals encoded in binary format. -- No support for default values that are nested types (e.g., maps, arrays, structs). Literal default values are supported. +- Decimal values encoded using binary format are not supported. +- Default values for nested types (e.g., maps, arrays, structs) are not supported. + Literal default values are supported. + +--- ## ANSI Mode -Comet will fall back to Spark for the following expressions when ANSI mode is enabled. These expressions can be enabled by setting -`spark.comet.expression.EXPRNAME.allowIncompatible=true`, where `EXPRNAME` is the Spark expression class name. See -the [Comet Supported Expressions Guide](expressions.md) for more information on this configuration setting. +When ANSI mode is enabled, Comet falls back to Spark for some expressions to ensure correctness. + +Affected expressions include: + +- `Average` +- `Cast` (in some cases) + +Incompatible expressions can be forced to run natively using: + +```text +spark.comet.expression.EXPRNAME.allowIncompatible=true +``` + +This is not recommended for production use. -- Average -- Cast (in some cases) +Tracking issue for full ANSI support: +https://github.com/apache/datafusion-comet/issues/313 -There is an [epic](https://github.com/apache/datafusion-comet/issues/313) where we are tracking the work to fully implement ANSI support. +--- ## Floating-point Number Comparison -Spark normalizes NaN and zero for floating point numbers for several cases. See `NormalizeFloatingNumbers` optimization rule in Spark. -However, one exception is comparison. Spark does not normalize NaN and zero when comparing values -because they are handled well in Spark (e.g., `SQLOrderingUtil.compareFloats`). But the comparison -functions of arrow-rs used by DataFusion do not normalize NaN and zero (e.g., [arrow::compute::kernels::cmp::eq](https://docs.rs/arrow/latest/arrow/compute/kernels/cmp/fn.eq.html#)). -So Comet adds additional normalization expression of NaN and zero for comparisons, and may still have differences -to Spark in some cases, especially when the data contains both positive and negative zero. This is likely an edge -case that is not of concern for many users. If it is a concern, setting `spark.comet.exec.strictFloatingPoint=true` -will make relevant operations fall back to Spark. +Spark normalizes NaN and signed zero in most cases, except during comparisons. +DataFusion (via Arrow) does not apply the same normalization. + +Comet adds additional normalization logic but may still differ in rare edge cases, +especially when both positive and negative zero are present. + +To force Spark execution for strict correctness: + +```text +spark.comet.exec.strictFloatingPoint=true +``` + +--- ## Incompatible Expressions -Expressions that are not 100% Spark-compatible will fall back to Spark by default and can be enabled by setting -`spark.comet.expression.EXPRNAME.allowIncompatible=true`, where `EXPRNAME` is the Spark expression class name. 
See -the [Comet Supported Expressions Guide](expressions.md) for more information on this configuration setting. +Expressions that are not fully Spark-compatible fall back to Spark by default. +They may be enabled using: + +```text +spark.comet.expression.EXPRNAME.allowIncompatible=true +``` + +See the [Comet Supported Expressions Guide](expressions.md) for details. + +--- ## Regular Expressions -Comet uses the Rust regexp crate for evaluating regular expressions, and this has different behavior from Java's -regular expression engine. Comet will fall back to Spark for patterns that are known to produce different results, but -this can be overridden by setting `spark.comet.regexp.allowIncompatible=true`. +Comet uses Rust's regex crate, which differs from Java's regex engine used by Spark. +Patterns known to produce different results will fall back to Spark. + +This behavior can be overridden using: + +```text +spark.comet.regexp.allowIncompatible=true +``` + +--- ## Window Functions -Comet's support for window functions is incomplete and known to be incorrect. It is disabled by default and -should not be used in production. The feature will be enabled in a future release. Tracking issue: [#2721](https://github.com/apache/datafusion-comet/issues/2721). +Window functions are disabled by default. +Support is incomplete and may produce incorrect results. + +Tracking issue: +https://github.com/apache/datafusion-comet/issues/2721 + +--- ## Cast -Cast operations in Comet fall into three levels of support: +Cast operations are classified into three categories: -- **Compatible**: The results match Apache Spark -- **Incompatible**: The results may match Apache Spark for some inputs, but there are known issues where some inputs - will result in incorrect results or exceptions. The query stage will fall back to Spark by default. Setting - `spark.comet.expression.Cast.allowIncompatible=true` will allow all incompatible casts to run natively in Comet, but this is not - recommended for production use. -- **Unsupported**: Comet does not provide a native version of this cast expression and the query stage will fall back to - Spark. +- **Compatible** – Results match Spark +- **Incompatible** – May differ for some inputs; fallback by default +- **Unsupported** – Always falls back to Spark ### Compatible Casts -The following cast operations are generally compatible with Spark except for the differences noted here. +The following casts are generally compatible with Spark, with noted differences. Review Comment: Not sure if this intended ? ########## docs/source/user-guide/latest/compatibility.md: ########## @@ -20,70 +20,105 @@ under the License. # Compatibility Guide Comet aims to provide consistent results with the version of Apache Spark that is being used. +This guide documents known differences and limitations. -This guide offers information about areas of functionality where there are known differences. +--- ## Parquet Comet has the following limitations when reading Parquet files: -- Comet does not support reading decimals encoded in binary format. -- No support for default values that are nested types (e.g., maps, arrays, structs). Literal default values are supported. +- Decimal values encoded using binary format are not supported. +- Default values for nested types (e.g., maps, arrays, structs) are not supported. + Literal default values are supported. + +--- ## ANSI Mode -Comet will fall back to Spark for the following expressions when ANSI mode is enabled. 
These expressions can be enabled by setting -`spark.comet.expression.EXPRNAME.allowIncompatible=true`, where `EXPRNAME` is the Spark expression class name. See -the [Comet Supported Expressions Guide](expressions.md) for more information on this configuration setting. +When ANSI mode is enabled, Comet falls back to Spark for some expressions to ensure correctness. + +Affected expressions include: + +- `Average` +- `Cast` (in some cases) + +Incompatible expressions can be forced to run natively using: + +```text +spark.comet.expression.EXPRNAME.allowIncompatible=true +``` + +This is not recommended for production use. -- Average -- Cast (in some cases) +Tracking issue for full ANSI support: +https://github.com/apache/datafusion-comet/issues/313 Review Comment: Not sure if this intended ? ########## docs/source/user-guide/latest/compatibility.md: ########## @@ -161,7 +196,7 @@ The following cast operations are generally compatible with Spark except for the | string | float | | | string | double | | | string | binary | | -| string | date | Only supports years between 262143 BC and 262142 AD | +| string | date | Years supported: 262143 BC to 262142 AD | Review Comment: Not sure if this intended ? ########## docs/source/user-guide/latest/compatibility.md: ########## @@ -136,14 +171,14 @@ The following cast operations are generally compatible with Spark except for the | float | integer | | | float | long | | | float | double | | -| float | string | There can be differences in precision. For example, the input "1.4E-45" will produce 1.0E-45 instead of 1.4E-45 | +| float | string | Precision differences possible | | double | boolean | | | double | byte | | | double | short | | | double | integer | | | double | long | | | double | float | | -| double | string | There can be differences in precision. For example, the input "1.4E-45" will produce 1.0E-45 instead of 1.4E-45 | +| double | string | Precision differences possible | Review Comment: Not sure if this intended ? ########## docs/source/user-guide/latest/compatibility.md: ########## @@ -136,14 +171,14 @@ The following cast operations are generally compatible with Spark except for the | float | integer | | | float | long | | | float | double | | -| float | string | There can be differences in precision. For example, the input "1.4E-45" will produce 1.0E-45 instead of 1.4E-45 | +| float | string | Precision differences possible | Review Comment: Not sure if this intended ? ########## docs/source/user-guide/latest/compatibility.md: ########## @@ -20,70 +20,105 @@ under the License. # Compatibility Guide Comet aims to provide consistent results with the version of Apache Spark that is being used. +This guide documents known differences and limitations. -This guide offers information about areas of functionality where there are known differences. +--- ## Parquet Comet has the following limitations when reading Parquet files: -- Comet does not support reading decimals encoded in binary format. -- No support for default values that are nested types (e.g., maps, arrays, structs). Literal default values are supported. +- Decimal values encoded using binary format are not supported. +- Default values for nested types (e.g., maps, arrays, structs) are not supported. + Literal default values are supported. Review Comment: Not sure if we are covering Decimal -> Binary format documentation in this change ? 
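For context on the `float`/`double` → `string` precision rows quoted above, here is a minimal spark-shell sketch (not part of the diff or of the reviewer's comments) of the difference that the original wording spelled out; it assumes an existing SparkSession named `spark` running with the Comet plugin:

```scala
// Minimal sketch, assuming a spark-shell session (`spark`) with the Comet plugin enabled.
// The original doc text noted that casting this subnormal float to string can yield
// "1.0E-45" under Comet's native cast, where Spark itself renders "1.4E-45".
spark.sql("SELECT CAST(CAST('1.4E-45' AS FLOAT) AS STRING) AS s").show(truncate = false)
```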
########## docs/source/user-guide/latest/compatibility.md: ########## @@ -152,7 +187,7 @@ The following cast operations are generally compatible with Spark except for the | decimal | float | | | decimal | double | | | decimal | decimal | | -| decimal | string | There can be formatting differences in some case due to Spark using scientific notation where Comet does not | +| decimal | string | Formatting differences possible | Review Comment: Not sure if this intended ? ########## docs/source/user-guide/latest/compatibility.md: ########## @@ -20,70 +20,105 @@ under the License. # Compatibility Guide Comet aims to provide consistent results with the version of Apache Spark that is being used. +This guide documents known differences and limitations. -This guide offers information about areas of functionality where there are known differences. +--- ## Parquet Comet has the following limitations when reading Parquet files: -- Comet does not support reading decimals encoded in binary format. -- No support for default values that are nested types (e.g., maps, arrays, structs). Literal default values are supported. +- Decimal values encoded using binary format are not supported. +- Default values for nested types (e.g., maps, arrays, structs) are not supported. + Literal default values are supported. + +--- ## ANSI Mode -Comet will fall back to Spark for the following expressions when ANSI mode is enabled. These expressions can be enabled by setting -`spark.comet.expression.EXPRNAME.allowIncompatible=true`, where `EXPRNAME` is the Spark expression class name. See -the [Comet Supported Expressions Guide](expressions.md) for more information on this configuration setting. +When ANSI mode is enabled, Comet falls back to Spark for some expressions to ensure correctness. + +Affected expressions include: + +- `Average` +- `Cast` (in some cases) + +Incompatible expressions can be forced to run natively using: + +```text +spark.comet.expression.EXPRNAME.allowIncompatible=true +``` + +This is not recommended for production use. -- Average -- Cast (in some cases) +Tracking issue for full ANSI support: +https://github.com/apache/datafusion-comet/issues/313 -There is an [epic](https://github.com/apache/datafusion-comet/issues/313) where we are tracking the work to fully implement ANSI support. +--- ## Floating-point Number Comparison -Spark normalizes NaN and zero for floating point numbers for several cases. See `NormalizeFloatingNumbers` optimization rule in Spark. -However, one exception is comparison. Spark does not normalize NaN and zero when comparing values -because they are handled well in Spark (e.g., `SQLOrderingUtil.compareFloats`). But the comparison -functions of arrow-rs used by DataFusion do not normalize NaN and zero (e.g., [arrow::compute::kernels::cmp::eq](https://docs.rs/arrow/latest/arrow/compute/kernels/cmp/fn.eq.html#)). -So Comet adds additional normalization expression of NaN and zero for comparisons, and may still have differences -to Spark in some cases, especially when the data contains both positive and negative zero. This is likely an edge -case that is not of concern for many users. If it is a concern, setting `spark.comet.exec.strictFloatingPoint=true` -will make relevant operations fall back to Spark. +Spark normalizes NaN and signed zero in most cases, except during comparisons. +DataFusion (via Arrow) does not apply the same normalization. 
+ +Comet adds additional normalization logic but may still differ in rare edge cases, +especially when both positive and negative zero are present. + +To force Spark execution for strict correctness: + +```text +spark.comet.exec.strictFloatingPoint=true +``` + +--- ## Incompatible Expressions -Expressions that are not 100% Spark-compatible will fall back to Spark by default and can be enabled by setting -`spark.comet.expression.EXPRNAME.allowIncompatible=true`, where `EXPRNAME` is the Spark expression class name. See -the [Comet Supported Expressions Guide](expressions.md) for more information on this configuration setting. +Expressions that are not fully Spark-compatible fall back to Spark by default. +They may be enabled using: + +```text +spark.comet.expression.EXPRNAME.allowIncompatible=true +``` + +See the [Comet Supported Expressions Guide](expressions.md) for details. + +--- ## Regular Expressions -Comet uses the Rust regexp crate for evaluating regular expressions, and this has different behavior from Java's -regular expression engine. Comet will fall back to Spark for patterns that are known to produce different results, but -this can be overridden by setting `spark.comet.regexp.allowIncompatible=true`. +Comet uses Rust's regex crate, which differs from Java's regex engine used by Spark. +Patterns known to produce different results will fall back to Spark. + +This behavior can be overridden using: + +```text +spark.comet.regexp.allowIncompatible=true +``` + +--- ## Window Functions -Comet's support for window functions is incomplete and known to be incorrect. It is disabled by default and -should not be used in production. The feature will be enabled in a future release. Tracking issue: [#2721](https://github.com/apache/datafusion-comet/issues/2721). +Window functions are disabled by default. +Support is incomplete and may produce incorrect results. + +Tracking issue: +https://github.com/apache/datafusion-comet/issues/2721 + Review Comment: Not sure if this intended ? ########## docs/source/user-guide/latest/compatibility.md: ########## @@ -20,70 +20,105 @@ under the License. # Compatibility Guide Comet aims to provide consistent results with the version of Apache Spark that is being used. +This guide documents known differences and limitations. -This guide offers information about areas of functionality where there are known differences. +--- ## Parquet Comet has the following limitations when reading Parquet files: -- Comet does not support reading decimals encoded in binary format. -- No support for default values that are nested types (e.g., maps, arrays, structs). Literal default values are supported. +- Decimal values encoded using binary format are not supported. +- Default values for nested types (e.g., maps, arrays, structs) are not supported. Review Comment: We might want to keep the changes limited to the issue being addressed IMHO ########## docs/source/user-guide/latest/compatibility.md: ########## @@ -20,70 +20,105 @@ under the License. # Compatibility Guide Comet aims to provide consistent results with the version of Apache Spark that is being used. +This guide documents known differences and limitations. -This guide offers information about areas of functionality where there are known differences. +--- ## Parquet Comet has the following limitations when reading Parquet files: -- Comet does not support reading decimals encoded in binary format. -- No support for default values that are nested types (e.g., maps, arrays, structs). Literal default values are supported. 
+- Decimal values encoded using binary format are not supported. +- Default values for nested types (e.g., maps, arrays, structs) are not supported. + Literal default values are supported. + +--- ## ANSI Mode -Comet will fall back to Spark for the following expressions when ANSI mode is enabled. These expressions can be enabled by setting -`spark.comet.expression.EXPRNAME.allowIncompatible=true`, where `EXPRNAME` is the Spark expression class name. See -the [Comet Supported Expressions Guide](expressions.md) for more information on this configuration setting. +When ANSI mode is enabled, Comet falls back to Spark for some expressions to ensure correctness. + +Affected expressions include: + +- `Average` +- `Cast` (in some cases) + +Incompatible expressions can be forced to run natively using: + +```text +spark.comet.expression.EXPRNAME.allowIncompatible=true +``` + +This is not recommended for production use. -- Average -- Cast (in some cases) +Tracking issue for full ANSI support: +https://github.com/apache/datafusion-comet/issues/313 -There is an [epic](https://github.com/apache/datafusion-comet/issues/313) where we are tracking the work to fully implement ANSI support. +--- ## Floating-point Number Comparison -Spark normalizes NaN and zero for floating point numbers for several cases. See `NormalizeFloatingNumbers` optimization rule in Spark. -However, one exception is comparison. Spark does not normalize NaN and zero when comparing values -because they are handled well in Spark (e.g., `SQLOrderingUtil.compareFloats`). But the comparison -functions of arrow-rs used by DataFusion do not normalize NaN and zero (e.g., [arrow::compute::kernels::cmp::eq](https://docs.rs/arrow/latest/arrow/compute/kernels/cmp/fn.eq.html#)). -So Comet adds additional normalization expression of NaN and zero for comparisons, and may still have differences -to Spark in some cases, especially when the data contains both positive and negative zero. This is likely an edge -case that is not of concern for many users. If it is a concern, setting `spark.comet.exec.strictFloatingPoint=true` -will make relevant operations fall back to Spark. +Spark normalizes NaN and signed zero in most cases, except during comparisons. +DataFusion (via Arrow) does not apply the same normalization. + +Comet adds additional normalization logic but may still differ in rare edge cases, +especially when both positive and negative zero are present. + +To force Spark execution for strict correctness: + +```text +spark.comet.exec.strictFloatingPoint=true +``` + +--- ## Incompatible Expressions -Expressions that are not 100% Spark-compatible will fall back to Spark by default and can be enabled by setting -`spark.comet.expression.EXPRNAME.allowIncompatible=true`, where `EXPRNAME` is the Spark expression class name. See -the [Comet Supported Expressions Guide](expressions.md) for more information on this configuration setting. +Expressions that are not fully Spark-compatible fall back to Spark by default. +They may be enabled using: + +```text +spark.comet.expression.EXPRNAME.allowIncompatible=true +``` + +See the [Comet Supported Expressions Guide](expressions.md) for details. + +--- ## Regular Expressions -Comet uses the Rust regexp crate for evaluating regular expressions, and this has different behavior from Java's -regular expression engine. Comet will fall back to Spark for patterns that are known to produce different results, but -this can be overridden by setting `spark.comet.regexp.allowIncompatible=true`. 
+Comet uses Rust's regex crate, which differs from Java's regex engine used by Spark. +Patterns known to produce different results will fall back to Spark. + +This behavior can be overridden using: + +```text +spark.comet.regexp.allowIncompatible=true +``` + +--- ## Window Functions -Comet's support for window functions is incomplete and known to be incorrect. It is disabled by default and -should not be used in production. The feature will be enabled in a future release. Tracking issue: [#2721](https://github.com/apache/datafusion-comet/issues/2721). +Window functions are disabled by default. +Support is incomplete and may produce incorrect results. + +Tracking issue: +https://github.com/apache/datafusion-comet/issues/2721 + +--- ## Cast -Cast operations in Comet fall into three levels of support: +Cast operations are classified into three categories: -- **Compatible**: The results match Apache Spark -- **Incompatible**: The results may match Apache Spark for some inputs, but there are known issues where some inputs - will result in incorrect results or exceptions. The query stage will fall back to Spark by default. Setting - `spark.comet.expression.Cast.allowIncompatible=true` will allow all incompatible casts to run natively in Comet, but this is not - recommended for production use. -- **Unsupported**: Comet does not provide a native version of this cast expression and the query stage will fall back to - Spark. +- **Compatible** – Results match Spark +- **Incompatible** – May differ for some inputs; fallback by default +- **Unsupported** – Always falls back to Spark Review Comment: Is there a reason we removed the additional comet conf here ? ########## docs/source/user-guide/latest/compatibility.md: ########## @@ -20,70 +20,105 @@ under the License. # Compatibility Guide Comet aims to provide consistent results with the version of Apache Spark that is being used. +This guide documents known differences and limitations. -This guide offers information about areas of functionality where there are known differences. +--- ## Parquet Comet has the following limitations when reading Parquet files: -- Comet does not support reading decimals encoded in binary format. -- No support for default values that are nested types (e.g., maps, arrays, structs). Literal default values are supported. +- Decimal values encoded using binary format are not supported. +- Default values for nested types (e.g., maps, arrays, structs) are not supported. + Literal default values are supported. + +--- ## ANSI Mode -Comet will fall back to Spark for the following expressions when ANSI mode is enabled. These expressions can be enabled by setting -`spark.comet.expression.EXPRNAME.allowIncompatible=true`, where `EXPRNAME` is the Spark expression class name. See -the [Comet Supported Expressions Guide](expressions.md) for more information on this configuration setting. +When ANSI mode is enabled, Comet falls back to Spark for some expressions to ensure correctness. + +Affected expressions include: + +- `Average` +- `Cast` (in some cases) + +Incompatible expressions can be forced to run natively using: + +```text +spark.comet.expression.EXPRNAME.allowIncompatible=true +``` + +This is not recommended for production use. -- Average -- Cast (in some cases) +Tracking issue for full ANSI support: +https://github.com/apache/datafusion-comet/issues/313 -There is an [epic](https://github.com/apache/datafusion-comet/issues/313) where we are tracking the work to fully implement ANSI support. 
+--- ## Floating-point Number Comparison -Spark normalizes NaN and zero for floating point numbers for several cases. See `NormalizeFloatingNumbers` optimization rule in Spark. -However, one exception is comparison. Spark does not normalize NaN and zero when comparing values -because they are handled well in Spark (e.g., `SQLOrderingUtil.compareFloats`). But the comparison -functions of arrow-rs used by DataFusion do not normalize NaN and zero (e.g., [arrow::compute::kernels::cmp::eq](https://docs.rs/arrow/latest/arrow/compute/kernels/cmp/fn.eq.html#)). -So Comet adds additional normalization expression of NaN and zero for comparisons, and may still have differences -to Spark in some cases, especially when the data contains both positive and negative zero. This is likely an edge -case that is not of concern for many users. If it is a concern, setting `spark.comet.exec.strictFloatingPoint=true` -will make relevant operations fall back to Spark. +Spark normalizes NaN and signed zero in most cases, except during comparisons. +DataFusion (via Arrow) does not apply the same normalization. + +Comet adds additional normalization logic but may still differ in rare edge cases, +especially when both positive and negative zero are present. + +To force Spark execution for strict correctness: + +```text +spark.comet.exec.strictFloatingPoint=true +``` + +--- ## Incompatible Expressions -Expressions that are not 100% Spark-compatible will fall back to Spark by default and can be enabled by setting -`spark.comet.expression.EXPRNAME.allowIncompatible=true`, where `EXPRNAME` is the Spark expression class name. See -the [Comet Supported Expressions Guide](expressions.md) for more information on this configuration setting. +Expressions that are not fully Spark-compatible fall back to Spark by default. +They may be enabled using: + +```text +spark.comet.expression.EXPRNAME.allowIncompatible=true Review Comment: Not sure if this intended ? ########## docs/source/user-guide/latest/compatibility.md: ########## @@ -20,70 +20,105 @@ under the License. # Compatibility Guide Comet aims to provide consistent results with the version of Apache Spark that is being used. +This guide documents known differences and limitations. -This guide offers information about areas of functionality where there are known differences. +--- ## Parquet Comet has the following limitations when reading Parquet files: -- Comet does not support reading decimals encoded in binary format. -- No support for default values that are nested types (e.g., maps, arrays, structs). Literal default values are supported. +- Decimal values encoded using binary format are not supported. +- Default values for nested types (e.g., maps, arrays, structs) are not supported. + Literal default values are supported. + +--- ## ANSI Mode -Comet will fall back to Spark for the following expressions when ANSI mode is enabled. These expressions can be enabled by setting -`spark.comet.expression.EXPRNAME.allowIncompatible=true`, where `EXPRNAME` is the Spark expression class name. See -the [Comet Supported Expressions Guide](expressions.md) for more information on this configuration setting. +When ANSI mode is enabled, Comet falls back to Spark for some expressions to ensure correctness. + +Affected expressions include: + +- `Average` +- `Cast` (in some cases) + +Incompatible expressions can be forced to run natively using: + +```text +spark.comet.expression.EXPRNAME.allowIncompatible=true +``` + +This is not recommended for production use. 
-- Average -- Cast (in some cases) +Tracking issue for full ANSI support: +https://github.com/apache/datafusion-comet/issues/313 -There is an [epic](https://github.com/apache/datafusion-comet/issues/313) where we are tracking the work to fully implement ANSI support. +--- ## Floating-point Number Comparison -Spark normalizes NaN and zero for floating point numbers for several cases. See `NormalizeFloatingNumbers` optimization rule in Spark. -However, one exception is comparison. Spark does not normalize NaN and zero when comparing values -because they are handled well in Spark (e.g., `SQLOrderingUtil.compareFloats`). But the comparison -functions of arrow-rs used by DataFusion do not normalize NaN and zero (e.g., [arrow::compute::kernels::cmp::eq](https://docs.rs/arrow/latest/arrow/compute/kernels/cmp/fn.eq.html#)). -So Comet adds additional normalization expression of NaN and zero for comparisons, and may still have differences -to Spark in some cases, especially when the data contains both positive and negative zero. This is likely an edge -case that is not of concern for many users. If it is a concern, setting `spark.comet.exec.strictFloatingPoint=true` -will make relevant operations fall back to Spark. +Spark normalizes NaN and signed zero in most cases, except during comparisons. +DataFusion (via Arrow) does not apply the same normalization. + +Comet adds additional normalization logic but may still differ in rare edge cases, +especially when both positive and negative zero are present. + +To force Spark execution for strict correctness: + +```text Review Comment: Not sure if this intended ? ########## docs/source/user-guide/latest/compatibility.md: ########## @@ -20,70 +20,105 @@ under the License. # Compatibility Guide Comet aims to provide consistent results with the version of Apache Spark that is being used. +This guide documents known differences and limitations. -This guide offers information about areas of functionality where there are known differences. +--- ## Parquet Comet has the following limitations when reading Parquet files: -- Comet does not support reading decimals encoded in binary format. -- No support for default values that are nested types (e.g., maps, arrays, structs). Literal default values are supported. +- Decimal values encoded using binary format are not supported. +- Default values for nested types (e.g., maps, arrays, structs) are not supported. + Literal default values are supported. + +--- ## ANSI Mode -Comet will fall back to Spark for the following expressions when ANSI mode is enabled. These expressions can be enabled by setting -`spark.comet.expression.EXPRNAME.allowIncompatible=true`, where `EXPRNAME` is the Spark expression class name. See -the [Comet Supported Expressions Guide](expressions.md) for more information on this configuration setting. +When ANSI mode is enabled, Comet falls back to Spark for some expressions to ensure correctness. + +Affected expressions include: + +- `Average` +- `Cast` (in some cases) + +Incompatible expressions can be forced to run natively using: + +```text +spark.comet.expression.EXPRNAME.allowIncompatible=true +``` + +This is not recommended for production use. -- Average -- Cast (in some cases) +Tracking issue for full ANSI support: +https://github.com/apache/datafusion-comet/issues/313 -There is an [epic](https://github.com/apache/datafusion-comet/issues/313) where we are tracking the work to fully implement ANSI support. 
+--- ## Floating-point Number Comparison -Spark normalizes NaN and zero for floating point numbers for several cases. See `NormalizeFloatingNumbers` optimization rule in Spark. -However, one exception is comparison. Spark does not normalize NaN and zero when comparing values -because they are handled well in Spark (e.g., `SQLOrderingUtil.compareFloats`). But the comparison -functions of arrow-rs used by DataFusion do not normalize NaN and zero (e.g., [arrow::compute::kernels::cmp::eq](https://docs.rs/arrow/latest/arrow/compute/kernels/cmp/fn.eq.html#)). -So Comet adds additional normalization expression of NaN and zero for comparisons, and may still have differences -to Spark in some cases, especially when the data contains both positive and negative zero. This is likely an edge -case that is not of concern for many users. If it is a concern, setting `spark.comet.exec.strictFloatingPoint=true` -will make relevant operations fall back to Spark. +Spark normalizes NaN and signed zero in most cases, except during comparisons. +DataFusion (via Arrow) does not apply the same normalization. + +Comet adds additional normalization logic but may still differ in rare edge cases, +especially when both positive and negative zero are present. + +To force Spark execution for strict correctness: + +```text +spark.comet.exec.strictFloatingPoint=true +``` + +--- ## Incompatible Expressions -Expressions that are not 100% Spark-compatible will fall back to Spark by default and can be enabled by setting -`spark.comet.expression.EXPRNAME.allowIncompatible=true`, where `EXPRNAME` is the Spark expression class name. See -the [Comet Supported Expressions Guide](expressions.md) for more information on this configuration setting. +Expressions that are not fully Spark-compatible fall back to Spark by default. +They may be enabled using: + +```text +spark.comet.expression.EXPRNAME.allowIncompatible=true +``` + +See the [Comet Supported Expressions Guide](expressions.md) for details. + +--- ## Regular Expressions -Comet uses the Rust regexp crate for evaluating regular expressions, and this has different behavior from Java's -regular expression engine. Comet will fall back to Spark for patterns that are known to produce different results, but -this can be overridden by setting `spark.comet.regexp.allowIncompatible=true`. +Comet uses Rust's regex crate, which differs from Java's regex engine used by Spark. +Patterns known to produce different results will fall back to Spark. + Review Comment: Not sure if this intended ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
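As a closing illustration of the regexp fallback discussed in the last quoted hunk, a minimal sketch (not part of the review) of opting in to Comet's native regular-expression handling via the config name quoted in the doc; `spark` is assumed to be an existing SparkSession with Comet enabled, and the override is not recommended for production use:

```scala
// Minimal sketch: allow Comet to evaluate regexps natively with Rust's regex crate,
// accepting known differences from Java's regex engine (the Rust crate, for example,
// does not support backreferences or look-around).
spark.conf.set("spark.comet.regexp.allowIncompatible", "true")
spark.sql("SELECT 'abc123' RLIKE '[a-z]+[0-9]+' AS matched").show()
```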
