This is an automated email from the ASF dual-hosted git repository.
agrove pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/datafusion-comet.git
The following commit(s) were added to refs/heads/main by this push:
new 4cfceb799 docs: Categorize some configs as `testing` and add notes about known time zone issues (#2740)
4cfceb799 is described below
commit 4cfceb799942a3a502ed8ff350cd3fda441c4e51
Author: Andy Grove <[email protected]>
AuthorDate: Mon Nov 10 11:32:49 2025 -0700
docs: Categorize some configs as `testing` and add notes about known time zone issues (#2740)
---
.../main/scala/org/apache/comet/CometConf.scala | 26 +++++++++++-----------
docs/source/user-guide/latest/configs.md | 10 ++++-----
2 files changed, 18 insertions(+), 18 deletions(-)
diff --git a/common/src/main/scala/org/apache/comet/CometConf.scala b/common/src/main/scala/org/apache/comet/CometConf.scala
index 496284ce6..da30d7029 100644
--- a/common/src/main/scala/org/apache/comet/CometConf.scala
+++ b/common/src/main/scala/org/apache/comet/CometConf.scala
@@ -181,31 +181,31 @@ object CometConf extends ShimCometConf {
   val COMET_CONVERT_FROM_PARQUET_ENABLED: ConfigEntry[Boolean] =
     conf("spark.comet.convert.parquet.enabled")
-      .category(CATEGORY_SCAN)
+      .category(CATEGORY_TESTING)
       .doc(
         "When enabled, data from Spark (non-native) Parquet v1 and v2 scans will be converted to " +
-          "Arrow format. Note that to enable native vectorized execution, both this config and " +
-          "`spark.comet.exec.enabled` need to be enabled.")
+          "Arrow format. This is an experimental feature and has known issues with " +
+          "non-UTC timezones.")
       .booleanConf
       .createWithDefault(false)
 
   val COMET_CONVERT_FROM_JSON_ENABLED: ConfigEntry[Boolean] =
     conf("spark.comet.convert.json.enabled")
-      .category(CATEGORY_SCAN)
+      .category(CATEGORY_TESTING)
       .doc(
         "When enabled, data from Spark (non-native) JSON v1 and v2 scans will be converted to " +
-          "Arrow format. Note that to enable native vectorized execution, both this config and " +
-          "`spark.comet.exec.enabled` need to be enabled.")
+          "Arrow format. This is an experimental feature and has known issues with " +
+          "non-UTC timezones.")
       .booleanConf
       .createWithDefault(false)
 
   val COMET_CONVERT_FROM_CSV_ENABLED: ConfigEntry[Boolean] =
     conf("spark.comet.convert.csv.enabled")
-      .category(CATEGORY_SCAN)
+      .category(CATEGORY_TESTING)
       .doc(
         "When enabled, data from Spark (non-native) CSV v1 and v2 scans will be converted to " +
-          "Arrow format. Note that to enable native vectorized execution, both this config and " +
-          "`spark.comet.exec.enabled` need to be enabled.")
+          "Arrow format. This is an experimental feature and has known issues with " +
+          "non-UTC timezones.")
       .booleanConf
       .createWithDefault(false)
@@ -633,19 +633,19 @@ object CometConf extends ShimCometConf {
   val COMET_SPARK_TO_ARROW_ENABLED: ConfigEntry[Boolean] =
     conf("spark.comet.sparkToColumnar.enabled")
-      .category(CATEGORY_SCAN)
+      .category(CATEGORY_TESTING)
       .doc("Whether to enable Spark to Arrow columnar conversion. When this is turned on, " +
         "Comet will convert operators in " +
         "`spark.comet.sparkToColumnar.supportedOperatorList` into Arrow columnar format before " +
-        "processing.")
+        "processing. This is an experimental feature and has known issues with non-UTC timezones.")
       .booleanConf
       .createWithDefault(false)
 
   val COMET_SPARK_TO_ARROW_SUPPORTED_OPERATOR_LIST: ConfigEntry[Seq[String]] =
     conf("spark.comet.sparkToColumnar.supportedOperatorList")
-      .category(CATEGORY_SCAN)
+      .category(CATEGORY_TESTING)
       .doc("A comma-separated list of operators that will be converted to Arrow columnar " +
-        "format when `spark.comet.sparkToColumnar.enabled` is true")
+        s"format when `${COMET_SPARK_TO_ARROW_ENABLED.key}` is true.")
       .stringConf
       .toSequence
       .createWithDefault(Seq("Range,InMemoryTableScan,RDDScan"))
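For readers who want to try the `spark.comet.convert.*` settings changed above, they are ordinary boolean Spark configs. The following is a minimal, illustrative sketch, not part of this commit: it assumes Comet is already on the classpath and uses the standard Comet plugin class, while the application name, master, and Parquet path are placeholders.

import org.apache.spark.sql.SparkSession

object ConvertFromParquetSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("comet-convert-parquet-sketch") // placeholder name
      .master("local[*]") // placeholder master
      // Assumes the standard Comet plugin class; adjust to your deployment.
      .config("spark.plugins", "org.apache.spark.CometPlugin")
      .config("spark.comet.exec.enabled", "true")
      // Testing-only config per this commit: converts non-native Parquet scans
      // to Arrow. Known issues with non-UTC timezones.
      .config("spark.comet.convert.parquet.enabled", "true")
      .getOrCreate()

    // Placeholder path; any Parquet data read by a Spark (non-native) scan works.
    spark.read.parquet("/tmp/example.parquet").show()

    spark.stop()
  }
}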
diff --git a/docs/source/user-guide/latest/configs.md b/docs/source/user-guide/latest/configs.md
index fd232874e..0a12abadc 100644
--- a/docs/source/user-guide/latest/configs.md
+++ b/docs/source/user-guide/latest/configs.md
@@ -27,15 +27,10 @@ Comet provides the following configuration settings.
 <!--BEGIN:CONFIG_TABLE[scan]-->
 | Config | Description | Default Value |
 |--------|-------------|---------------|
-| `spark.comet.convert.csv.enabled` | When enabled, data from Spark (non-native) CSV v1 and v2 scans will be converted to Arrow format. Note that to enable native vectorized execution, both this config and `spark.comet.exec.enabled` need to be enabled. | false |
-| `spark.comet.convert.json.enabled` | When enabled, data from Spark (non-native) JSON v1 and v2 scans will be converted to Arrow format. Note that to enable native vectorized execution, both this config and `spark.comet.exec.enabled` need to be enabled. | false |
-| `spark.comet.convert.parquet.enabled` | When enabled, data from Spark (non-native) Parquet v1 and v2 scans will be converted to Arrow format. Note that to enable native vectorized execution, both this config and `spark.comet.exec.enabled` need to be enabled. | false |
 | `spark.comet.scan.allowIncompatible` | Some Comet scan implementations are not currently fully compatible with Spark for all datatypes. Set this config to true to allow them anyway. For more information, refer to the [Comet Compatibility Guide](https://datafusion.apache.org/comet/user-guide/compatibility.html). | false |
 | `spark.comet.scan.enabled` | Whether to enable native scans. When this is turned on, Spark will use Comet to read supported data sources (currently only Parquet is supported natively). Note that to enable native vectorized execution, both this config and `spark.comet.exec.enabled` need to be enabled. | true |
 | `spark.comet.scan.preFetch.enabled` | Whether to enable pre-fetching feature of CometScan. | false |
 | `spark.comet.scan.preFetch.threadNum` | The number of threads running pre-fetching for CometScan. Effective if spark.comet.scan.preFetch.enabled is enabled. Note that more pre-fetching threads means more memory requirement to store pre-fetched row groups. | 2 |
-| `spark.comet.sparkToColumnar.enabled` | Whether to enable Spark to Arrow columnar conversion. When this is turned on, Comet will convert operators in `spark.comet.sparkToColumnar.supportedOperatorList` into Arrow columnar format before processing. | false |
-| `spark.comet.sparkToColumnar.supportedOperatorList` | A comma-separated list of operators that will be converted to Arrow columnar format when `spark.comet.sparkToColumnar.enabled` is true | Range,InMemoryTableScan,RDDScan |
 | `spark.hadoop.fs.comet.libhdfs.schemes` | Defines filesystem schemes (e.g., hdfs, webhdfs) that the native side accesses via libhdfs, separated by commas. Valid only when built with hdfs feature enabled. | |
 <!--END:CONFIG_TABLE-->
@@ -127,9 +122,14 @@ These settings can be used to determine which parts of the plan are accelerated
 | Config | Description | Default Value |
 |--------|-------------|---------------|
 | `spark.comet.columnar.shuffle.memory.factor` | Fraction of Comet memory to be allocated per executor process for columnar shuffle when running in on-heap mode. For more information, refer to the [Comet Tuning Guide](https://datafusion.apache.org/comet/user-guide/tuning.html). | 1.0 |
+| `spark.comet.convert.csv.enabled` | When enabled, data from Spark (non-native) CSV v1 and v2 scans will be converted to Arrow format. This is an experimental feature and has known issues with non-UTC timezones. | false |
+| `spark.comet.convert.json.enabled` | When enabled, data from Spark (non-native) JSON v1 and v2 scans will be converted to Arrow format. This is an experimental feature and has known issues with non-UTC timezones. | false |
+| `spark.comet.convert.parquet.enabled` | When enabled, data from Spark (non-native) Parquet v1 and v2 scans will be converted to Arrow format. This is an experimental feature and has known issues with non-UTC timezones. | false |
 | `spark.comet.exec.onHeap.enabled` | Whether to allow Comet to run in on-heap mode. Required for running Spark SQL tests. Can be overridden by environment variable `ENABLE_COMET_ONHEAP`. | false |
 | `spark.comet.exec.onHeap.memoryPool` | The type of memory pool to be used for Comet native execution when running Spark in on-heap mode. Available pool types are `greedy`, `fair_spill`, `greedy_task_shared`, `fair_spill_task_shared`, `greedy_global`, `fair_spill_global`, and `unbounded`. | greedy_task_shared |
 | `spark.comet.memoryOverhead` | The amount of additional memory to be allocated per executor process for Comet, in MiB, when running Spark in on-heap mode. | 1024 MiB |
+| `spark.comet.sparkToColumnar.enabled` | Whether to enable Spark to Arrow columnar conversion. When this is turned on, Comet will convert operators in `spark.comet.sparkToColumnar.supportedOperatorList` into Arrow columnar format before processing. This is an experimental feature and has known issues with non-UTC timezones. | false |
+| `spark.comet.sparkToColumnar.supportedOperatorList` | A comma-separated list of operators that will be converted to Arrow columnar format when `spark.comet.sparkToColumnar.enabled` is true. | Range,InMemoryTableScan,RDDScan |
 | `spark.comet.testing.strict` | Experimental option to enable strict testing, which will fail tests that could be more comprehensive, such as checking for a specific fallback reason. Can be overridden by environment variable `ENABLE_COMET_STRICT_TESTING`. | false |
 <!--END:CONFIG_TABLE-->
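As a usage note on the `spark.comet.sparkToColumnar.*` entries now documented under the testing table above: both values are plain strings, so they can be supplied like any other Spark SQL configuration. A minimal, illustrative sketch follows, assuming an existing `SparkSession` named `spark`; depending on how Comet is initialized in your deployment, these may need to be set before the session is created rather than at runtime, and the narrowed operator list is purely an example.

// Experimental per this commit; known issues with non-UTC timezones.
spark.conf.set("spark.comet.sparkToColumnar.enabled", "true")
// Narrowing the default list (Range,InMemoryTableScan,RDDScan) is illustrative only.
spark.conf.set("spark.comet.sparkToColumnar.supportedOperatorList", "InMemoryTableScan,RDDScan")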
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]