comphead commented on code in PR #2868:
URL: https://github.com/apache/datafusion-comet/pull/2868#discussion_r2604284790

##########
docs/source/user-guide/latest/iceberg.md:
##########
@@ -140,7 +145,50 @@ scala> spark.sql(s"SELECT * from t1").explain()
 +- CometBatchScan spark_catalog.default.t1[c0#26, c1#27] spark_catalog.default.t1 (branch=null) [filters=, groupedBy=] RuntimeFilters: []
 ```
 
-## Known issues
+### Known issues
 
 - Spark Runtime Filtering isn't [working](https://github.com/apache/datafusion-comet/issues/2116)
   - You can bypass the issue by either setting `spark.sql.adaptive.enabled=false` or `spark.comet.exec.broadcastExchange.enabled=false`
+
+## Fully-Native Execution
+
+Comet's fully-native Iceberg integration does not require modifying Iceberg source
+code. Instead, Comet relies on reflection to extract `FileScanTask`s from Iceberg, which are
+then serialized to Comet's native execution engine (see
+[PR #2528](https://github.com/apache/datafusion-comet/pull/2528)).
+
+The example below uses Spark's package downloader to retrieve Comet 0.12.0 and Iceberg
+1.8.1, but Comet has been tested with Iceberg 1.5, 1.7, 1.8, and 1.10. The key configuration
+to enable fully-native Iceberg is `spark.comet.scan.icebergNative.enabled=true`. This
+configuration should **not** be used with the hybrid Iceberg configuration
+`spark.sql.iceberg.parquet.reader-type=COMET` from above.
+
+```shell
+$SPARK_HOME/bin/spark-shell \
+    --packages org.apache.datafusion:comet-spark-spark3.5_2.12:0.12.0,org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.8.1,org.apache.iceberg:iceberg-core:1.8.1 \
+    --repositories https://repo1.maven.org/maven2/ \
+    --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
+    --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkCatalog \
+    --conf spark.sql.catalog.spark_catalog.type=hadoop \
+    --conf spark.sql.catalog.spark_catalog.warehouse=/tmp/warehouse \
+    --conf spark.plugins=org.apache.spark.CometPlugin \
+    --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
+    --conf spark.sql.extensions=org.apache.comet.CometSparkSessionExtensions \
+    --conf spark.comet.scan.icebergNative.enabled=true \
+    --conf spark.comet.explainFallback.enabled=true \
+    --conf spark.memory.offHeap.enabled=true \
+    --conf spark.memory.offHeap.size=2g
+```
+
+The same sample queries from above can be used to test Comet's fully-native Iceberg integration,
+however the scan node to look for is `CometIcebergNativeScan`.
+
+### Current limitations
+
+- Iceberg table spec v3 scans will fall back.

Review Comment:
   Perhaps let's name it `not supported` or `work in progress`?
