Re: [PR] fix: [iceberg] Switch to OSS Spark and run Iceberg Spark tests in parallel [datafusion-comet]
andygrove merged PR #1987: URL: https://github.com/apache/datafusion-comet/pull/1987 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] fix: [iceberg] Switch to OSS Spark and run Iceberg Spark tests in parallel [datafusion-comet]
hsiang-c commented on code in PR #1987:
URL: https://github.com/apache/datafusion-comet/pull/1987#discussion_r2286152103
##
dev/diffs/iceberg/1.8.1.diff:
##
@@ -1,186 +1,721 @@
+diff --git a/build.gradle b/build.gradle
+index 7327b38..7967109 100644
+--- a/build.gradle
b/build.gradle
+@@ -780,6 +780,13 @@ project(':iceberg-parquet') {
+ implementation project(':iceberg-core')
+ implementation project(':iceberg-common')
+
++
implementation("org.apache.datafusion:comet-spark-spark${sparkVersionsString}_${scalaVersion}:${libs.versions.comet.get()}")
{
++ exclude group: 'org.apache.arrow'
++ exclude group: 'org.apache.parquet'
++ exclude group: 'org.apache.spark'
++ exclude group: 'org.apache.iceberg'
++}
++
+ implementation(libs.parquet.avro) {
+ exclude group: 'org.apache.avro', module: 'avro'
+ // already shaded by Parquet
diff --git a/gradle/libs.versions.toml b/gradle/libs.versions.toml
-index 04ffa8f..d4107be 100644
+index 04ffa8f..cc0099c 100644
--- a/gradle/libs.versions.toml
+++ b/gradle/libs.versions.toml
-@@ -81,7 +81,7 @@ slf4j = "2.0.16"
+@@ -34,6 +34,7 @@ azuresdk-bom = "1.2.31"
+ awssdk-s3accessgrants = "2.3.0"
+ caffeine = "2.9.3"
+ calcite = "1.10.0"
++comet = "0.10.0-SNAPSHOT"
+ datasketches = "6.2.0"
+ delta-standalone = "3.3.0"
+ delta-spark = "3.3.0"
+@@ -81,7 +82,7 @@ slf4j = "2.0.16"
snowflake-jdbc = "3.22.0"
spark-hive33 = "3.3.4"
spark-hive34 = "3.4.4"
-spark-hive35 = "3.5.4"
-+spark-hive35 = "3.5.6-SNAPSHOT"
++spark-hive35 = "3.5.6"
sqlite-jdbc = "3.48.0.0"
testcontainers = "1.20.4"
tez010 = "0.10.4"
-diff --git a/spark/v3.4/build.gradle b/spark/v3.4/build.gradle
-index 6eb26e8..50cefce 100644
a/spark/v3.4/build.gradle
-+++ b/spark/v3.4/build.gradle
-@@ -75,7 +75,7 @@
project(":iceberg-spark:iceberg-spark-${sparkMajorVersion}_${scalaVersion}") {
- exclude group: 'org.roaringbitmap'
- }
-
--compileOnly
"org.apache.datafusion:comet-spark-spark${sparkMajorVersion}_${scalaVersion}:0.5.0"
-+compileOnly
"org.apache.datafusion:comet-spark-spark${sparkMajorVersion}_${scalaVersion}:0.10.0-SNAPSHOT"
-
- implementation libs.parquet.column
- implementation libs.parquet.hadoop
-@@ -185,7 +185,7 @@
project(":iceberg-spark:iceberg-spark-extensions-${sparkMajorVersion}_${scalaVer
- testImplementation libs.avro.avro
- testImplementation libs.parquet.hadoop
- testImplementation libs.junit.vintage.engine
--testImplementation
"org.apache.datafusion:comet-spark-spark${sparkMajorVersion}_${scalaVersion}:0.5.0"
-+testImplementation
"org.apache.datafusion:comet-spark-spark${sparkMajorVersion}_${scalaVersion}:0.10.0-SNAPSHOT"
-
- // Required because we remove antlr plugin dependencies from the compile
configuration, see note above
- runtimeOnly libs.antlr.runtime
-@@ -260,6 +260,8 @@
project(":iceberg-spark:iceberg-spark-runtime-${sparkMajorVersion}_${scalaVersio
- integrationImplementation project(path: ':iceberg-hive-metastore',
configuration: 'testArtifacts')
- integrationImplementation project(path:
":iceberg-spark:iceberg-spark-${sparkMajorVersion}_${scalaVersion}",
configuration: 'testArtifacts')
- integrationImplementation project(path:
":iceberg-spark:iceberg-spark-extensions-${sparkMajorVersion}_${scalaVersion}",
configuration: 'testArtifacts')
-+integrationImplementation project(path: ':iceberg-parquet')
-+integrationImplementation
"org.apache.datafusion:comet-spark-spark${sparkMajorVersion}_${scalaVersion}:0.10.0-SNAPSHOT"
-
- // runtime dependencies for running Hive Catalog based integration test
- integrationRuntimeOnly project(':iceberg-hive-metastore')
-@@ -297,8 +299,8 @@
project(":iceberg-spark:iceberg-spark-runtime-${sparkMajorVersion}_${scalaVersio
- relocate 'org.apache.avro', 'org.apache.iceberg.shaded.org.apache.avro'
- relocate 'avro.shaded', 'org.apache.iceberg.shaded.org.apache.avro.shaded'
- relocate 'com.thoughtworks.paranamer',
'org.apache.iceberg.shaded.com.thoughtworks.paranamer'
--relocate 'org.apache.parquet',
'org.apache.iceberg.shaded.org.apache.parquet'
--relocate 'shaded.parquet',
'org.apache.iceberg.shaded.org.apache.parquet.shaded'
-+//relocate 'org.apache.parquet',
'org.apache.iceberg.shaded.org.apache.parquet'
-+//relocate 'shaded.parquet',
'org.apache.iceberg.shaded.org.apache.parquet.shaded'
- relocate 'org.apache.orc', 'org.apache.iceberg.shaded.org.apache.orc'
- relocate 'io.airlift', 'org.apache.iceberg.shaded.io.airlift'
- relocate 'org.apache.hc.client5',
'org.apache.iceberg.shaded.org.apache.hc.client5'
-diff --git
a/spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/SparkSQLProperties.java
b/spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/SparkSQLProperties.java
-index 0ca1236..87daef4 100644
a/spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/SparkSQLProperties.java
-+++
b/spark/v3.4/spark
Re: [PR] fix: [iceberg] Switch to OSS Spark and run Iceberg Spark tests in parallel [datafusion-comet]
andygrove commented on code in PR #1987:
URL: https://github.com/apache/datafusion-comet/pull/1987#discussion_r2285513673
##
dev/diffs/iceberg/1.8.1.diff:
##
@@ -1,186 +1,721 @@
+diff --git a/build.gradle b/build.gradle
+index 7327b38..7967109 100644
+--- a/build.gradle
b/build.gradle
+@@ -780,6 +780,13 @@ project(':iceberg-parquet') {
+ implementation project(':iceberg-core')
+ implementation project(':iceberg-common')
+
++
implementation("org.apache.datafusion:comet-spark-spark${sparkVersionsString}_${scalaVersion}:${libs.versions.comet.get()}")
{
++ exclude group: 'org.apache.arrow'
++ exclude group: 'org.apache.parquet'
++ exclude group: 'org.apache.spark'
++ exclude group: 'org.apache.iceberg'
++}
++
+ implementation(libs.parquet.avro) {
+ exclude group: 'org.apache.avro', module: 'avro'
+ // already shaded by Parquet
diff --git a/gradle/libs.versions.toml b/gradle/libs.versions.toml
-index 04ffa8f..d4107be 100644
+index 04ffa8f..cc0099c 100644
--- a/gradle/libs.versions.toml
+++ b/gradle/libs.versions.toml
-@@ -81,7 +81,7 @@ slf4j = "2.0.16"
+@@ -34,6 +34,7 @@ azuresdk-bom = "1.2.31"
+ awssdk-s3accessgrants = "2.3.0"
+ caffeine = "2.9.3"
+ calcite = "1.10.0"
++comet = "0.10.0-SNAPSHOT"
+ datasketches = "6.2.0"
+ delta-standalone = "3.3.0"
+ delta-spark = "3.3.0"
+@@ -81,7 +82,7 @@ slf4j = "2.0.16"
snowflake-jdbc = "3.22.0"
spark-hive33 = "3.3.4"
spark-hive34 = "3.4.4"
-spark-hive35 = "3.5.4"
-+spark-hive35 = "3.5.6-SNAPSHOT"
++spark-hive35 = "3.5.6"
sqlite-jdbc = "3.48.0.0"
testcontainers = "1.20.4"
tez010 = "0.10.4"
-diff --git a/spark/v3.4/build.gradle b/spark/v3.4/build.gradle
-index 6eb26e8..50cefce 100644
a/spark/v3.4/build.gradle
-+++ b/spark/v3.4/build.gradle
-@@ -75,7 +75,7 @@
project(":iceberg-spark:iceberg-spark-${sparkMajorVersion}_${scalaVersion}") {
- exclude group: 'org.roaringbitmap'
- }
-
--compileOnly
"org.apache.datafusion:comet-spark-spark${sparkMajorVersion}_${scalaVersion}:0.5.0"
-+compileOnly
"org.apache.datafusion:comet-spark-spark${sparkMajorVersion}_${scalaVersion}:0.10.0-SNAPSHOT"
-
- implementation libs.parquet.column
- implementation libs.parquet.hadoop
-@@ -185,7 +185,7 @@
project(":iceberg-spark:iceberg-spark-extensions-${sparkMajorVersion}_${scalaVer
- testImplementation libs.avro.avro
- testImplementation libs.parquet.hadoop
- testImplementation libs.junit.vintage.engine
--testImplementation
"org.apache.datafusion:comet-spark-spark${sparkMajorVersion}_${scalaVersion}:0.5.0"
-+testImplementation
"org.apache.datafusion:comet-spark-spark${sparkMajorVersion}_${scalaVersion}:0.10.0-SNAPSHOT"
-
- // Required because we remove antlr plugin dependencies from the compile
configuration, see note above
- runtimeOnly libs.antlr.runtime
-@@ -260,6 +260,8 @@
project(":iceberg-spark:iceberg-spark-runtime-${sparkMajorVersion}_${scalaVersio
- integrationImplementation project(path: ':iceberg-hive-metastore',
configuration: 'testArtifacts')
- integrationImplementation project(path:
":iceberg-spark:iceberg-spark-${sparkMajorVersion}_${scalaVersion}",
configuration: 'testArtifacts')
- integrationImplementation project(path:
":iceberg-spark:iceberg-spark-extensions-${sparkMajorVersion}_${scalaVersion}",
configuration: 'testArtifacts')
-+integrationImplementation project(path: ':iceberg-parquet')
-+integrationImplementation
"org.apache.datafusion:comet-spark-spark${sparkMajorVersion}_${scalaVersion}:0.10.0-SNAPSHOT"
-
- // runtime dependencies for running Hive Catalog based integration test
- integrationRuntimeOnly project(':iceberg-hive-metastore')
-@@ -297,8 +299,8 @@
project(":iceberg-spark:iceberg-spark-runtime-${sparkMajorVersion}_${scalaVersio
- relocate 'org.apache.avro', 'org.apache.iceberg.shaded.org.apache.avro'
- relocate 'avro.shaded', 'org.apache.iceberg.shaded.org.apache.avro.shaded'
- relocate 'com.thoughtworks.paranamer',
'org.apache.iceberg.shaded.com.thoughtworks.paranamer'
--relocate 'org.apache.parquet',
'org.apache.iceberg.shaded.org.apache.parquet'
--relocate 'shaded.parquet',
'org.apache.iceberg.shaded.org.apache.parquet.shaded'
-+//relocate 'org.apache.parquet',
'org.apache.iceberg.shaded.org.apache.parquet'
-+//relocate 'shaded.parquet',
'org.apache.iceberg.shaded.org.apache.parquet.shaded'
- relocate 'org.apache.orc', 'org.apache.iceberg.shaded.org.apache.orc'
- relocate 'io.airlift', 'org.apache.iceberg.shaded.io.airlift'
- relocate 'org.apache.hc.client5',
'org.apache.iceberg.shaded.org.apache.hc.client5'
-diff --git
a/spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/SparkSQLProperties.java
b/spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/SparkSQLProperties.java
-index 0ca1236..87daef4 100644
a/spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/SparkSQLProperties.java
-+++
b/spark/v3.4/spar
Re: [PR] fix: [iceberg] Switch to OSS Spark and run Iceberg Spark tests in parallel [datafusion-comet]
andygrove commented on code in PR #1987:
URL: https://github.com/apache/datafusion-comet/pull/1987#discussion_r2260902271
##
.github/workflows/iceberg_spark_test.yml:
##
@@ -63,24 +63,96 @@ jobs:
with:
rust-version: ${{env.RUST_VERSION}}
jdk-version: ${{ matrix.java-version }}
+ - name: Build Comet
+shell: bash
+run: |
+ PROFILES="-Pspark-${{matrix.spark-version.short}}
-Pscala-${{matrix.scala-version}}" make release
- name: Setup Iceberg
uses: ./.github/actions/setup-iceberg-builder
with:
iceberg-version: ${{ matrix.iceberg-version.full }}
- scala-version: ${{ matrix.scala-version }}
- spark-short-version: ${{ matrix.spark-version.short }}
- - name: Build local Spark jar with comet patch
-uses: ./.github/actions/setup-spark-local-jar
-with:
- spark-short-version: ${{ matrix.spark-version.short }}
- spark-version: ${{ matrix.spark-version.full }}
- scala-version: ${{ matrix.scala-version }}
- name: Run Iceberg Spark tests
run: |
cd apache-iceberg
rm -rf /root/.m2/repository/org/apache/parquet # somehow parquet
cache requires cleanups
- ENABLE_COMET=true ./gradlew -DsparkVersions=${{
matrix.spark-version.short }} -DscalaVersion=${{ matrix.scala-version }}
-DflinkVersions= -DkafkaVersions= \
-:iceberg-spark:iceberg-spark-${{ matrix.spark-version.short }}_${{
matrix.scala-version }}:check \
-:iceberg-spark:iceberg-spark-extensions-${{
matrix.spark-version.short }}_${{ matrix.scala-version }}:check \
-:iceberg-spark:iceberg-spark-runtime-${{
matrix.spark-version.short }}_${{ matrix.scala-version }}:check \
+ ENABLE_COMET=true COMET_PARQUET_SCAN_IMPL=native_iceberg_compat
./gradlew -DsparkVersions=${{ matrix.spark-version.short }} -DscalaVersion=${{
matrix.scala-version }} -DflinkVersions= -DkafkaVersions= \
+:iceberg-spark:iceberg-spark-${{ matrix.spark-version.short }}_${{
matrix.scala-version }}:test \
-Pquick=true -x javadoc
+
+ iceberg-spark-extensions:
+if: contains(github.event.pull_request.title, '[iceberg]')
+strategy:
+ matrix:
+os: [ubuntu-24.04]
+java-version: [11, 17]
+iceberg-version: [{short: '1.8', full: '1.8.1'}]
+spark-version: [{short: '3.5', full: '3.5.6'}]
+scala-version: ['2.13']
+ fail-fast: false
+name: iceberg-spark-extensions/${{ matrix.os }}/iceberg-${{
matrix.iceberg-version.full }}/spark-${{ matrix.spark-version.full }}/scala-${{
matrix.scala-version }}/java-${{ matrix.java-version }}
+runs-on: ${{ matrix.os }}
+container:
+ image: amd64/rust
+env:
+ SPARK_LOCAL_IP: localhost
+steps:
+ - uses: actions/checkout@v4
+ - name: Setup Rust & Java toolchain
+uses: ./.github/actions/setup-builder
+with:
+ rust-version: ${{env.RUST_VERSION}}
+ jdk-version: ${{ matrix.java-version }}
+ - name: Build Comet
+shell: bash
+run: |
+ PROFILES="-Pspark-${{matrix.spark-version.short}}
-Pscala-${{matrix.scala-version}}" make release
+ - name: Setup Iceberg
+uses: ./.github/actions/setup-iceberg-builder
+with:
+ iceberg-version: ${{ matrix.iceberg-version.full }}
+ - name: Run Iceberg Spark extensions tests
+run: |
+ cd apache-iceberg
+ rm -rf /root/.m2/repository/org/apache/parquet # somehow parquet
cache requires cleanups
+ ENABLE_COMET=true COMET_PARQUET_SCAN_IMPL=native_iceberg_compat
./gradlew -DsparkVersions=${{ matrix.spark-version.short }} -DscalaVersion=${{
matrix.scala-version }} -DflinkVersions= -DkafkaVersions= \
+:iceberg-spark:iceberg-spark-extensions-${{
matrix.spark-version.short }}_${{ matrix.scala-version }}:test \
+-Pquick=true -x javadoc
+
+ iceberg-spark-runtime:
+if: contains(github.event.pull_request.title, '[iceberg]')
+strategy:
+ matrix:
+os: [ubuntu-24.04]
+java-version: [11, 17]
+iceberg-version: [{short: '1.8', full: '1.8.1'}]
+spark-version: [{short: '3.5', full: '3.5.6'}]
+scala-version: ['2.13']
+ fail-fast: false
+name: iceberg-spark-runtime/${{ matrix.os }}/iceberg-${{
matrix.iceberg-version.full }}/spark-${{ matrix.spark-version.full }}/scala-${{
matrix.scala-version }}/java-${{ matrix.java-version }}
+runs-on: ${{ matrix.os }}
+container:
+ image: amd64/rust
+env:
+ SPARK_LOCAL_IP: localhost
+steps:
+ - uses: actions/checkout@v4
+ - name: Setup Rust & Java toolchain
+uses: ./.github/actions/setup-builder
+with:
+ rust-version: ${{env.RUST_VERSION}}
+ jdk-version: ${{ matrix.java-version }}
+ - name: Build Comet
+shell: bash
+run: |
+ PROFILES
Re: [PR] fix: [iceberg] Switch to OSS Spark and run Iceberg Spark tests in parallel [datafusion-comet]
andygrove commented on code in PR #1987:
URL: https://github.com/apache/datafusion-comet/pull/1987#discussion_r2260901481
##
.github/workflows/iceberg_spark_test.yml:
##
@@ -63,24 +63,96 @@ jobs:
with:
rust-version: ${{env.RUST_VERSION}}
jdk-version: ${{ matrix.java-version }}
+ - name: Build Comet
+shell: bash
+run: |
+ PROFILES="-Pspark-${{matrix.spark-version.short}}
-Pscala-${{matrix.scala-version}}" make release
- name: Setup Iceberg
uses: ./.github/actions/setup-iceberg-builder
with:
iceberg-version: ${{ matrix.iceberg-version.full }}
- scala-version: ${{ matrix.scala-version }}
- spark-short-version: ${{ matrix.spark-version.short }}
- - name: Build local Spark jar with comet patch
-uses: ./.github/actions/setup-spark-local-jar
-with:
- spark-short-version: ${{ matrix.spark-version.short }}
- spark-version: ${{ matrix.spark-version.full }}
- scala-version: ${{ matrix.scala-version }}
- name: Run Iceberg Spark tests
run: |
cd apache-iceberg
rm -rf /root/.m2/repository/org/apache/parquet # somehow parquet
cache requires cleanups
- ENABLE_COMET=true ./gradlew -DsparkVersions=${{
matrix.spark-version.short }} -DscalaVersion=${{ matrix.scala-version }}
-DflinkVersions= -DkafkaVersions= \
-:iceberg-spark:iceberg-spark-${{ matrix.spark-version.short }}_${{
matrix.scala-version }}:check \
-:iceberg-spark:iceberg-spark-extensions-${{
matrix.spark-version.short }}_${{ matrix.scala-version }}:check \
-:iceberg-spark:iceberg-spark-runtime-${{
matrix.spark-version.short }}_${{ matrix.scala-version }}:check \
+ ENABLE_COMET=true COMET_PARQUET_SCAN_IMPL=native_iceberg_compat
./gradlew -DsparkVersions=${{ matrix.spark-version.short }} -DscalaVersion=${{
matrix.scala-version }} -DflinkVersions= -DkafkaVersions= \
+:iceberg-spark:iceberg-spark-${{ matrix.spark-version.short }}_${{
matrix.scala-version }}:test \
-Pquick=true -x javadoc
+
+ iceberg-spark-extensions:
+if: contains(github.event.pull_request.title, '[iceberg]')
+strategy:
+ matrix:
+os: [ubuntu-24.04]
+java-version: [11, 17]
+iceberg-version: [{short: '1.8', full: '1.8.1'}]
+spark-version: [{short: '3.5', full: '3.5.6'}]
+scala-version: ['2.13']
+ fail-fast: false
+name: iceberg-spark-extensions/${{ matrix.os }}/iceberg-${{
matrix.iceberg-version.full }}/spark-${{ matrix.spark-version.full }}/scala-${{
matrix.scala-version }}/java-${{ matrix.java-version }}
+runs-on: ${{ matrix.os }}
+container:
+ image: amd64/rust
+env:
+ SPARK_LOCAL_IP: localhost
+steps:
+ - uses: actions/checkout@v4
+ - name: Setup Rust & Java toolchain
+uses: ./.github/actions/setup-builder
+with:
+ rust-version: ${{env.RUST_VERSION}}
+ jdk-version: ${{ matrix.java-version }}
+ - name: Build Comet
+shell: bash
+run: |
+ PROFILES="-Pspark-${{matrix.spark-version.short}}
-Pscala-${{matrix.scala-version}}" make release
+ - name: Setup Iceberg
+uses: ./.github/actions/setup-iceberg-builder
+with:
+ iceberg-version: ${{ matrix.iceberg-version.full }}
+ - name: Run Iceberg Spark extensions tests
+run: |
+ cd apache-iceberg
+ rm -rf /root/.m2/repository/org/apache/parquet # somehow parquet
cache requires cleanups
+ ENABLE_COMET=true COMET_PARQUET_SCAN_IMPL=native_iceberg_compat
./gradlew -DsparkVersions=${{ matrix.spark-version.short }} -DscalaVersion=${{
matrix.scala-version }} -DflinkVersions= -DkafkaVersions= \
Review Comment:
Same commet as above.
```suggestion
ENABLE_COMET=true ./gradlew -DsparkVersions=${{
matrix.spark-version.short }} -DscalaVersion=${{ matrix.scala-version }}
-DflinkVersions= -DkafkaVersions= \
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] fix: [iceberg] Switch to OSS Spark and run Iceberg Spark tests in parallel [datafusion-comet]
andygrove commented on code in PR #1987:
URL: https://github.com/apache/datafusion-comet/pull/1987#discussion_r2260899713
##
.github/workflows/iceberg_spark_test.yml:
##
@@ -63,24 +63,96 @@ jobs:
with:
rust-version: ${{env.RUST_VERSION}}
jdk-version: ${{ matrix.java-version }}
+ - name: Build Comet
+shell: bash
+run: |
+ PROFILES="-Pspark-${{matrix.spark-version.short}}
-Pscala-${{matrix.scala-version}}" make release
- name: Setup Iceberg
uses: ./.github/actions/setup-iceberg-builder
with:
iceberg-version: ${{ matrix.iceberg-version.full }}
- scala-version: ${{ matrix.scala-version }}
- spark-short-version: ${{ matrix.spark-version.short }}
- - name: Build local Spark jar with comet patch
-uses: ./.github/actions/setup-spark-local-jar
-with:
- spark-short-version: ${{ matrix.spark-version.short }}
- spark-version: ${{ matrix.spark-version.full }}
- scala-version: ${{ matrix.scala-version }}
- name: Run Iceberg Spark tests
run: |
cd apache-iceberg
rm -rf /root/.m2/repository/org/apache/parquet # somehow parquet
cache requires cleanups
- ENABLE_COMET=true ./gradlew -DsparkVersions=${{
matrix.spark-version.short }} -DscalaVersion=${{ matrix.scala-version }}
-DflinkVersions= -DkafkaVersions= \
-:iceberg-spark:iceberg-spark-${{ matrix.spark-version.short }}_${{
matrix.scala-version }}:check \
-:iceberg-spark:iceberg-spark-extensions-${{
matrix.spark-version.short }}_${{ matrix.scala-version }}:check \
-:iceberg-spark:iceberg-spark-runtime-${{
matrix.spark-version.short }}_${{ matrix.scala-version }}:check \
+ ENABLE_COMET=true COMET_PARQUET_SCAN_IMPL=native_iceberg_compat
./gradlew -DsparkVersions=${{ matrix.spark-version.short }} -DscalaVersion=${{
matrix.scala-version }} -DflinkVersions= -DkafkaVersions= \
Review Comment:
I don't think that we want to set `COMET_PARQUET_SCAN_IMPL` here. The
Iceberg integration explicitly calls low level APIS for `native_comet` scan.
```suggestion
ENABLE_COMET=true ./gradlew -DsparkVersions=${{
matrix.spark-version.short }} -DscalaVersion=${{ matrix.scala-version }}
-DflinkVersions= -DkafkaVersions= \
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] fix: [iceberg] Switch to OSS Spark and run Iceberg Spark tests in parallel [datafusion-comet]
parthchandra commented on PR #1987: URL: https://github.com/apache/datafusion-comet/pull/1987#issuecomment-3082076029 @hsiang-c created https://github.com/apache/datafusion-comet/issues/2033 to track this issue -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] fix: [iceberg] Switch to OSS Spark and run Iceberg Spark tests in parallel [datafusion-comet]
hsiang-c commented on PR #1987: URL: https://github.com/apache/datafusion-comet/pull/1987#issuecomment-3075575929 Most of the exceptions in Iceberg Spark SQL Tests can be reproduced by 1. Follow the official guide to build Comet and Iceberg, configure Spark shell and populate the Iceberg table: https://datafusion.apache.org/comet/user-guide/iceberg.html 2. Query Iceberg metadata tables with an operator. Here is an example: ```sql -- default is the catalog name used in local HadoopCatalog setup scala> spark.sql(s"SELECT COUNT(*) from default.t1.snapshots").show() 25/07/15 13:06:16 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2) java.lang.ClassCastException: class org.apache.iceberg.spark.source.StructInternalRow cannot be cast to class org.apache.spark.sql.vectorized.ColumnarBatch (org.apache.iceberg.spark.source.StructInternalRow is in unnamed module of loader scala.reflect.internal.util.ScalaClassLoader$URLClassLoader @19ac93d2; org.apache.spark.sql.vectorized.ColumnarBatch is in unnamed module of loader 'app') at org.apache.spark.sql.comet.CometBatchScanExec$$anon$1.next(CometBatchScanExec.scala:68) at org.apache.spark.sql.comet.CometBatchScanExec$$anon$1.next(CometBatchScanExec.scala:57) at org.apache.comet.CometBatchIterator.hasNext(CometBatchIterator.java:51) at org.apache.comet.Native.executePlan(Native Method) at org.apache.comet.CometExecIterator.$anonfun$getNextBatch$2(CometExecIterator.scala:155) at org.apache.comet.CometExecIterator.$anonfun$getNextBatch$2$adapted(CometExecIterator.scala:154) at org.apache.comet.vector.NativeUtil.getNextBatch(NativeUtil.scala:157) at org.apache.comet.CometExecIterator.$anonfun$getNextBatch$1(CometExecIterator.scala:154) at org.apache.comet.Tracing$.withTrace(Tracing.scala:31) at org.apache.comet.CometExecIterator.getNextBatch(CometExecIterator.scala:152) at org.apache.comet.CometExecIterator.hasNext(CometExecIterator.scala:203) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) at org.apache.comet.CometBatchIterator.hasNext(CometBatchIterator.java:50) at org.apache.comet.Native.executePlan(Native Method) at org.apache.comet.CometExecIterator.$anonfun$getNextBatch$2(CometExecIterator.scala:155) at org.apache.comet.CometExecIterator.$anonfun$getNextBatch$2$adapted(CometExecIterator.scala:154) at org.apache.comet.vector.NativeUtil.getNextBatch(NativeUtil.scala:157) at org.apache.comet.CometExecIterator.$anonfun$getNextBatch$1(CometExecIterator.scala:154) at org.apache.comet.Tracing$.withTrace(Tracing.scala:31) at org.apache.comet.CometExecIterator.getNextBatch(CometExecIterator.scala:152) at org.apache.comet.CometExecIterator.hasNext(CometExecIterator.scala:203) at org.apache.spark.sql.comet.execution.shuffle.CometNativeShuffleWriter.write(CometNativeShuffleWriter.scala:106) at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54) at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166) at org.apache.spark.scheduler.Task.run(Task.scala:141) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620) at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64) at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) at java.base/java.lang.Thread.run(Thread.java:840) ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] fix: [iceberg] Switch to OSS Spark and run Iceberg Spark tests in parallel [datafusion-comet]
andygrove commented on PR #1987: URL: https://github.com/apache/datafusion-comet/pull/1987#issuecomment-3049800502 I see that some tests are failing due to https://github.com/apache/datafusion-comet/issues/1982 ``` org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1764.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1764.0 (TID 3312) (localhost executor driver): java.lang.ClassCastException: class org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast to class org.apache.spark.sql.vectorized.ColumnarBatch (org.apache.spark.sql.catalyst.expressions.GenericInternalRow and org.apache.spark.sql.vectorized.ColumnarBatch are in unnamed module of loader 'app') ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] fix: [iceberg] Switch to OSS Spark and run Iceberg Spark tests in parallel [datafusion-comet]
codecov-commenter commented on PR #1987: URL: https://github.com/apache/datafusion-comet/pull/1987#issuecomment-3033918578 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1987?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) Report All modified and coverable lines are covered by tests :white_check_mark: > Project coverage is 32.72%. Comparing base [(`f09f8af`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/f09f8af64c6599255e116a376f4f008f2fd63b43?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) to head [(`b574107`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/b574107723ec9e9c65fe641e39e55bd47ac4b515?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache). > Report is 309 commits behind head on main. Additional details and impacted files ```diff @@ Coverage Diff @@ ## main#1987 +/- ## = - Coverage 56.12% 32.72% -23.41% + Complexity 976 799 -177 = Files 119 133 +14 Lines 1174312951 +1208 Branches 2251 2411 +160 = - Hits 6591 4238 -2353 - Misses 4012 7759 +3747 + Partials 1140 954 -186 ``` [:umbrella: View full report in Codecov by Sentry](https://app.codecov.io/gh/apache/datafusion-comet/pull/1987?dropdown=coverage&src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache). :loudspeaker: Have feedback on the report? [Share it here](https://about.codecov.io/codecov-pr-comment-feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache). :rocket: New features to boost your workflow: - :snowflake: [Test Analytics](https://docs.codecov.com/docs/test-analytics): Detect flaky tests, report on failures, and find test suite problems. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] fix: [iceberg] Switch to OSS Spark and run Iceberg Spark tests in parallel [datafusion-comet]
hsiang-c commented on code in PR #1987:
URL: https://github.com/apache/datafusion-comet/pull/1987#discussion_r2183887303
##
dev/diffs/iceberg/1.8.1.diff:
##
@@ -195,6 +195,100 @@ index e2d2c7a..d23acef 100644
relocate 'org.apache.orc', 'org.apache.iceberg.shaded.org.apache.orc'
relocate 'io.airlift', 'org.apache.iceberg.shaded.io.airlift'
relocate 'org.apache.hc.client5',
'org.apache.iceberg.shaded.org.apache.hc.client5'
+diff --git
a/spark/v3.5/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/ExtensionsTestBase.java
b/spark/v3.5/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/ExtensionsTestBase.java
+index 578845e..9476f19 100644
+---
a/spark/v3.5/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/ExtensionsTestBase.java
b/spark/v3.5/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/ExtensionsTestBase.java
+@@ -57,6 +57,8 @@ public abstract class ExtensionsTestBase extends
CatalogTestBase {
+
.config("spark.sql.legacy.respectNullabilityInTextDatasetConversion", "true")
+ .config(
+ SQLConf.ADAPTIVE_EXECUTION_ENABLED().key(),
String.valueOf(RANDOM.nextBoolean()))
++.config("spark.plugins", "org.apache.spark.CometPlugin")
Review Comment:
@kazuyukitanimura
We're lucky in some cases b/c `TestBase` and `ExtensionsTestBase`
consolidate `SparkSession.Builder` in the abstract class.
Unfortunately, other test classes and `jmh` build their own `SparkSession`
each time :(
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] fix: [iceberg] Switch to OSS Spark and run Iceberg Spark tests in parallel [datafusion-comet]
kazuyukitanimura commented on code in PR #1987:
URL: https://github.com/apache/datafusion-comet/pull/1987#discussion_r2183864520
##
dev/diffs/iceberg/1.8.1.diff:
##
@@ -195,6 +195,100 @@ index e2d2c7a..d23acef 100644
relocate 'org.apache.orc', 'org.apache.iceberg.shaded.org.apache.orc'
relocate 'io.airlift', 'org.apache.iceberg.shaded.io.airlift'
relocate 'org.apache.hc.client5',
'org.apache.iceberg.shaded.org.apache.hc.client5'
+diff --git
a/spark/v3.5/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/ExtensionsTestBase.java
b/spark/v3.5/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/ExtensionsTestBase.java
+index 578845e..9476f19 100644
+---
a/spark/v3.5/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/ExtensionsTestBase.java
b/spark/v3.5/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/ExtensionsTestBase.java
+@@ -57,6 +57,8 @@ public abstract class ExtensionsTestBase extends
CatalogTestBase {
+
.config("spark.sql.legacy.respectNullabilityInTextDatasetConversion", "true")
+ .config(
+ SQLConf.ADAPTIVE_EXECUTION_ENABLED().key(),
String.valueOf(RANDOM.nextBoolean()))
++.config("spark.plugins", "org.apache.spark.CometPlugin")
Review Comment:
This makes sense but could be error prone. If there is a new test that uses
spark session, we miss enabling it.
Wondering if there is a good way to update all spark session at once...
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
