Re: [PR] fix: [iceberg] Switch to OSS Spark and run Iceberg Spark tests in parallel [datafusion-comet]

2025-08-19 Thread via GitHub


andygrove merged PR #1987:
URL: https://github.com/apache/datafusion-comet/pull/1987


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] fix: [iceberg] Switch to OSS Spark and run Iceberg Spark tests in parallel [datafusion-comet]

2025-08-19 Thread via GitHub


hsiang-c commented on code in PR #1987:
URL: https://github.com/apache/datafusion-comet/pull/1987#discussion_r2286152103


##
dev/diffs/iceberg/1.8.1.diff:
##
@@ -1,186 +1,721 @@
+diff --git a/build.gradle b/build.gradle
+index 7327b38..7967109 100644
+--- a/build.gradle
 b/build.gradle
+@@ -780,6 +780,13 @@ project(':iceberg-parquet') {
+ implementation project(':iceberg-core')
+ implementation project(':iceberg-common')
+ 
++
implementation("org.apache.datafusion:comet-spark-spark${sparkVersionsString}_${scalaVersion}:${libs.versions.comet.get()}")
 {
++  exclude group: 'org.apache.arrow'
++  exclude group: 'org.apache.parquet'
++  exclude group: 'org.apache.spark'
++  exclude group: 'org.apache.iceberg'
++}
++
+ implementation(libs.parquet.avro) {
+   exclude group: 'org.apache.avro', module: 'avro'
+   // already shaded by Parquet
 diff --git a/gradle/libs.versions.toml b/gradle/libs.versions.toml
-index 04ffa8f..d4107be 100644
+index 04ffa8f..cc0099c 100644
 --- a/gradle/libs.versions.toml
 +++ b/gradle/libs.versions.toml
-@@ -81,7 +81,7 @@ slf4j = "2.0.16"
+@@ -34,6 +34,7 @@ azuresdk-bom = "1.2.31"
+ awssdk-s3accessgrants = "2.3.0"
+ caffeine = "2.9.3"
+ calcite = "1.10.0"
++comet = "0.10.0-SNAPSHOT"
+ datasketches = "6.2.0"
+ delta-standalone = "3.3.0"
+ delta-spark = "3.3.0"
+@@ -81,7 +82,7 @@ slf4j = "2.0.16"
  snowflake-jdbc = "3.22.0"
  spark-hive33 = "3.3.4"
  spark-hive34 = "3.4.4"
 -spark-hive35 = "3.5.4"
-+spark-hive35 = "3.5.6-SNAPSHOT"
++spark-hive35 = "3.5.6"
  sqlite-jdbc = "3.48.0.0"
  testcontainers = "1.20.4"
  tez010 = "0.10.4"
-diff --git a/spark/v3.4/build.gradle b/spark/v3.4/build.gradle
-index 6eb26e8..50cefce 100644
 a/spark/v3.4/build.gradle
-+++ b/spark/v3.4/build.gradle
-@@ -75,7 +75,7 @@ 
project(":iceberg-spark:iceberg-spark-${sparkMajorVersion}_${scalaVersion}") {
-   exclude group: 'org.roaringbitmap'
- }
- 
--compileOnly 
"org.apache.datafusion:comet-spark-spark${sparkMajorVersion}_${scalaVersion}:0.5.0"
-+compileOnly 
"org.apache.datafusion:comet-spark-spark${sparkMajorVersion}_${scalaVersion}:0.10.0-SNAPSHOT"
- 
- implementation libs.parquet.column
- implementation libs.parquet.hadoop
-@@ -185,7 +185,7 @@ 
project(":iceberg-spark:iceberg-spark-extensions-${sparkMajorVersion}_${scalaVer
- testImplementation libs.avro.avro
- testImplementation libs.parquet.hadoop
- testImplementation libs.junit.vintage.engine
--testImplementation 
"org.apache.datafusion:comet-spark-spark${sparkMajorVersion}_${scalaVersion}:0.5.0"
-+testImplementation 
"org.apache.datafusion:comet-spark-spark${sparkMajorVersion}_${scalaVersion}:0.10.0-SNAPSHOT"
- 
- // Required because we remove antlr plugin dependencies from the compile 
configuration, see note above
- runtimeOnly libs.antlr.runtime
-@@ -260,6 +260,8 @@ 
project(":iceberg-spark:iceberg-spark-runtime-${sparkMajorVersion}_${scalaVersio
- integrationImplementation project(path: ':iceberg-hive-metastore', 
configuration: 'testArtifacts')
- integrationImplementation project(path: 
":iceberg-spark:iceberg-spark-${sparkMajorVersion}_${scalaVersion}", 
configuration: 'testArtifacts')
- integrationImplementation project(path: 
":iceberg-spark:iceberg-spark-extensions-${sparkMajorVersion}_${scalaVersion}", 
configuration: 'testArtifacts')
-+integrationImplementation project(path: ':iceberg-parquet')
-+integrationImplementation 
"org.apache.datafusion:comet-spark-spark${sparkMajorVersion}_${scalaVersion}:0.10.0-SNAPSHOT"
- 
- // runtime dependencies for running Hive Catalog based integration test
- integrationRuntimeOnly project(':iceberg-hive-metastore')
-@@ -297,8 +299,8 @@ 
project(":iceberg-spark:iceberg-spark-runtime-${sparkMajorVersion}_${scalaVersio
- relocate 'org.apache.avro', 'org.apache.iceberg.shaded.org.apache.avro'
- relocate 'avro.shaded', 'org.apache.iceberg.shaded.org.apache.avro.shaded'
- relocate 'com.thoughtworks.paranamer', 
'org.apache.iceberg.shaded.com.thoughtworks.paranamer'
--relocate 'org.apache.parquet', 
'org.apache.iceberg.shaded.org.apache.parquet'
--relocate 'shaded.parquet', 
'org.apache.iceberg.shaded.org.apache.parquet.shaded'
-+//relocate 'org.apache.parquet', 
'org.apache.iceberg.shaded.org.apache.parquet'
-+//relocate 'shaded.parquet', 
'org.apache.iceberg.shaded.org.apache.parquet.shaded'
- relocate 'org.apache.orc', 'org.apache.iceberg.shaded.org.apache.orc'
- relocate 'io.airlift', 'org.apache.iceberg.shaded.io.airlift'
- relocate 'org.apache.hc.client5', 
'org.apache.iceberg.shaded.org.apache.hc.client5'
-diff --git 
a/spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/SparkSQLProperties.java
 
b/spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/SparkSQLProperties.java
-index 0ca1236..87daef4 100644
 
a/spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/SparkSQLProperties.java
-+++ 
b/spark/v3.4/spark

Re: [PR] fix: [iceberg] Switch to OSS Spark and run Iceberg Spark tests in parallel [datafusion-comet]

2025-08-19 Thread via GitHub


andygrove commented on code in PR #1987:
URL: https://github.com/apache/datafusion-comet/pull/1987#discussion_r2285513673


##
dev/diffs/iceberg/1.8.1.diff:
##
@@ -1,186 +1,721 @@
+diff --git a/build.gradle b/build.gradle
+index 7327b38..7967109 100644
+--- a/build.gradle
 b/build.gradle
+@@ -780,6 +780,13 @@ project(':iceberg-parquet') {
+ implementation project(':iceberg-core')
+ implementation project(':iceberg-common')
+ 
++
implementation("org.apache.datafusion:comet-spark-spark${sparkVersionsString}_${scalaVersion}:${libs.versions.comet.get()}")
 {
++  exclude group: 'org.apache.arrow'
++  exclude group: 'org.apache.parquet'
++  exclude group: 'org.apache.spark'
++  exclude group: 'org.apache.iceberg'
++}
++
+ implementation(libs.parquet.avro) {
+   exclude group: 'org.apache.avro', module: 'avro'
+   // already shaded by Parquet
 diff --git a/gradle/libs.versions.toml b/gradle/libs.versions.toml
-index 04ffa8f..d4107be 100644
+index 04ffa8f..cc0099c 100644
 --- a/gradle/libs.versions.toml
 +++ b/gradle/libs.versions.toml
-@@ -81,7 +81,7 @@ slf4j = "2.0.16"
+@@ -34,6 +34,7 @@ azuresdk-bom = "1.2.31"
+ awssdk-s3accessgrants = "2.3.0"
+ caffeine = "2.9.3"
+ calcite = "1.10.0"
++comet = "0.10.0-SNAPSHOT"
+ datasketches = "6.2.0"
+ delta-standalone = "3.3.0"
+ delta-spark = "3.3.0"
+@@ -81,7 +82,7 @@ slf4j = "2.0.16"
  snowflake-jdbc = "3.22.0"
  spark-hive33 = "3.3.4"
  spark-hive34 = "3.4.4"
 -spark-hive35 = "3.5.4"
-+spark-hive35 = "3.5.6-SNAPSHOT"
++spark-hive35 = "3.5.6"
  sqlite-jdbc = "3.48.0.0"
  testcontainers = "1.20.4"
  tez010 = "0.10.4"
-diff --git a/spark/v3.4/build.gradle b/spark/v3.4/build.gradle
-index 6eb26e8..50cefce 100644
 a/spark/v3.4/build.gradle
-+++ b/spark/v3.4/build.gradle
-@@ -75,7 +75,7 @@ 
project(":iceberg-spark:iceberg-spark-${sparkMajorVersion}_${scalaVersion}") {
-   exclude group: 'org.roaringbitmap'
- }
- 
--compileOnly 
"org.apache.datafusion:comet-spark-spark${sparkMajorVersion}_${scalaVersion}:0.5.0"
-+compileOnly 
"org.apache.datafusion:comet-spark-spark${sparkMajorVersion}_${scalaVersion}:0.10.0-SNAPSHOT"
- 
- implementation libs.parquet.column
- implementation libs.parquet.hadoop
-@@ -185,7 +185,7 @@ 
project(":iceberg-spark:iceberg-spark-extensions-${sparkMajorVersion}_${scalaVer
- testImplementation libs.avro.avro
- testImplementation libs.parquet.hadoop
- testImplementation libs.junit.vintage.engine
--testImplementation 
"org.apache.datafusion:comet-spark-spark${sparkMajorVersion}_${scalaVersion}:0.5.0"
-+testImplementation 
"org.apache.datafusion:comet-spark-spark${sparkMajorVersion}_${scalaVersion}:0.10.0-SNAPSHOT"
- 
- // Required because we remove antlr plugin dependencies from the compile 
configuration, see note above
- runtimeOnly libs.antlr.runtime
-@@ -260,6 +260,8 @@ 
project(":iceberg-spark:iceberg-spark-runtime-${sparkMajorVersion}_${scalaVersio
- integrationImplementation project(path: ':iceberg-hive-metastore', 
configuration: 'testArtifacts')
- integrationImplementation project(path: 
":iceberg-spark:iceberg-spark-${sparkMajorVersion}_${scalaVersion}", 
configuration: 'testArtifacts')
- integrationImplementation project(path: 
":iceberg-spark:iceberg-spark-extensions-${sparkMajorVersion}_${scalaVersion}", 
configuration: 'testArtifacts')
-+integrationImplementation project(path: ':iceberg-parquet')
-+integrationImplementation 
"org.apache.datafusion:comet-spark-spark${sparkMajorVersion}_${scalaVersion}:0.10.0-SNAPSHOT"
- 
- // runtime dependencies for running Hive Catalog based integration test
- integrationRuntimeOnly project(':iceberg-hive-metastore')
-@@ -297,8 +299,8 @@ 
project(":iceberg-spark:iceberg-spark-runtime-${sparkMajorVersion}_${scalaVersio
- relocate 'org.apache.avro', 'org.apache.iceberg.shaded.org.apache.avro'
- relocate 'avro.shaded', 'org.apache.iceberg.shaded.org.apache.avro.shaded'
- relocate 'com.thoughtworks.paranamer', 
'org.apache.iceberg.shaded.com.thoughtworks.paranamer'
--relocate 'org.apache.parquet', 
'org.apache.iceberg.shaded.org.apache.parquet'
--relocate 'shaded.parquet', 
'org.apache.iceberg.shaded.org.apache.parquet.shaded'
-+//relocate 'org.apache.parquet', 
'org.apache.iceberg.shaded.org.apache.parquet'
-+//relocate 'shaded.parquet', 
'org.apache.iceberg.shaded.org.apache.parquet.shaded'
- relocate 'org.apache.orc', 'org.apache.iceberg.shaded.org.apache.orc'
- relocate 'io.airlift', 'org.apache.iceberg.shaded.io.airlift'
- relocate 'org.apache.hc.client5', 
'org.apache.iceberg.shaded.org.apache.hc.client5'
-diff --git 
a/spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/SparkSQLProperties.java
 
b/spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/SparkSQLProperties.java
-index 0ca1236..87daef4 100644
 
a/spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/SparkSQLProperties.java
-+++ 
b/spark/v3.4/spar

Re: [PR] fix: [iceberg] Switch to OSS Spark and run Iceberg Spark tests in parallel [datafusion-comet]

2025-08-07 Thread via GitHub


andygrove commented on code in PR #1987:
URL: https://github.com/apache/datafusion-comet/pull/1987#discussion_r2260902271


##
.github/workflows/iceberg_spark_test.yml:
##
@@ -63,24 +63,96 @@ jobs:
 with:
   rust-version: ${{env.RUST_VERSION}}
   jdk-version: ${{ matrix.java-version }}
+  - name: Build Comet
+shell: bash
+run: |
+  PROFILES="-Pspark-${{matrix.spark-version.short}} 
-Pscala-${{matrix.scala-version}}" make release
   - name: Setup Iceberg
 uses: ./.github/actions/setup-iceberg-builder
 with:
   iceberg-version: ${{ matrix.iceberg-version.full }}
-  scala-version: ${{ matrix.scala-version }}
-  spark-short-version: ${{ matrix.spark-version.short }}
-  - name: Build local Spark jar with comet patch
-uses: ./.github/actions/setup-spark-local-jar
-with:
-  spark-short-version: ${{ matrix.spark-version.short }}
-  spark-version: ${{ matrix.spark-version.full }}
-  scala-version: ${{ matrix.scala-version }}
   - name: Run Iceberg Spark tests
 run: |
   cd apache-iceberg
   rm -rf /root/.m2/repository/org/apache/parquet # somehow parquet 
cache requires cleanups
-  ENABLE_COMET=true ./gradlew -DsparkVersions=${{ 
matrix.spark-version.short }} -DscalaVersion=${{ matrix.scala-version }} 
-DflinkVersions= -DkafkaVersions= \
-:iceberg-spark:iceberg-spark-${{ matrix.spark-version.short }}_${{ 
matrix.scala-version }}:check \
-:iceberg-spark:iceberg-spark-extensions-${{ 
matrix.spark-version.short }}_${{ matrix.scala-version }}:check \
-:iceberg-spark:iceberg-spark-runtime-${{ 
matrix.spark-version.short }}_${{ matrix.scala-version }}:check \
+  ENABLE_COMET=true COMET_PARQUET_SCAN_IMPL=native_iceberg_compat 
./gradlew -DsparkVersions=${{ matrix.spark-version.short }} -DscalaVersion=${{ 
matrix.scala-version }} -DflinkVersions= -DkafkaVersions= \
+:iceberg-spark:iceberg-spark-${{ matrix.spark-version.short }}_${{ 
matrix.scala-version }}:test \
 -Pquick=true -x javadoc
+
+  iceberg-spark-extensions:
+if: contains(github.event.pull_request.title, '[iceberg]')
+strategy:
+  matrix:
+os: [ubuntu-24.04]
+java-version: [11, 17]
+iceberg-version: [{short: '1.8', full: '1.8.1'}]
+spark-version: [{short: '3.5', full: '3.5.6'}]
+scala-version: ['2.13']
+  fail-fast: false
+name: iceberg-spark-extensions/${{ matrix.os }}/iceberg-${{ 
matrix.iceberg-version.full }}/spark-${{ matrix.spark-version.full }}/scala-${{ 
matrix.scala-version }}/java-${{ matrix.java-version }}
+runs-on: ${{ matrix.os }}
+container:
+  image: amd64/rust
+env:
+  SPARK_LOCAL_IP: localhost
+steps:
+  - uses: actions/checkout@v4
+  - name: Setup Rust & Java toolchain
+uses: ./.github/actions/setup-builder
+with:
+  rust-version: ${{env.RUST_VERSION}}
+  jdk-version: ${{ matrix.java-version }}
+  - name: Build Comet
+shell: bash
+run: |
+  PROFILES="-Pspark-${{matrix.spark-version.short}} 
-Pscala-${{matrix.scala-version}}" make release
+  - name: Setup Iceberg
+uses: ./.github/actions/setup-iceberg-builder
+with:
+  iceberg-version: ${{ matrix.iceberg-version.full }}
+  - name: Run Iceberg Spark extensions tests
+run: |
+  cd apache-iceberg
+  rm -rf /root/.m2/repository/org/apache/parquet # somehow parquet 
cache requires cleanups
+  ENABLE_COMET=true COMET_PARQUET_SCAN_IMPL=native_iceberg_compat 
./gradlew -DsparkVersions=${{ matrix.spark-version.short }} -DscalaVersion=${{ 
matrix.scala-version }} -DflinkVersions= -DkafkaVersions= \
+:iceberg-spark:iceberg-spark-extensions-${{ 
matrix.spark-version.short }}_${{ matrix.scala-version }}:test \
+-Pquick=true -x javadoc
+
+  iceberg-spark-runtime:
+if: contains(github.event.pull_request.title, '[iceberg]')
+strategy:
+  matrix:
+os: [ubuntu-24.04]
+java-version: [11, 17]
+iceberg-version: [{short: '1.8', full: '1.8.1'}]
+spark-version: [{short: '3.5', full: '3.5.6'}]
+scala-version: ['2.13']
+  fail-fast: false
+name: iceberg-spark-runtime/${{ matrix.os }}/iceberg-${{ 
matrix.iceberg-version.full }}/spark-${{ matrix.spark-version.full }}/scala-${{ 
matrix.scala-version }}/java-${{ matrix.java-version }}
+runs-on: ${{ matrix.os }}
+container:
+  image: amd64/rust
+env:
+  SPARK_LOCAL_IP: localhost
+steps:
+  - uses: actions/checkout@v4
+  - name: Setup Rust & Java toolchain
+uses: ./.github/actions/setup-builder
+with:
+  rust-version: ${{env.RUST_VERSION}}
+  jdk-version: ${{ matrix.java-version }}
+  - name: Build Comet
+shell: bash
+run: |
+  PROFILES

Re: [PR] fix: [iceberg] Switch to OSS Spark and run Iceberg Spark tests in parallel [datafusion-comet]

2025-08-07 Thread via GitHub


andygrove commented on code in PR #1987:
URL: https://github.com/apache/datafusion-comet/pull/1987#discussion_r2260901481


##
.github/workflows/iceberg_spark_test.yml:
##
@@ -63,24 +63,96 @@ jobs:
 with:
   rust-version: ${{env.RUST_VERSION}}
   jdk-version: ${{ matrix.java-version }}
+  - name: Build Comet
+shell: bash
+run: |
+  PROFILES="-Pspark-${{matrix.spark-version.short}} 
-Pscala-${{matrix.scala-version}}" make release
   - name: Setup Iceberg
 uses: ./.github/actions/setup-iceberg-builder
 with:
   iceberg-version: ${{ matrix.iceberg-version.full }}
-  scala-version: ${{ matrix.scala-version }}
-  spark-short-version: ${{ matrix.spark-version.short }}
-  - name: Build local Spark jar with comet patch
-uses: ./.github/actions/setup-spark-local-jar
-with:
-  spark-short-version: ${{ matrix.spark-version.short }}
-  spark-version: ${{ matrix.spark-version.full }}
-  scala-version: ${{ matrix.scala-version }}
   - name: Run Iceberg Spark tests
 run: |
   cd apache-iceberg
   rm -rf /root/.m2/repository/org/apache/parquet # somehow parquet 
cache requires cleanups
-  ENABLE_COMET=true ./gradlew -DsparkVersions=${{ 
matrix.spark-version.short }} -DscalaVersion=${{ matrix.scala-version }} 
-DflinkVersions= -DkafkaVersions= \
-:iceberg-spark:iceberg-spark-${{ matrix.spark-version.short }}_${{ 
matrix.scala-version }}:check \
-:iceberg-spark:iceberg-spark-extensions-${{ 
matrix.spark-version.short }}_${{ matrix.scala-version }}:check \
-:iceberg-spark:iceberg-spark-runtime-${{ 
matrix.spark-version.short }}_${{ matrix.scala-version }}:check \
+  ENABLE_COMET=true COMET_PARQUET_SCAN_IMPL=native_iceberg_compat 
./gradlew -DsparkVersions=${{ matrix.spark-version.short }} -DscalaVersion=${{ 
matrix.scala-version }} -DflinkVersions= -DkafkaVersions= \
+:iceberg-spark:iceberg-spark-${{ matrix.spark-version.short }}_${{ 
matrix.scala-version }}:test \
 -Pquick=true -x javadoc
+
+  iceberg-spark-extensions:
+if: contains(github.event.pull_request.title, '[iceberg]')
+strategy:
+  matrix:
+os: [ubuntu-24.04]
+java-version: [11, 17]
+iceberg-version: [{short: '1.8', full: '1.8.1'}]
+spark-version: [{short: '3.5', full: '3.5.6'}]
+scala-version: ['2.13']
+  fail-fast: false
+name: iceberg-spark-extensions/${{ matrix.os }}/iceberg-${{ 
matrix.iceberg-version.full }}/spark-${{ matrix.spark-version.full }}/scala-${{ 
matrix.scala-version }}/java-${{ matrix.java-version }}
+runs-on: ${{ matrix.os }}
+container:
+  image: amd64/rust
+env:
+  SPARK_LOCAL_IP: localhost
+steps:
+  - uses: actions/checkout@v4
+  - name: Setup Rust & Java toolchain
+uses: ./.github/actions/setup-builder
+with:
+  rust-version: ${{env.RUST_VERSION}}
+  jdk-version: ${{ matrix.java-version }}
+  - name: Build Comet
+shell: bash
+run: |
+  PROFILES="-Pspark-${{matrix.spark-version.short}} 
-Pscala-${{matrix.scala-version}}" make release
+  - name: Setup Iceberg
+uses: ./.github/actions/setup-iceberg-builder
+with:
+  iceberg-version: ${{ matrix.iceberg-version.full }}
+  - name: Run Iceberg Spark extensions tests
+run: |
+  cd apache-iceberg
+  rm -rf /root/.m2/repository/org/apache/parquet # somehow parquet 
cache requires cleanups
+  ENABLE_COMET=true COMET_PARQUET_SCAN_IMPL=native_iceberg_compat 
./gradlew -DsparkVersions=${{ matrix.spark-version.short }} -DscalaVersion=${{ 
matrix.scala-version }} -DflinkVersions= -DkafkaVersions= \

Review Comment:
   Same commet as above.
   
   ```suggestion
 ENABLE_COMET=true ./gradlew -DsparkVersions=${{ 
matrix.spark-version.short }} -DscalaVersion=${{ matrix.scala-version }} 
-DflinkVersions= -DkafkaVersions= \
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] fix: [iceberg] Switch to OSS Spark and run Iceberg Spark tests in parallel [datafusion-comet]

2025-08-07 Thread via GitHub


andygrove commented on code in PR #1987:
URL: https://github.com/apache/datafusion-comet/pull/1987#discussion_r2260899713


##
.github/workflows/iceberg_spark_test.yml:
##
@@ -63,24 +63,96 @@ jobs:
 with:
   rust-version: ${{env.RUST_VERSION}}
   jdk-version: ${{ matrix.java-version }}
+  - name: Build Comet
+shell: bash
+run: |
+  PROFILES="-Pspark-${{matrix.spark-version.short}} 
-Pscala-${{matrix.scala-version}}" make release
   - name: Setup Iceberg
 uses: ./.github/actions/setup-iceberg-builder
 with:
   iceberg-version: ${{ matrix.iceberg-version.full }}
-  scala-version: ${{ matrix.scala-version }}
-  spark-short-version: ${{ matrix.spark-version.short }}
-  - name: Build local Spark jar with comet patch
-uses: ./.github/actions/setup-spark-local-jar
-with:
-  spark-short-version: ${{ matrix.spark-version.short }}
-  spark-version: ${{ matrix.spark-version.full }}
-  scala-version: ${{ matrix.scala-version }}
   - name: Run Iceberg Spark tests
 run: |
   cd apache-iceberg
   rm -rf /root/.m2/repository/org/apache/parquet # somehow parquet 
cache requires cleanups
-  ENABLE_COMET=true ./gradlew -DsparkVersions=${{ 
matrix.spark-version.short }} -DscalaVersion=${{ matrix.scala-version }} 
-DflinkVersions= -DkafkaVersions= \
-:iceberg-spark:iceberg-spark-${{ matrix.spark-version.short }}_${{ 
matrix.scala-version }}:check \
-:iceberg-spark:iceberg-spark-extensions-${{ 
matrix.spark-version.short }}_${{ matrix.scala-version }}:check \
-:iceberg-spark:iceberg-spark-runtime-${{ 
matrix.spark-version.short }}_${{ matrix.scala-version }}:check \
+  ENABLE_COMET=true COMET_PARQUET_SCAN_IMPL=native_iceberg_compat 
./gradlew -DsparkVersions=${{ matrix.spark-version.short }} -DscalaVersion=${{ 
matrix.scala-version }} -DflinkVersions= -DkafkaVersions= \

Review Comment:
   
   I don't think that we want to set `COMET_PARQUET_SCAN_IMPL` here. The 
Iceberg integration explicitly calls low level APIS for `native_comet` scan.
   
   ```suggestion
 ENABLE_COMET=true ./gradlew -DsparkVersions=${{ 
matrix.spark-version.short }} -DscalaVersion=${{ matrix.scala-version }} 
-DflinkVersions= -DkafkaVersions= \
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] fix: [iceberg] Switch to OSS Spark and run Iceberg Spark tests in parallel [datafusion-comet]

2025-07-16 Thread via GitHub


parthchandra commented on PR #1987:
URL: 
https://github.com/apache/datafusion-comet/pull/1987#issuecomment-3082076029

   @hsiang-c created https://github.com/apache/datafusion-comet/issues/2033 to 
track this issue


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] fix: [iceberg] Switch to OSS Spark and run Iceberg Spark tests in parallel [datafusion-comet]

2025-07-15 Thread via GitHub


hsiang-c commented on PR #1987:
URL: 
https://github.com/apache/datafusion-comet/pull/1987#issuecomment-3075575929

   Most of the exceptions in Iceberg Spark SQL Tests can be reproduced by
   
   1. Follow the official guide to build Comet and Iceberg, configure Spark 
shell and populate the Iceberg table: 
https://datafusion.apache.org/comet/user-guide/iceberg.html
   2. Query Iceberg metadata tables with an operator. Here is an example:
   
   ```sql
   -- default is the catalog name used in local HadoopCatalog setup
   scala> spark.sql(s"SELECT COUNT(*) from default.t1.snapshots").show()
   
   25/07/15 13:06:16 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
   java.lang.ClassCastException: class 
org.apache.iceberg.spark.source.StructInternalRow cannot be cast to class 
org.apache.spark.sql.vectorized.ColumnarBatch 
(org.apache.iceberg.spark.source.StructInternalRow is in unnamed module of 
loader scala.reflect.internal.util.ScalaClassLoader$URLClassLoader @19ac93d2; 
org.apache.spark.sql.vectorized.ColumnarBatch is in unnamed module of loader 
'app')
at 
org.apache.spark.sql.comet.CometBatchScanExec$$anon$1.next(CometBatchScanExec.scala:68)
at 
org.apache.spark.sql.comet.CometBatchScanExec$$anon$1.next(CometBatchScanExec.scala:57)
at 
org.apache.comet.CometBatchIterator.hasNext(CometBatchIterator.java:51)
at org.apache.comet.Native.executePlan(Native Method)
at 
org.apache.comet.CometExecIterator.$anonfun$getNextBatch$2(CometExecIterator.scala:155)
at 
org.apache.comet.CometExecIterator.$anonfun$getNextBatch$2$adapted(CometExecIterator.scala:154)
at org.apache.comet.vector.NativeUtil.getNextBatch(NativeUtil.scala:157)
at 
org.apache.comet.CometExecIterator.$anonfun$getNextBatch$1(CometExecIterator.scala:154)
at org.apache.comet.Tracing$.withTrace(Tracing.scala:31)
at 
org.apache.comet.CometExecIterator.getNextBatch(CometExecIterator.scala:152)
at 
org.apache.comet.CometExecIterator.hasNext(CometExecIterator.scala:203)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at 
org.apache.comet.CometBatchIterator.hasNext(CometBatchIterator.java:50)
at org.apache.comet.Native.executePlan(Native Method)
at 
org.apache.comet.CometExecIterator.$anonfun$getNextBatch$2(CometExecIterator.scala:155)
at 
org.apache.comet.CometExecIterator.$anonfun$getNextBatch$2$adapted(CometExecIterator.scala:154)
at org.apache.comet.vector.NativeUtil.getNextBatch(NativeUtil.scala:157)
at 
org.apache.comet.CometExecIterator.$anonfun$getNextBatch$1(CometExecIterator.scala:154)
at org.apache.comet.Tracing$.withTrace(Tracing.scala:31)
at 
org.apache.comet.CometExecIterator.getNextBatch(CometExecIterator.scala:152)
at 
org.apache.comet.CometExecIterator.hasNext(CometExecIterator.scala:203)
at 
org.apache.spark.sql.comet.execution.shuffle.CometNativeShuffleWriter.write(CometNativeShuffleWriter.scala:106)
at 
org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
at 
org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
at org.apache.spark.scheduler.Task.run(Task.scala:141)
at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
at 
org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
at 
org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:840)
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] fix: [iceberg] Switch to OSS Spark and run Iceberg Spark tests in parallel [datafusion-comet]

2025-07-08 Thread via GitHub


andygrove commented on PR #1987:
URL: 
https://github.com/apache/datafusion-comet/pull/1987#issuecomment-3049800502

   I see that some tests are failing due to 
https://github.com/apache/datafusion-comet/issues/1982
   
   ```
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
in stage 1764.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
1764.0 (TID 3312) (localhost executor driver): java.lang.ClassCastException: 
class org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be 
cast to class org.apache.spark.sql.vectorized.ColumnarBatch 
(org.apache.spark.sql.catalyst.expressions.GenericInternalRow and 
org.apache.spark.sql.vectorized.ColumnarBatch are in unnamed module of loader 
'app')
```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] fix: [iceberg] Switch to OSS Spark and run Iceberg Spark tests in parallel [datafusion-comet]

2025-07-03 Thread via GitHub


codecov-commenter commented on PR #1987:
URL: 
https://github.com/apache/datafusion-comet/pull/1987#issuecomment-3033918578

   ## 
[Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1987?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 Report
   All modified and coverable lines are covered by tests :white_check_mark:
   > Project coverage is 32.72%. Comparing base 
[(`f09f8af`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/f09f8af64c6599255e116a376f4f008f2fd63b43?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 to head 
[(`b574107`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/b574107723ec9e9c65fe641e39e55bd47ac4b515?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   > Report is 309 commits behind head on main.
   
   Additional details and impacted files
   
   
   ```diff
   @@  Coverage Diff  @@
   ##   main#1987   +/-   ##
   =
   - Coverage 56.12%   32.72%   -23.41% 
   + Complexity  976  799  -177 
   =
 Files   119  133   +14 
 Lines 1174312951 +1208 
 Branches   2251 2411  +160 
   =
   - Hits   6591 4238 -2353 
   - Misses 4012 7759 +3747 
   + Partials   1140  954  -186 
   ```
   
   
   
   [:umbrella: View full report in Codecov by 
Sentry](https://app.codecov.io/gh/apache/datafusion-comet/pull/1987?dropdown=coverage&src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   
   :loudspeaker: Have feedback on the report? [Share it 
here](https://about.codecov.io/codecov-pr-comment-feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   
:rocket: New features to boost your workflow: 
   
   - :snowflake: [Test 
Analytics](https://docs.codecov.com/docs/test-analytics): Detect flaky tests, 
report on failures, and find test suite problems.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] fix: [iceberg] Switch to OSS Spark and run Iceberg Spark tests in parallel [datafusion-comet]

2025-07-03 Thread via GitHub


hsiang-c commented on code in PR #1987:
URL: https://github.com/apache/datafusion-comet/pull/1987#discussion_r2183887303


##
dev/diffs/iceberg/1.8.1.diff:
##
@@ -195,6 +195,100 @@ index e2d2c7a..d23acef 100644
  relocate 'org.apache.orc', 'org.apache.iceberg.shaded.org.apache.orc'
  relocate 'io.airlift', 'org.apache.iceberg.shaded.io.airlift'
  relocate 'org.apache.hc.client5', 
'org.apache.iceberg.shaded.org.apache.hc.client5'
+diff --git 
a/spark/v3.5/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/ExtensionsTestBase.java
 
b/spark/v3.5/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/ExtensionsTestBase.java
+index 578845e..9476f19 100644
+--- 
a/spark/v3.5/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/ExtensionsTestBase.java
 
b/spark/v3.5/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/ExtensionsTestBase.java
+@@ -57,6 +57,8 @@ public abstract class ExtensionsTestBase extends 
CatalogTestBase {
+ 
.config("spark.sql.legacy.respectNullabilityInTextDatasetConversion", "true")
+ .config(
+ SQLConf.ADAPTIVE_EXECUTION_ENABLED().key(), 
String.valueOf(RANDOM.nextBoolean()))
++.config("spark.plugins", "org.apache.spark.CometPlugin")

Review Comment:
   @kazuyukitanimura
   
   We're lucky in some cases  b/c `TestBase` and `ExtensionsTestBase` 
consolidate `SparkSession.Builder` in the abstract class.
   
   Unfortunately, other test classes and `jmh` build their own `SparkSession` 
each time :(



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] fix: [iceberg] Switch to OSS Spark and run Iceberg Spark tests in parallel [datafusion-comet]

2025-07-03 Thread via GitHub


kazuyukitanimura commented on code in PR #1987:
URL: https://github.com/apache/datafusion-comet/pull/1987#discussion_r2183864520


##
dev/diffs/iceberg/1.8.1.diff:
##
@@ -195,6 +195,100 @@ index e2d2c7a..d23acef 100644
  relocate 'org.apache.orc', 'org.apache.iceberg.shaded.org.apache.orc'
  relocate 'io.airlift', 'org.apache.iceberg.shaded.io.airlift'
  relocate 'org.apache.hc.client5', 
'org.apache.iceberg.shaded.org.apache.hc.client5'
+diff --git 
a/spark/v3.5/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/ExtensionsTestBase.java
 
b/spark/v3.5/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/ExtensionsTestBase.java
+index 578845e..9476f19 100644
+--- 
a/spark/v3.5/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/ExtensionsTestBase.java
 
b/spark/v3.5/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/ExtensionsTestBase.java
+@@ -57,6 +57,8 @@ public abstract class ExtensionsTestBase extends 
CatalogTestBase {
+ 
.config("spark.sql.legacy.respectNullabilityInTextDatasetConversion", "true")
+ .config(
+ SQLConf.ADAPTIVE_EXECUTION_ENABLED().key(), 
String.valueOf(RANDOM.nextBoolean()))
++.config("spark.plugins", "org.apache.spark.CometPlugin")

Review Comment:
   This makes sense but could be error prone. If there is a new test that uses 
spark session, we miss enabling it.
   Wondering if there is a good way to update all spark session at once...



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]