Re: [PR] feat: Encapsulate Parquet objects [datafusion-comet]

2025-06-25 Thread via GitHub


huaxingao commented on code in PR #1920:
URL: https://github.com/apache/datafusion-comet/pull/1920#discussion_r2167873990


##
common/src/main/java/org/apache/comet/parquet/FileReader.java:
##
@@ -128,6 +134,48 @@ public FileReader(InputFile file, ParquetReadOptions 
options, ReadOptions cometO
 this(file, null, options, cometOptions, null);
   }
 
+  /** This constructor is called from Apache Iceberg. */
+  public FileReader(
+  Path path,
+  Configuration conf,
+  ReadOptions cometOptions,
+  Map properties,
+  Long start,
+  Long length,
+  byte[] fileEncryptionKey,
+  byte[] fileAADPrefix)
+  throws IOException {
+ParquetReadOptions options =
+buildParquetReadOptions(conf, properties, start, length, 
fileEncryptionKey, fileAADPrefix);
+this.converter = new ParquetMetadataConverter(options);
+this.file = HadoopInputFile.fromPath(path, conf);

Review Comment:
   I think we can use `CometInputFile`. Modified.



##
common/src/main/java/org/apache/comet/parquet/FileReader.java:
##
@@ -209,6 +257,55 @@ public void setRequestedSchema(List 
projection) {
 }
   }
 
+  /** This method is called from Apache Iceberg. */
+  public void setRequestedSchemaFromSpecs(List specList) {
+paths.clear();
+for (ParquetColumnSpec colSpec : specList) {
+  ColumnDescriptor descriptor = Utils.buildColumnDescriptor(colSpec);
+  paths.put(ColumnPath.get(colSpec.getPath()), descriptor);
+}
+  }
+
+  private static ParquetReadOptions buildParquetReadOptions(
+  Configuration conf,
+  Map properties,
+  Long start,
+  Long length,
+  byte[] fileEncryptionKey,
+  byte[] fileAADPrefix) {
+
+Collection readPropertiesToRemove =
+Sets.newHashSet(
+"parquet.read.filter",

Review Comment:
   Changed



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] feat: Encapsulate Parquet objects [datafusion-comet]

2025-06-25 Thread via GitHub


hsiang-c commented on code in PR #1920:
URL: https://github.com/apache/datafusion-comet/pull/1920#discussion_r2167782872


##
common/src/main/java/org/apache/comet/parquet/FileReader.java:
##
@@ -128,6 +134,48 @@ public FileReader(InputFile file, ParquetReadOptions 
options, ReadOptions cometO
 this(file, null, options, cometOptions, null);
   }
 
+  /** This constructor is called from Apache Iceberg. */
+  public FileReader(
+  Path path,
+  Configuration conf,
+  ReadOptions cometOptions,
+  Map properties,
+  Long start,
+  Long length,
+  byte[] fileEncryptionKey,
+  byte[] fileAADPrefix)
+  throws IOException {
+ParquetReadOptions options =
+buildParquetReadOptions(conf, properties, start, length, 
fileEncryptionKey, fileAADPrefix);
+this.converter = new ParquetMetadataConverter(options);
+this.file = HadoopInputFile.fromPath(path, conf);

Review Comment:
   Can we use `CometInputFile` here?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] feat: Encapsulate Parquet objects [datafusion-comet]

2025-06-25 Thread via GitHub


hsiang-c commented on code in PR #1920:
URL: https://github.com/apache/datafusion-comet/pull/1920#discussion_r2167786141


##
common/src/main/java/org/apache/comet/parquet/FileReader.java:
##
@@ -209,6 +257,55 @@ public void setRequestedSchema(List 
projection) {
 }
   }
 
+  /** This method is called from Apache Iceberg. */
+  public void setRequestedSchemaFromSpecs(List specList) {
+paths.clear();
+for (ParquetColumnSpec colSpec : specList) {
+  ColumnDescriptor descriptor = Utils.buildColumnDescriptor(colSpec);
+  paths.put(ColumnPath.get(colSpec.getPath()), descriptor);
+}
+  }
+
+  private static ParquetReadOptions buildParquetReadOptions(
+  Configuration conf,
+  Map properties,
+  Long start,
+  Long length,
+  byte[] fileEncryptionKey,
+  byte[] fileAADPrefix) {
+
+Collection readPropertiesToRemove =
+Sets.newHashSet(
+"parquet.read.filter",

Review Comment:
   Could we use these constants since they are not in the method signature?
   
   ```java
   ParquetInputFormat.UNBOUND_RECORD_FILTER,
   ParquetInputFormat.FILTER_PREDICATE,
   ParquetInputFormat.READ_SUPPORT_CLASS,
   EncryptionPropertiesFactory.CRYPTO_FACTORY_CLASS_PROPERTY_NAME
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] feat: Encapsulate Parquet objects [datafusion-comet]

2025-06-25 Thread via GitHub


hsiang-c commented on code in PR #1920:
URL: https://github.com/apache/datafusion-comet/pull/1920#discussion_r2167782872


##
common/src/main/java/org/apache/comet/parquet/FileReader.java:
##
@@ -128,6 +134,48 @@ public FileReader(InputFile file, ParquetReadOptions 
options, ReadOptions cometO
 this(file, null, options, cometOptions, null);
   }
 
+  /** This constructor is called from Apache Iceberg. */
+  public FileReader(
+  Path path,
+  Configuration conf,
+  ReadOptions cometOptions,
+  Map properties,
+  Long start,
+  Long length,
+  byte[] fileEncryptionKey,
+  byte[] fileAADPrefix)
+  throws IOException {
+ParquetReadOptions options =
+buildParquetReadOptions(conf, properties, start, length, 
fileEncryptionKey, fileAADPrefix);
+this.converter = new ParquetMetadataConverter(options);
+this.file = HadoopInputFile.fromPath(path, conf);

Review Comment:
   Can we use `CometInputFile` here? According to the doc
   
   ```java
   /**
* A Parquet {@link InputFile} implementation that's similar to {@link
* org.apache.parquet.hadoop.util.HadoopInputFile}, but with optimizations 
introduced in Hadoop 3.x,
* for S3 specifically.
*/
   public class CometInputFile implements InputFile {
   // omitted
   }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] feat: Encapsulate Parquet objects [datafusion-comet]

2025-06-25 Thread via GitHub


hsiang-c commented on code in PR #1920:
URL: https://github.com/apache/datafusion-comet/pull/1920#discussion_r2167786141


##
common/src/main/java/org/apache/comet/parquet/FileReader.java:
##
@@ -209,6 +257,55 @@ public void setRequestedSchema(List 
projection) {
 }
   }
 
+  /** This method is called from Apache Iceberg. */
+  public void setRequestedSchemaFromSpecs(List specList) {
+paths.clear();
+for (ParquetColumnSpec colSpec : specList) {
+  ColumnDescriptor descriptor = Utils.buildColumnDescriptor(colSpec);
+  paths.put(ColumnPath.get(colSpec.getPath()), descriptor);
+}
+  }
+
+  private static ParquetReadOptions buildParquetReadOptions(
+  Configuration conf,
+  Map properties,
+  Long start,
+  Long length,
+  byte[] fileEncryptionKey,
+  byte[] fileAADPrefix) {
+
+Collection readPropertiesToRemove =
+Sets.newHashSet(
+"parquet.read.filter",

Review Comment:
   Could we use constants such as  `ParquetInputFormat.UNBOUND_RECORD_FILTER` 
or doing so will again create the dependency we don't want?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] feat: Encapsulate Parquet objects [datafusion-comet]

2025-06-25 Thread via GitHub


hsiang-c commented on PR #1920:
URL: 
https://github.com/apache/datafusion-comet/pull/1920#issuecomment-3006454989

   
   > In my local copy of Iceberg, I updated SparkBatchQueryScan to implement 
SupportsComet.  
   
   @andygrove You can apply the diff to Iceberg 1.8.1 for Comet support. I'll 
add Iceberg `1.9.1` diff soon. 
   
   
https://github.com/apache/datafusion-comet/blob/main/dev/diffs/iceberg/1.8.1.diff#L236


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] feat: Encapsulate Parquet objects [datafusion-comet]

2025-06-25 Thread via GitHub


andygrove commented on PR #1920:
URL: 
https://github.com/apache/datafusion-comet/pull/1920#issuecomment-3006159227

   I am now able to get this working end-to-end :tada: 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] feat: Encapsulate Parquet objects [datafusion-comet]

2025-06-25 Thread via GitHub


andygrove commented on PR #1920:
URL: 
https://github.com/apache/datafusion-comet/pull/1920#issuecomment-3005044455

   The `NoSuchMethodError` error was my mistake. I was building Comet for Spark 
3.4 but using the jar for Spark 3.5 when testing. I no longer see any errors, 
but Comet is not accelerating the Iceberg scan:
   
   ```
   COMET: Unsupported scan: 
org.apache.iceberg.spark.source.SparkBatchQueryScan. Comet Scan only supports 
Parquet and Iceberg Parquet file formats, BatchScan spark_catalog.default.t1 is 
not supported
   ```
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] feat: Encapsulate Parquet objects [datafusion-comet]

2025-06-25 Thread via GitHub


andygrove commented on code in PR #1920:
URL: https://github.com/apache/datafusion-comet/pull/1920#discussion_r2167562231


##
common/pom.xml:
##
@@ -26,7 +26,7 @@ under the License.
   
 org.apache.datafusion
 
comet-parent-spark${spark.version.short}_${scala.binary.version}
-0.9.0-SNAPSHOT
+0.9.0.1-SNAPSHOT

Review Comment:
   @huaxingao I'm assuming that the version bump was not intentional?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] feat: Encapsulate Parquet objects [datafusion-comet]

2025-06-25 Thread via GitHub


snmvaughan commented on PR #1920:
URL: 
https://github.com/apache/datafusion-comet/pull/1920#issuecomment-3005663410

   I'm surprised we don't have a Comet interface which provides the access 
needed by Comet, in combination with an Iceberg implementation of that 
interface that wraps the Iceberg details.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] feat: Encapsulate Parquet objects [datafusion-comet]

2025-06-25 Thread via GitHub


andygrove commented on PR #1920:
URL: 
https://github.com/apache/datafusion-comet/pull/1920#issuecomment-3005415617

   In my local copy of Iceberg, I updated `SparkBatchQueryScan` to implement 
`SupportsComet`. When using both the Comet and Iceberg jars on the classpath, 
Comet is unable to recognize that `SparkBatchQueryScan` implements 
`SupportsComet`, and I think this is because there are two copies of the 
`SupportsComet` interface (one in the Comet jar and one in the Iceberg jar).
   
   If I just have the Iceberg jar on the classpath then Comet does try and 
accelerate the scan, but now I am running into Arrow shading issues.
   
   Spark 3.4.3 uses Arrow 11.0.0
   Comet uses Arrow 18.3.0
   Iceberg uses Arrow 15.0.2
   
   It is not possible to shade the JNI classes in Arrow because the Java names 
have to match the function names in the native code, so this is quite 
challenging to resolve. We may need to look into using class loader isolation.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] feat: Encapsulate Parquet objects [datafusion-comet]

2025-06-25 Thread via GitHub


andygrove commented on PR #1920:
URL: 
https://github.com/apache/datafusion-comet/pull/1920#issuecomment-3004952533

   I added the following method to `FileReader` locally:
   
   ```scala
 /** Sets the projected columns to be read later via {@link 
#readNextRowGroup()} */
 public void setRequestedSchemaFromSpecs(List 
projection) {
   paths.clear();
   for (ParquetColumnSpec columnSpec : projection) {
 ColumnDescriptor col = Utils.buildColumnDescriptor(columnSpec);
 paths.put(ColumnPath.get(col.getPath()), col);
   }
 }
   ```
   
   I can now compile Iceberg, but I get an exception at runtime, and I do not 
yet understand why:
   
   ```
   Welcome to
   __
/ __/__  ___ _/ /__
   _\ \/ _ \/ _ `/ __/  '_/
  /___/ .__/\_,_/_/ /_/\_\   version 3.4.3
 /_/

   Using Scala version 2.12.17 (OpenJDK 64-Bit Server VM, Java 11.0.27)
   Type in expressions to have them evaluated.
   Type :help for more information.
   
   scala> spark.sql(s"CREATE TABLE IF NOT EXISTS t1 (c0 INT, c1 STRING) USING 
iceberg")
   25/06/25 08:12:29 INFO core/src/lib.rs: Comet native library version 0.9.0 
initialized
   25/06/25 08:12:29 WARN CometExecRule: Comet cannot execute some parts of 
this plan natively (set spark.comet.explainFallback.enabled=false to disable 
this logging):
CreateTable [COMET: CreateTable is not supported]
   
   res0: org.apache.spark.sql.DataFrame = []
   
   scala> spark.sql(s"INSERT INTO t1 VALUES ${(0 until 1).map(i => (i, 
i)).mkString(",")}")
   25/06/25 08:12:32 WARN CometExecRule: Comet cannot execute some parts of 
this plan natively (set spark.comet.explainFallback.enabled=false to disable 
this logging):
AppendData [COMET: AppendData is not supported]
   +-  LocalTableScan [COMET: LocalTableScan is not supported]
   
   res1: org.apache.spark.sql.DataFrame = []
   
   
   scala> spark.sql(s"SELECT * from t1").show()
   25/06/25 08:12:35 WARN CometExecRule: Comet cannot execute some parts of 
this plan natively (set spark.comet.explainFallback.enabled=false to disable 
this logging):
CollectLimit [COMET: CollectLimit is not supported]
   +- Project
  +-  BatchScan spark_catalog.default.t1 [COMET: Unsupported scan: 
org.apache.iceberg.spark.source.SparkBatchQueryScan. Comet Scan only supports 
Parquet and Iceberg Parquet file formats, BatchScan spark_catalog.default.t1 is 
not supported]
   
   25/06/25 08:12:35 WARN CheckAllocator: More than one 
DefaultAllocationManager on classpath. Choosing first found
   25/06/25 08:12:35 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 32)
   java.lang.NoSuchMethodError: 'void 
org.apache.comet.parquet.FileReader.setRequestedSchemaFromSpecs(java.util.List)'
at 
org.apache.iceberg.parquet.CometVectorizedParquetReader$FileIterator.newCometReader(CometVectorizedParquetReader.java:222)
   ```
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] feat: Encapsulate Parquet objects [datafusion-comet]

2025-06-25 Thread via GitHub


andygrove commented on PR #1920:
URL: 
https://github.com/apache/datafusion-comet/pull/1920#issuecomment-3004865649

   I am testing this locally now. There is still one API call that references a 
Parquet class, causing Iceberg to fail to compile:
   
   ```
   
/home/andy/git/apache/iceberg/parquet/src/main/java/org/apache/iceberg/parquet/CometVectorizedParquetReader.java:222:
 error: cannot find symbol
   fileReader.setRequestedSchemaFromSpecs(specs);
   ```
   
   This method signature is `public void 
setRequestedSchema(List projection)`
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] feat: Encapsulate Parquet objects [datafusion-comet]

2025-06-24 Thread via GitHub


huaxingao commented on PR #1920:
URL: 
https://github.com/apache/datafusion-comet/pull/1920#issuecomment-3002056893

   cc @andygrove @parthchandra @hsiang-c 
   Could you please review this PR? Thanks a lot!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] feat: Encapsulate Parquet objects [datafusion-comet]

2025-06-24 Thread via GitHub


huaxingao commented on PR #1920:
URL: 
https://github.com/apache/datafusion-comet/pull/1920#issuecomment-3002050043

   I have a draft iceberg [PR](https://github.com/apache/iceberg/pull/13378)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] feat: Encapsulate Parquet objects [datafusion-comet]

2025-06-24 Thread via GitHub


andygrove commented on code in PR #1920:
URL: https://github.com/apache/datafusion-comet/pull/1920#discussion_r2164143550


##
common/src/main/java/org/apache/comet/parquet/ColumnReader.java:
##
@@ -126,6 +126,13 @@ public void setPageReader(PageReader pageReader) throws 
IOException {
 }
   }
 
+  /** This method is called from Apache Iceberg. */
+  public void setRowGroupReader(RowGroupReader rowGroupReader, 
ParquetColumnSpec columnSpec)

Review Comment:
   I filed https://github.com/apache/datafusion-comet/issues/1928



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] feat: Encapsulate Parquet objects [datafusion-comet]

2025-06-23 Thread via GitHub


huaxingao commented on PR #1920:
URL: 
https://github.com/apache/datafusion-comet/pull/1920#issuecomment-2997669574

   I somehow got some strange errors:
   ```
   [info] ParquetV1QuerySuite:
   [info] - simple select queries (635 milliseconds)
   [info] - appending (254 milliseconds)
   [info] - overwriting (562 milliseconds)
   [info] - self-join (258 milliseconds)
   [info] - nested data - struct with array field (406 milliseconds)
   [info] - nested data - array of struct (390 milliseconds)
   [info] - SPARK-1913 regression: columns only referenced by pushed down 
filters should remain (363 milliseconds)
   [info] - SPARK-5309 strings stored using dictionary compression in parquet 
(821 milliseconds)
   [info] - SPARK-6917 DecimalType should work with non-native types (201 
milliseconds)
   [info] - SPARK-10634 timestamp written and read as INT64 - truncation (331 
milliseconds)
   [info] - SPARK-36182, SPARK-47368: writing and reading TimestampNTZType 
column (750 milliseconds)
   17:56:39.487 ERROR org.apache.spark.executor.Executor: Exception in task 1.0 
in stage 140.0 (TID 215)
   org.apache.spark.SparkException: Encountered error while reading file 
file:///__w/datafusion-comet/datafusion-comet/apache-spark/target/tmp/spark-8a25ab9e-6dec-41e3-8354-ceb405cbd9d5/part-1-22a1399a-9d0c-4972-bf97-e6bdffa1c9cd-c000.snappy.parquet.
 Details:
at 
org.apache.spark.sql.errors.QueryExecutionErrors$.cannotReadFilesError(QueryExecutionErrors.scala:864)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:296)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:131)
at 
org.apache.spark.sql.comet.CometScanExec$$anon$1.hasNext(CometScanExec.scala:266)
   ```
   
   ```
   17:57:05.557 WARN 
org.apache.spark.sql.execution.datasources.parquet.ParquetV1QuerySuite: 
   
   = POSSIBLE THREAD LEAK IN SUITE 
o.a.s.sql.execution.datasources.parquet.ParquetV1QuerySuite, threads: 
QueryStageCreator-47 (daemon=true), QueryStageCreator-52 (daemon=true), 
QueryStageCreator-58 (daemon=true), QueryStageCreator-54 (daemon=true), 
comet-broadcast-exchange-130 (daemon=true), shuffle-boss-714-1 (daemon=true), 
QueryStageCreator-57 (daemon=true), QueryStageCreator-46 (daemon=true), 
QueryStageCreator-59 (daemon=true), rpc-boss-711-1 (daemon=true), 
QueryStageCreator-51 (daemon=true), QuerySta...
   
   [info] 
org.apache.spark.sql.execution.datasources.parquet.ParquetV1QuerySuite *** 
ABORTED *** (31 seconds, 172 milliseconds)
   [info]   The code passed to eventually never returned normally. Attempted 15 
times over 10.06552673901 seconds. Last failure message: There are 12 
possibly leaked file streams.. (SharedSparkSession.scala:189)
   [info]   org.scalatest.exceptions.TestFailedDueToTimeoutException:
   [info]   at 
org.scalatest.enablers.Retrying$$anon$4.tryTryAgain$2(Retrying.scala:219)
   [info]   at org.scalatest.enablers.Retrying$$anon$4.retry(Retrying.scala:226)
   [info]   at 
org.scalatest.concurrent.Eventually.eventually(Eventually.scala:313)
   [info]   at 
org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:312)
   [info]   at 
org.apache.spark.sql.execution.datasources.parquet.ParquetQuerySuite.eventually(ParquetQuerySuite.scala:47)
   [info]   at 
org.apache.spark.sql.test.SharedSparkSessionBase.afterEach(SharedSparkSession.scala:189)
   [info]   at 
org.apache.spark.sql.test.SharedSparkSessionBase.afterEach$(SharedSparkSession.scala:183)
   [info]   at 
org.apache.spark.sql.execution.datasources.parquet.ParquetQuerySuite.afterEach(ParquetQuerySuite.scala:47)
   [info]   at 
org.scalatest.BeforeAndAfterEach.$anonfun$runTest$1(BeforeAndAfterEach.scala:247)
   [info]   at org.scalatest.Status.$anonfun$withAfterEffect$1(Status.scala:377)
   [info]   at 
org.scalatest.Status.$anonfun$withAfterEffect$1$adapted(Status.scala:373)
   [info]   at org.scalatest.SucceededStatus$.whenCompleted(Status.scala:462)
   [info]   at org.scalatest.Status.withAfterEffect(Status.scala:373)
   [info]   at org.scalatest.Status.withAfterEffect$(Status.scala:371)
   [info]   at org.scalatest.SucceededStatus$.withAfterEffect(Status.scala:434)
   [info]   at 
org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:246)
   [info]   at 
org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
   ```
   I am still trying to figure out whether these errors are caused by my changes


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: github-h...@

Re: [PR] feat: Encapsulate Parquet objects [datafusion-comet]

2025-06-23 Thread via GitHub


andygrove commented on code in PR #1920:
URL: https://github.com/apache/datafusion-comet/pull/1920#discussion_r2162206037


##
common/src/main/java/org/apache/comet/parquet/ColumnReader.java:
##
@@ -126,6 +126,13 @@ public void setPageReader(PageReader pageReader) throws 
IOException {
 }
   }
 
+  /** This method is called from Apache Iceberg. */
+  public void setRowGroupReader(RowGroupReader rowGroupReader, 
ParquetColumnSpec columnSpec)

Review Comment:
   It would be good to have some unit tests in Comet for the methods intended 
to be called from Iceberg, so that we catch any regressions in behavior. This 
could be added as a separate PR so that we don't slow down progress on the 
integration work.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] feat: Encapsulate Parquet objects [datafusion-comet]

2025-06-20 Thread via GitHub


huaxingao closed pull request #1920: feat: Encapsulate Parquet objects
URL: https://github.com/apache/datafusion-comet/pull/1920


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] feat: Encapsulate Parquet objects [datafusion-comet]

2025-06-20 Thread via GitHub


codecov-commenter commented on PR #1920:
URL: 
https://github.com/apache/datafusion-comet/pull/1920#issuecomment-2992325126

   ## 
[Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1920?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 Report
   Attention: Patch coverage is `14.89362%` with `80 lines` in your changes 
missing coverage. Please review.
   > Project coverage is 32.79%. Comparing base 
[(`f09f8af`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/f09f8af64c6599255e116a376f4f008f2fd63b43?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 to head 
[(`38f9c8c`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/38f9c8cd03230297125d6ad3cf6ee520a67fc6b5?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   > Report is 275 commits behind head on main.
   
   | [Files with missing 
lines](https://app.codecov.io/gh/apache/datafusion-comet/pull/1920?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 | Patch % | Lines |
   |---|---|---|
   | 
[...main/java/org/apache/comet/parquet/FileReader.java](https://app.codecov.io/gh/apache/datafusion-comet/pull/1920?src=pr&el=tree&filepath=common%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Fcomet%2Fparquet%2FFileReader.java&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-Y29tbW9uL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9jb21ldC9wYXJxdWV0L0ZpbGVSZWFkZXIuamF2YQ==)
 | 18.96% | [46 Missing and 1 partial :warning: 
](https://app.codecov.io/gh/apache/datafusion-comet/pull/1920?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 |
   | 
[.../src/main/java/org/apache/comet/parquet/Utils.java](https://app.codecov.io/gh/apache/datafusion-comet/pull/1920?src=pr&el=tree&filepath=common%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Fcomet%2Fparquet%2FUtils.java&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-Y29tbW9uL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9jb21ldC9wYXJxdWV0L1V0aWxzLmphdmE=)
 | 0.00% | [15 Missing :warning: 
](https://app.codecov.io/gh/apache/datafusion-comet/pull/1920?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 |
   | 
[...va/org/apache/comet/parquet/ParquetColumnSpec.java](https://app.codecov.io/gh/apache/datafusion-comet/pull/1920?src=pr&el=tree&filepath=common%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Fcomet%2Fparquet%2FParquetColumnSpec.java&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-Y29tbW9uL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9jb21ldC9wYXJxdWV0L1BhcnF1ZXRDb2x1bW5TcGVjLmphdmE=)
 | 0.00% | [14 Missing :warning: 
](https://app.codecov.io/gh/apache/datafusion-comet/pull/1920?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 |
   | 
[...in/java/org/apache/comet/parquet/ColumnReader.java](https://app.codecov.io/gh/apache/datafusion-comet/pull/1920?src=pr&el=tree&filepath=common%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Fcomet%2Fparquet%2FColumnReader.java&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-Y29tbW9uL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9jb21ldC9wYXJxdWV0L0NvbHVtblJlYWRlci5qYXZh)
 | 0.00% | [3 Missing :warning: 
](https://app.codecov.io/gh/apache/datafusion-comet/pull/1920?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 |
   | 
[.../java/org/apache/comet/parquet/RowGroupReader.java](https://app.codecov.io/gh/apache/datafusion-comet/pull/1920?src=pr&el=tree&filepath=common%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Fcomet%2Fparquet%2FRowGroupReader.java&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-Y29tbW9uL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9jb21ldC9wYXJxdWV0L1Jvd0dyb3VwUmVhZGVyLmphdmE=)
 | 75.00% | [1 Missing :warning: 
](https://app.codecov.io/gh/apache/datafusion-comet/pull/1920?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 |
   
   Additional details and impacted files
   
   
   ```diff
   @@  Coverage Diff  @@
   ##   main#1920   +/-   ##
   =
   - Coverage 56.12%   32.79%   -23.34% 
   + Complexity  976  806  -170 
   =
 Files   119  131   +12 
 Lines 1174312909 +1166 
 Branches   2251 2407  +156 
   

[PR] feat: Encapsulate Parquet objects [datafusion-comet]

2025-06-20 Thread via GitHub


huaxingao opened a new pull request, #1920:
URL: https://github.com/apache/datafusion-comet/pull/1920

   ## Which issue does this PR close?
   
   
   
   Closes #.
   
   ## Rationale for this change
   Iceberg shades Parquet. We can't pass Parquet objects from Iceberg to Comet. 
In order to get around this problem, this PR encapsulates the Parquet objects.
   
   Here is the summary of the changes:
   Iceberg call these APIs:
   ```
   public static ColumnReader getColumnReader(
  DataType type,
  ColumnDescriptor descriptor,
  CometSchemaImporter importer,
  int batchSize,
  boolean useDecimal128,
  boolean useLazyMaterialization) 
   
   ColumnReader.setPageReader(PageReader pageReader)
   ```
   In order to encapsulate `ColumnDescriptor` and `PageReader`, will add a 
`ParquetColumnSpec`, change the above two APIs to
   
   ```
   public static ColumnReader getColumnReader(
  DataType type,
  ParquetColumnSpec columnSpec,
  CometSchemaImporter importer,
  int batchSize,
  boolean useDecimal128,
  boolean useLazyMaterialization)
   // construct a ColumnDescriptor from ParquetColumnSpec
   
   setRowGroupReader(org.apache.comet.parquet.RowGroupReader rowGroupReader, 
ParquetColumnSpec columnSpec)
   // Will call PageReader pageReader = 
RowGroupReader.getPageReader(ColumnDescriptor)
   // setPageReader(pageReader);
   ```
   In order to call `setRowGroupReader(org.apache.comet.parquet.RowGroupReader 
rowGroupReader, ParquetColumnSpec columnSpec)`, in Iceberg side, will need to 
use Comet's `FileReader` instead of `ParquetFileReader`, so we will call 
`FileReader.readNextRowGroup()` to get a 
`org.apache.comet.parquet.RowGroupReader` instead Parquet's `PageReadStore`.
   
   `ParquetReadOption` can't be passed directly either, so the related info are 
passed and `ParquetReadOption` is built on Comet. 
   
   ## What changes are included in this PR?
   
   
   
   ## How are these changes tested?
   
   I did integration test on my local.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]