Re: [PR] feat: Encapsulate Parquet objects [datafusion-comet]
huaxingao commented on code in PR #1920:
URL: https://github.com/apache/datafusion-comet/pull/1920#discussion_r2167873990
##
common/src/main/java/org/apache/comet/parquet/FileReader.java:
##
@@ -128,6 +134,48 @@ public FileReader(InputFile file, ParquetReadOptions
options, ReadOptions cometO
this(file, null, options, cometOptions, null);
}
+ /** This constructor is called from Apache Iceberg. */
+ public FileReader(
+ Path path,
+ Configuration conf,
+ ReadOptions cometOptions,
+ Map properties,
+ Long start,
+ Long length,
+ byte[] fileEncryptionKey,
+ byte[] fileAADPrefix)
+ throws IOException {
+ParquetReadOptions options =
+buildParquetReadOptions(conf, properties, start, length,
fileEncryptionKey, fileAADPrefix);
+this.converter = new ParquetMetadataConverter(options);
+this.file = HadoopInputFile.fromPath(path, conf);
Review Comment:
I think we can use `CometInputFile`. Modified.
##
common/src/main/java/org/apache/comet/parquet/FileReader.java:
##
@@ -209,6 +257,55 @@ public void setRequestedSchema(List
projection) {
}
}
+ /** This method is called from Apache Iceberg. */
+ public void setRequestedSchemaFromSpecs(List specList) {
+paths.clear();
+for (ParquetColumnSpec colSpec : specList) {
+ ColumnDescriptor descriptor = Utils.buildColumnDescriptor(colSpec);
+ paths.put(ColumnPath.get(colSpec.getPath()), descriptor);
+}
+ }
+
+ private static ParquetReadOptions buildParquetReadOptions(
+ Configuration conf,
+ Map properties,
+ Long start,
+ Long length,
+ byte[] fileEncryptionKey,
+ byte[] fileAADPrefix) {
+
+Collection readPropertiesToRemove =
+Sets.newHashSet(
+"parquet.read.filter",
Review Comment:
Changed
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] feat: Encapsulate Parquet objects [datafusion-comet]
hsiang-c commented on code in PR #1920:
URL: https://github.com/apache/datafusion-comet/pull/1920#discussion_r2167782872
##
common/src/main/java/org/apache/comet/parquet/FileReader.java:
##
@@ -128,6 +134,48 @@ public FileReader(InputFile file, ParquetReadOptions
options, ReadOptions cometO
this(file, null, options, cometOptions, null);
}
+ /** This constructor is called from Apache Iceberg. */
+ public FileReader(
+ Path path,
+ Configuration conf,
+ ReadOptions cometOptions,
+ Map properties,
+ Long start,
+ Long length,
+ byte[] fileEncryptionKey,
+ byte[] fileAADPrefix)
+ throws IOException {
+ParquetReadOptions options =
+buildParquetReadOptions(conf, properties, start, length,
fileEncryptionKey, fileAADPrefix);
+this.converter = new ParquetMetadataConverter(options);
+this.file = HadoopInputFile.fromPath(path, conf);
Review Comment:
Can we use `CometInputFile` here?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] feat: Encapsulate Parquet objects [datafusion-comet]
hsiang-c commented on code in PR #1920:
URL: https://github.com/apache/datafusion-comet/pull/1920#discussion_r2167786141
##
common/src/main/java/org/apache/comet/parquet/FileReader.java:
##
@@ -209,6 +257,55 @@ public void setRequestedSchema(List
projection) {
}
}
+ /** This method is called from Apache Iceberg. */
+ public void setRequestedSchemaFromSpecs(List specList) {
+paths.clear();
+for (ParquetColumnSpec colSpec : specList) {
+ ColumnDescriptor descriptor = Utils.buildColumnDescriptor(colSpec);
+ paths.put(ColumnPath.get(colSpec.getPath()), descriptor);
+}
+ }
+
+ private static ParquetReadOptions buildParquetReadOptions(
+ Configuration conf,
+ Map properties,
+ Long start,
+ Long length,
+ byte[] fileEncryptionKey,
+ byte[] fileAADPrefix) {
+
+Collection readPropertiesToRemove =
+Sets.newHashSet(
+"parquet.read.filter",
Review Comment:
Could we use these constants since they are not in the method signature?
```java
ParquetInputFormat.UNBOUND_RECORD_FILTER,
ParquetInputFormat.FILTER_PREDICATE,
ParquetInputFormat.READ_SUPPORT_CLASS,
EncryptionPropertiesFactory.CRYPTO_FACTORY_CLASS_PROPERTY_NAME
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] feat: Encapsulate Parquet objects [datafusion-comet]
hsiang-c commented on code in PR #1920:
URL: https://github.com/apache/datafusion-comet/pull/1920#discussion_r2167782872
##
common/src/main/java/org/apache/comet/parquet/FileReader.java:
##
@@ -128,6 +134,48 @@ public FileReader(InputFile file, ParquetReadOptions
options, ReadOptions cometO
this(file, null, options, cometOptions, null);
}
+ /** This constructor is called from Apache Iceberg. */
+ public FileReader(
+ Path path,
+ Configuration conf,
+ ReadOptions cometOptions,
+ Map properties,
+ Long start,
+ Long length,
+ byte[] fileEncryptionKey,
+ byte[] fileAADPrefix)
+ throws IOException {
+ParquetReadOptions options =
+buildParquetReadOptions(conf, properties, start, length,
fileEncryptionKey, fileAADPrefix);
+this.converter = new ParquetMetadataConverter(options);
+this.file = HadoopInputFile.fromPath(path, conf);
Review Comment:
Can we use `CometInputFile` here? According to the doc
```java
/**
* A Parquet {@link InputFile} implementation that's similar to {@link
* org.apache.parquet.hadoop.util.HadoopInputFile}, but with optimizations
introduced in Hadoop 3.x,
* for S3 specifically.
*/
public class CometInputFile implements InputFile {
// omitted
}
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] feat: Encapsulate Parquet objects [datafusion-comet]
hsiang-c commented on code in PR #1920:
URL: https://github.com/apache/datafusion-comet/pull/1920#discussion_r2167786141
##
common/src/main/java/org/apache/comet/parquet/FileReader.java:
##
@@ -209,6 +257,55 @@ public void setRequestedSchema(List
projection) {
}
}
+ /** This method is called from Apache Iceberg. */
+ public void setRequestedSchemaFromSpecs(List specList) {
+paths.clear();
+for (ParquetColumnSpec colSpec : specList) {
+ ColumnDescriptor descriptor = Utils.buildColumnDescriptor(colSpec);
+ paths.put(ColumnPath.get(colSpec.getPath()), descriptor);
+}
+ }
+
+ private static ParquetReadOptions buildParquetReadOptions(
+ Configuration conf,
+ Map properties,
+ Long start,
+ Long length,
+ byte[] fileEncryptionKey,
+ byte[] fileAADPrefix) {
+
+Collection readPropertiesToRemove =
+Sets.newHashSet(
+"parquet.read.filter",
Review Comment:
Could we use constants such as `ParquetInputFormat.UNBOUND_RECORD_FILTER`
or doing so will again create the dependency we don't want?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] feat: Encapsulate Parquet objects [datafusion-comet]
hsiang-c commented on PR #1920: URL: https://github.com/apache/datafusion-comet/pull/1920#issuecomment-3006454989 > In my local copy of Iceberg, I updated SparkBatchQueryScan to implement SupportsComet. @andygrove You can apply the diff to Iceberg 1.8.1 for Comet support. I'll add Iceberg `1.9.1` diff soon. https://github.com/apache/datafusion-comet/blob/main/dev/diffs/iceberg/1.8.1.diff#L236 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] feat: Encapsulate Parquet objects [datafusion-comet]
andygrove commented on PR #1920: URL: https://github.com/apache/datafusion-comet/pull/1920#issuecomment-3006159227 I am now able to get this working end-to-end :tada: -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] feat: Encapsulate Parquet objects [datafusion-comet]
andygrove commented on PR #1920: URL: https://github.com/apache/datafusion-comet/pull/1920#issuecomment-3005044455 The `NoSuchMethodError` error was my mistake. I was building Comet for Spark 3.4 but using the jar for Spark 3.5 when testing. I no longer see any errors, but Comet is not accelerating the Iceberg scan: ``` COMET: Unsupported scan: org.apache.iceberg.spark.source.SparkBatchQueryScan. Comet Scan only supports Parquet and Iceberg Parquet file formats, BatchScan spark_catalog.default.t1 is not supported ``` ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] feat: Encapsulate Parquet objects [datafusion-comet]
andygrove commented on code in PR #1920:
URL: https://github.com/apache/datafusion-comet/pull/1920#discussion_r2167562231
##
common/pom.xml:
##
@@ -26,7 +26,7 @@ under the License.
org.apache.datafusion
comet-parent-spark${spark.version.short}_${scala.binary.version}
-0.9.0-SNAPSHOT
+0.9.0.1-SNAPSHOT
Review Comment:
@huaxingao I'm assuming that the version bump was not intentional?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] feat: Encapsulate Parquet objects [datafusion-comet]
snmvaughan commented on PR #1920: URL: https://github.com/apache/datafusion-comet/pull/1920#issuecomment-3005663410 I'm surprised we don't have a Comet interface which provides the access needed by Comet, in combination with an Iceberg implementation of that interface that wraps the Iceberg details. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] feat: Encapsulate Parquet objects [datafusion-comet]
andygrove commented on PR #1920: URL: https://github.com/apache/datafusion-comet/pull/1920#issuecomment-3005415617 In my local copy of Iceberg, I updated `SparkBatchQueryScan` to implement `SupportsComet`. When using both the Comet and Iceberg jars on the classpath, Comet is unable to recognize that `SparkBatchQueryScan` implements `SupportsComet`, and I think this is because there are two copies of the `SupportsComet` interface (one in the Comet jar and one in the Iceberg jar). If I just have the Iceberg jar on the classpath then Comet does try and accelerate the scan, but now I am running into Arrow shading issues. Spark 3.4.3 uses Arrow 11.0.0 Comet uses Arrow 18.3.0 Iceberg uses Arrow 15.0.2 It is not possible to shade the JNI classes in Arrow because the Java names have to match the function names in the native code, so this is quite challenging to resolve. We may need to look into using class loader isolation. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] feat: Encapsulate Parquet objects [datafusion-comet]
andygrove commented on PR #1920:
URL:
https://github.com/apache/datafusion-comet/pull/1920#issuecomment-3004952533
I added the following method to `FileReader` locally:
```scala
/** Sets the projected columns to be read later via {@link
#readNextRowGroup()} */
public void setRequestedSchemaFromSpecs(List
projection) {
paths.clear();
for (ParquetColumnSpec columnSpec : projection) {
ColumnDescriptor col = Utils.buildColumnDescriptor(columnSpec);
paths.put(ColumnPath.get(col.getPath()), col);
}
}
```
I can now compile Iceberg, but I get an exception at runtime, and I do not
yet understand why:
```
Welcome to
__
/ __/__ ___ _/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 3.4.3
/_/
Using Scala version 2.12.17 (OpenJDK 64-Bit Server VM, Java 11.0.27)
Type in expressions to have them evaluated.
Type :help for more information.
scala> spark.sql(s"CREATE TABLE IF NOT EXISTS t1 (c0 INT, c1 STRING) USING
iceberg")
25/06/25 08:12:29 INFO core/src/lib.rs: Comet native library version 0.9.0
initialized
25/06/25 08:12:29 WARN CometExecRule: Comet cannot execute some parts of
this plan natively (set spark.comet.explainFallback.enabled=false to disable
this logging):
CreateTable [COMET: CreateTable is not supported]
res0: org.apache.spark.sql.DataFrame = []
scala> spark.sql(s"INSERT INTO t1 VALUES ${(0 until 1).map(i => (i,
i)).mkString(",")}")
25/06/25 08:12:32 WARN CometExecRule: Comet cannot execute some parts of
this plan natively (set spark.comet.explainFallback.enabled=false to disable
this logging):
AppendData [COMET: AppendData is not supported]
+- LocalTableScan [COMET: LocalTableScan is not supported]
res1: org.apache.spark.sql.DataFrame = []
scala> spark.sql(s"SELECT * from t1").show()
25/06/25 08:12:35 WARN CometExecRule: Comet cannot execute some parts of
this plan natively (set spark.comet.explainFallback.enabled=false to disable
this logging):
CollectLimit [COMET: CollectLimit is not supported]
+- Project
+- BatchScan spark_catalog.default.t1 [COMET: Unsupported scan:
org.apache.iceberg.spark.source.SparkBatchQueryScan. Comet Scan only supports
Parquet and Iceberg Parquet file formats, BatchScan spark_catalog.default.t1 is
not supported]
25/06/25 08:12:35 WARN CheckAllocator: More than one
DefaultAllocationManager on classpath. Choosing first found
25/06/25 08:12:35 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 32)
java.lang.NoSuchMethodError: 'void
org.apache.comet.parquet.FileReader.setRequestedSchemaFromSpecs(java.util.List)'
at
org.apache.iceberg.parquet.CometVectorizedParquetReader$FileIterator.newCometReader(CometVectorizedParquetReader.java:222)
```
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] feat: Encapsulate Parquet objects [datafusion-comet]
andygrove commented on PR #1920: URL: https://github.com/apache/datafusion-comet/pull/1920#issuecomment-3004865649 I am testing this locally now. There is still one API call that references a Parquet class, causing Iceberg to fail to compile: ``` /home/andy/git/apache/iceberg/parquet/src/main/java/org/apache/iceberg/parquet/CometVectorizedParquetReader.java:222: error: cannot find symbol fileReader.setRequestedSchemaFromSpecs(specs); ``` This method signature is `public void setRequestedSchema(List projection)` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] feat: Encapsulate Parquet objects [datafusion-comet]
huaxingao commented on PR #1920: URL: https://github.com/apache/datafusion-comet/pull/1920#issuecomment-3002056893 cc @andygrove @parthchandra @hsiang-c Could you please review this PR? Thanks a lot! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] feat: Encapsulate Parquet objects [datafusion-comet]
huaxingao commented on PR #1920: URL: https://github.com/apache/datafusion-comet/pull/1920#issuecomment-3002050043 I have a draft iceberg [PR](https://github.com/apache/iceberg/pull/13378) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] feat: Encapsulate Parquet objects [datafusion-comet]
andygrove commented on code in PR #1920:
URL: https://github.com/apache/datafusion-comet/pull/1920#discussion_r2164143550
##
common/src/main/java/org/apache/comet/parquet/ColumnReader.java:
##
@@ -126,6 +126,13 @@ public void setPageReader(PageReader pageReader) throws
IOException {
}
}
+ /** This method is called from Apache Iceberg. */
+ public void setRowGroupReader(RowGroupReader rowGroupReader,
ParquetColumnSpec columnSpec)
Review Comment:
I filed https://github.com/apache/datafusion-comet/issues/1928
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] feat: Encapsulate Parquet objects [datafusion-comet]
huaxingao commented on PR #1920: URL: https://github.com/apache/datafusion-comet/pull/1920#issuecomment-2997669574 I somehow got some strange errors: ``` [info] ParquetV1QuerySuite: [info] - simple select queries (635 milliseconds) [info] - appending (254 milliseconds) [info] - overwriting (562 milliseconds) [info] - self-join (258 milliseconds) [info] - nested data - struct with array field (406 milliseconds) [info] - nested data - array of struct (390 milliseconds) [info] - SPARK-1913 regression: columns only referenced by pushed down filters should remain (363 milliseconds) [info] - SPARK-5309 strings stored using dictionary compression in parquet (821 milliseconds) [info] - SPARK-6917 DecimalType should work with non-native types (201 milliseconds) [info] - SPARK-10634 timestamp written and read as INT64 - truncation (331 milliseconds) [info] - SPARK-36182, SPARK-47368: writing and reading TimestampNTZType column (750 milliseconds) 17:56:39.487 ERROR org.apache.spark.executor.Executor: Exception in task 1.0 in stage 140.0 (TID 215) org.apache.spark.SparkException: Encountered error while reading file file:///__w/datafusion-comet/datafusion-comet/apache-spark/target/tmp/spark-8a25ab9e-6dec-41e3-8354-ceb405cbd9d5/part-1-22a1399a-9d0c-4972-bf97-e6bdffa1c9cd-c000.snappy.parquet. Details: at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotReadFilesError(QueryExecutionErrors.scala:864) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:296) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:131) at org.apache.spark.sql.comet.CometScanExec$$anon$1.hasNext(CometScanExec.scala:266) ``` ``` 17:57:05.557 WARN org.apache.spark.sql.execution.datasources.parquet.ParquetV1QuerySuite: = POSSIBLE THREAD LEAK IN SUITE o.a.s.sql.execution.datasources.parquet.ParquetV1QuerySuite, threads: QueryStageCreator-47 (daemon=true), QueryStageCreator-52 (daemon=true), QueryStageCreator-58 (daemon=true), QueryStageCreator-54 (daemon=true), comet-broadcast-exchange-130 (daemon=true), shuffle-boss-714-1 (daemon=true), QueryStageCreator-57 (daemon=true), QueryStageCreator-46 (daemon=true), QueryStageCreator-59 (daemon=true), rpc-boss-711-1 (daemon=true), QueryStageCreator-51 (daemon=true), QuerySta... [info] org.apache.spark.sql.execution.datasources.parquet.ParquetV1QuerySuite *** ABORTED *** (31 seconds, 172 milliseconds) [info] The code passed to eventually never returned normally. Attempted 15 times over 10.06552673901 seconds. Last failure message: There are 12 possibly leaked file streams.. (SharedSparkSession.scala:189) [info] org.scalatest.exceptions.TestFailedDueToTimeoutException: [info] at org.scalatest.enablers.Retrying$$anon$4.tryTryAgain$2(Retrying.scala:219) [info] at org.scalatest.enablers.Retrying$$anon$4.retry(Retrying.scala:226) [info] at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:313) [info] at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:312) [info] at org.apache.spark.sql.execution.datasources.parquet.ParquetQuerySuite.eventually(ParquetQuerySuite.scala:47) [info] at org.apache.spark.sql.test.SharedSparkSessionBase.afterEach(SharedSparkSession.scala:189) [info] at org.apache.spark.sql.test.SharedSparkSessionBase.afterEach$(SharedSparkSession.scala:183) [info] at org.apache.spark.sql.execution.datasources.parquet.ParquetQuerySuite.afterEach(ParquetQuerySuite.scala:47) [info] at org.scalatest.BeforeAndAfterEach.$anonfun$runTest$1(BeforeAndAfterEach.scala:247) [info] at org.scalatest.Status.$anonfun$withAfterEffect$1(Status.scala:377) [info] at org.scalatest.Status.$anonfun$withAfterEffect$1$adapted(Status.scala:373) [info] at org.scalatest.SucceededStatus$.whenCompleted(Status.scala:462) [info] at org.scalatest.Status.withAfterEffect(Status.scala:373) [info] at org.scalatest.Status.withAfterEffect$(Status.scala:371) [info] at org.scalatest.SucceededStatus$.withAfterEffect(Status.scala:434) [info] at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:246) [info] at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227) ``` I am still trying to figure out whether these errors are caused by my changes -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: github-h...@
Re: [PR] feat: Encapsulate Parquet objects [datafusion-comet]
andygrove commented on code in PR #1920:
URL: https://github.com/apache/datafusion-comet/pull/1920#discussion_r2162206037
##
common/src/main/java/org/apache/comet/parquet/ColumnReader.java:
##
@@ -126,6 +126,13 @@ public void setPageReader(PageReader pageReader) throws
IOException {
}
}
+ /** This method is called from Apache Iceberg. */
+ public void setRowGroupReader(RowGroupReader rowGroupReader,
ParquetColumnSpec columnSpec)
Review Comment:
It would be good to have some unit tests in Comet for the methods intended
to be called from Iceberg, so that we catch any regressions in behavior. This
could be added as a separate PR so that we don't slow down progress on the
integration work.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] feat: Encapsulate Parquet objects [datafusion-comet]
huaxingao closed pull request #1920: feat: Encapsulate Parquet objects URL: https://github.com/apache/datafusion-comet/pull/1920 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] feat: Encapsulate Parquet objects [datafusion-comet]
codecov-commenter commented on PR #1920: URL: https://github.com/apache/datafusion-comet/pull/1920#issuecomment-2992325126 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1920?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) Report Attention: Patch coverage is `14.89362%` with `80 lines` in your changes missing coverage. Please review. > Project coverage is 32.79%. Comparing base [(`f09f8af`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/f09f8af64c6599255e116a376f4f008f2fd63b43?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) to head [(`38f9c8c`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/38f9c8cd03230297125d6ad3cf6ee520a67fc6b5?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache). > Report is 275 commits behind head on main. | [Files with missing lines](https://app.codecov.io/gh/apache/datafusion-comet/pull/1920?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | Patch % | Lines | |---|---|---| | [...main/java/org/apache/comet/parquet/FileReader.java](https://app.codecov.io/gh/apache/datafusion-comet/pull/1920?src=pr&el=tree&filepath=common%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Fcomet%2Fparquet%2FFileReader.java&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-Y29tbW9uL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9jb21ldC9wYXJxdWV0L0ZpbGVSZWFkZXIuamF2YQ==) | 18.96% | [46 Missing and 1 partial :warning: ](https://app.codecov.io/gh/apache/datafusion-comet/pull/1920?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | | [.../src/main/java/org/apache/comet/parquet/Utils.java](https://app.codecov.io/gh/apache/datafusion-comet/pull/1920?src=pr&el=tree&filepath=common%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Fcomet%2Fparquet%2FUtils.java&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-Y29tbW9uL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9jb21ldC9wYXJxdWV0L1V0aWxzLmphdmE=) | 0.00% | [15 Missing :warning: ](https://app.codecov.io/gh/apache/datafusion-comet/pull/1920?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | | [...va/org/apache/comet/parquet/ParquetColumnSpec.java](https://app.codecov.io/gh/apache/datafusion-comet/pull/1920?src=pr&el=tree&filepath=common%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Fcomet%2Fparquet%2FParquetColumnSpec.java&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-Y29tbW9uL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9jb21ldC9wYXJxdWV0L1BhcnF1ZXRDb2x1bW5TcGVjLmphdmE=) | 0.00% | [14 Missing :warning: ](https://app.codecov.io/gh/apache/datafusion-comet/pull/1920?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | | [...in/java/org/apache/comet/parquet/ColumnReader.java](https://app.codecov.io/gh/apache/datafusion-comet/pull/1920?src=pr&el=tree&filepath=common%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Fcomet%2Fparquet%2FColumnReader.java&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-Y29tbW9uL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9jb21ldC9wYXJxdWV0L0NvbHVtblJlYWRlci5qYXZh) | 0.00% | [3 Missing :warning: ](https://app.codecov.io/gh/apache/datafusion-comet/pull/1920?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | | [.../java/org/apache/comet/parquet/RowGroupReader.java](https://app.codecov.io/gh/apache/datafusion-comet/pull/1920?src=pr&el=tree&filepath=common%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Fcomet%2Fparquet%2FRowGroupReader.java&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-Y29tbW9uL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9jb21ldC9wYXJxdWV0L1Jvd0dyb3VwUmVhZGVyLmphdmE=) | 75.00% | [1 Missing :warning: ](https://app.codecov.io/gh/apache/datafusion-comet/pull/1920?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | Additional details and impacted files ```diff @@ Coverage Diff @@ ## main#1920 +/- ## = - Coverage 56.12% 32.79% -23.34% + Complexity 976 806 -170 = Files 119 131 +12 Lines 1174312909 +1166 Branches 2251 2407 +156
[PR] feat: Encapsulate Parquet objects [datafusion-comet]
huaxingao opened a new pull request, #1920: URL: https://github.com/apache/datafusion-comet/pull/1920 ## Which issue does this PR close? Closes #. ## Rationale for this change Iceberg shades Parquet. We can't pass Parquet objects from Iceberg to Comet. In order to get around this problem, this PR encapsulates the Parquet objects. Here is the summary of the changes: Iceberg call these APIs: ``` public static ColumnReader getColumnReader( DataType type, ColumnDescriptor descriptor, CometSchemaImporter importer, int batchSize, boolean useDecimal128, boolean useLazyMaterialization) ColumnReader.setPageReader(PageReader pageReader) ``` In order to encapsulate `ColumnDescriptor` and `PageReader`, will add a `ParquetColumnSpec`, change the above two APIs to ``` public static ColumnReader getColumnReader( DataType type, ParquetColumnSpec columnSpec, CometSchemaImporter importer, int batchSize, boolean useDecimal128, boolean useLazyMaterialization) // construct a ColumnDescriptor from ParquetColumnSpec setRowGroupReader(org.apache.comet.parquet.RowGroupReader rowGroupReader, ParquetColumnSpec columnSpec) // Will call PageReader pageReader = RowGroupReader.getPageReader(ColumnDescriptor) // setPageReader(pageReader); ``` In order to call `setRowGroupReader(org.apache.comet.parquet.RowGroupReader rowGroupReader, ParquetColumnSpec columnSpec)`, in Iceberg side, will need to use Comet's `FileReader` instead of `ParquetFileReader`, so we will call `FileReader.readNextRowGroup()` to get a `org.apache.comet.parquet.RowGroupReader` instead Parquet's `PageReadStore`. `ParquetReadOption` can't be passed directly either, so the related info are passed and `ParquetReadOption` is built on Comet. ## What changes are included in this PR? ## How are these changes tested? I did integration test on my local. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
