Re: [PR] [VL] parquet file metadata columns support in velox [incubator-gluten]
GlutenPerfBot commented on PR #3870: URL: https://github.com/apache/incubator-gluten/pull/3870#issuecomment-1996831371 = Performance report for TPCH SF2000 with Velox backend, for reference only query log/native_3870_time.csv log/native_master_03_13_2024_d7ed0844e_time.csv difference percentage q1 35.41 38.81 3.399 109.60% q2 25.71 24.06 -1.644 93.60% q3 37.01 38.18 1.169 103.16% q4 40.14 38.49 -1.657 95.87% q5 68.03 69.71 1.671 102.46% q6 7.43 7.45 0.026 100.35% q7 83.15 82.46 -0.695 99.16% q8 85.00 83.21 -1.785 97.90% q9 122.38 121.83 -0.549 99.55% q10 44.07 44.36 0.294 100.67% q11 19.80 20.90 1.101 105.56% q12 26.65 28.06 1.413 105.30% q13 48.44 46.88 -1.561 96.78% q14 22.24 21.98 -0.259 98.83% q15 32.04 33.08 1.046 103.26% q16 14.72 13.84 -0.883 94.00% q17 100.17 101.80 1.629 101.63% q18 142.74 141.05 -1.683 98.82% q19 13.54 15.07 1.529 111.30%
Re: [PR] [VL] parquet file metadata columns support in velox [incubator-gluten]
zhli1142015 merged PR #3870: URL: https://github.com/apache/incubator-gluten/pull/3870 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For additional commands, e-mail: commits-h...@gluten.apache.org
Re: [PR] [VL] parquet file metadata columns support in velox [incubator-gluten]
gaoyangxiaozhu commented on PR #3870: URL: https://github.com/apache/incubator-gluten/pull/3870#issuecomment-1996582788 > @gaoyangxiaozhu seems there is some conflicts and you need a rebase. Thanks. done -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For additional commands, e-mail: commits-h...@gluten.apache.org
Re: [PR] [VL] parquet file metadata columns support in velox [incubator-gluten]
github-actions[bot] commented on PR #3870: URL: https://github.com/apache/incubator-gluten/pull/3870#issuecomment-1996273415 Run Gluten Clickhouse CI -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For additional commands, e-mail: commits-h...@gluten.apache.org
Re: [PR] [VL] parquet file metadata columns support in velox [incubator-gluten]
yma11 commented on PR #3870: URL: https://github.com/apache/incubator-gluten/pull/3870#issuecomment-1996167070 @gaoyangxiaozhu seems there is some conflicts and you need a rebase. Thanks. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For additional commands, e-mail: commits-h...@gluten.apache.org
Re: [PR] [VL] parquet file metadata columns support in velox [incubator-gluten]
zhouyuan commented on PR #3870: URL: https://github.com/apache/incubator-gluten/pull/3870#issuecomment-1996119904 @gaoyangxiaozhu This is in good state to me. Will try with internal delta lake jenkins job CC @zzcclp as this will also change the API for CK backend -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For additional commands, e-mail: commits-h...@gluten.apache.org
Re: [PR] [VL] parquet file metadata columns support in velox [incubator-gluten]
gaoyangxiaozhu commented on PR #3870: URL: https://github.com/apache/incubator-gluten/pull/3870#issuecomment-1994440146 can we merge the PR ? @zhli1142015 / @yma11 / @zhouyuan -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For additional commands, e-mail: commits-h...@gluten.apache.org
Re: [PR] [VL] parquet file metadata columns support in velox [incubator-gluten]
github-actions[bot] commented on PR #3870: URL: https://github.com/apache/incubator-gluten/pull/3870#issuecomment-1993960733 Run Gluten Clickhouse CI -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For additional commands, e-mail: commits-h...@gluten.apache.org
Re: [PR] [VL] parquet file metadata columns support in velox [incubator-gluten]
gaoyangxiaozhu commented on PR #3870: URL: https://github.com/apache/incubator-gluten/pull/3870#issuecomment-1993655548 can we re-trigger again for failed job which all due to below ![image](https://github.com/apache/incubator-gluten/assets/9858245/ccac3728-f0d2-41f1-89be-85b40758518f) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For additional commands, e-mail: commits-h...@gluten.apache.org
Re: [PR] [VL] parquet file metadata columns support in velox [incubator-gluten]
github-actions[bot] commented on PR #3870: URL: https://github.com/apache/incubator-gluten/pull/3870#issuecomment-1993211907 Run Gluten Clickhouse CI -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For additional commands, e-mail: commits-h...@gluten.apache.org
Re: [PR] [VL] parquet file metadata columns support in velox [incubator-gluten]
gaoyangxiaozhu commented on PR #3870: URL: https://github.com/apache/incubator-gluten/pull/3870#issuecomment-1992105649 various build issue , can you help re-trigger @yma11 / @zhouyuan / @zhli1142015 ![image](https://github.com/apache/incubator-gluten/assets/9858245/9138a799-b008-4fa5-9600-43b2a0b5ebc6) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For additional commands, e-mail: commits-h...@gluten.apache.org
Re: [PR] [VL] parquet file metadata columns support in velox [incubator-gluten]
github-actions[bot] commented on PR #3870: URL: https://github.com/apache/incubator-gluten/pull/3870#issuecomment-1992018595 Run Gluten Clickhouse CI -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For additional commands, e-mail: commits-h...@gluten.apache.org
Re: [PR] [VL] parquet file metadata columns support in velox [incubator-gluten]
yma11 commented on code in PR #3870: URL: https://github.com/apache/incubator-gluten/pull/3870#discussion_r1521381033 ## cpp/velox/substrait/SubstraitParser.cc: ## @@ -104,31 +104,41 @@ std::vector SubstraitParser::parseNamedStruct(const ::substrait::NamedS return typeList; } -std::vector SubstraitParser::parsePartitionColumns(const ::substrait::NamedStruct& namedStruct) { +void SubstraitParser::parsePartitionAndMetadataColumns( +const ::substrait::NamedStruct& namedStruct, +std::vector& isPartitionColumns, +std::vector& isMetadataColumns) { const auto& columnsTypes = namedStruct.column_types(); - std::vector isPartitionColumns; if (columnsTypes.size() == 0) { -// Regard all columns as non-partitioned columns. +// Regard all columns as regular columns. Review Comment: I think so but we'd better confirm first. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For additional commands, e-mail: commits-h...@gluten.apache.org
Re: [PR] [VL] parquet file metadata columns support in velox [incubator-gluten]
github-actions[bot] commented on PR #3870: URL: https://github.com/apache/incubator-gluten/pull/3870#issuecomment-1991460438 Run Gluten Clickhouse CI -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For additional commands, e-mail: commits-h...@gluten.apache.org
Re: [PR] [VL] parquet file metadata columns support in velox [incubator-gluten]
gaoyangxiaozhu commented on code in PR #3870: URL: https://github.com/apache/incubator-gluten/pull/3870#discussion_r1521217320 ## cpp/velox/substrait/SubstraitParser.cc: ## @@ -104,31 +104,41 @@ std::vector SubstraitParser::parseNamedStruct(const ::substrait::NamedS return typeList; } -std::vector SubstraitParser::parsePartitionColumns(const ::substrait::NamedStruct& namedStruct) { +void SubstraitParser::parsePartitionAndMetadataColumns( +const ::substrait::NamedStruct& namedStruct, +std::vector& isPartitionColumns, +std::vector& isMetadataColumns) { const auto& columnsTypes = namedStruct.column_types(); - std::vector isPartitionColumns; if (columnsTypes.size() == 0) { -// Regard all columns as non-partitioned columns. +// Regard all columns as regular columns. Review Comment: so if is true for write then we only process if `columnsTypes.size() > 0` otherwise return directly? ## cpp/velox/substrait/SubstraitParser.cc: ## @@ -104,31 +104,41 @@ std::vector SubstraitParser::parseNamedStruct(const ::substrait::NamedS return typeList; } -std::vector SubstraitParser::parsePartitionColumns(const ::substrait::NamedStruct& namedStruct) { +void SubstraitParser::parsePartitionAndMetadataColumns( +const ::substrait::NamedStruct& namedStruct, +std::vector& isPartitionColumns, +std::vector& isMetadataColumns) { const auto& columnsTypes = namedStruct.column_types(); - std::vector isPartitionColumns; if (columnsTypes.size() == 0) { -// Regard all columns as non-partitioned columns. +// Regard all columns as regular columns. Review Comment: so if is true for write then we only process if `columnsTypes.size() > 0` otherwise do nothing? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For additional commands, e-mail: commits-h...@gluten.apache.org
Re: [PR] [VL] parquet file metadata columns support in velox [incubator-gluten]
gaoyangxiaozhu commented on code in PR #3870: URL: https://github.com/apache/incubator-gluten/pull/3870#discussion_r1521217320 ## cpp/velox/substrait/SubstraitParser.cc: ## @@ -104,31 +104,41 @@ std::vector SubstraitParser::parseNamedStruct(const ::substrait::NamedS return typeList; } -std::vector SubstraitParser::parsePartitionColumns(const ::substrait::NamedStruct& namedStruct) { +void SubstraitParser::parsePartitionAndMetadataColumns( +const ::substrait::NamedStruct& namedStruct, +std::vector& isPartitionColumns, +std::vector& isMetadataColumns) { const auto& columnsTypes = namedStruct.column_types(); - std::vector isPartitionColumns; if (columnsTypes.size() == 0) { -// Regard all columns as non-partitioned columns. +// Regard all columns as regular columns. Review Comment: so if is true for write then we only process under `columnsTypes.size() > 0` ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For additional commands, e-mail: commits-h...@gluten.apache.org
Re: [PR] [VL] parquet file metadata columns support in velox [incubator-gluten]
github-actions[bot] commented on PR #3870: URL: https://github.com/apache/incubator-gluten/pull/3870#issuecomment-1991300996 Run Gluten Clickhouse CI -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For additional commands, e-mail: commits-h...@gluten.apache.org
Re: [PR] [VL] parquet file metadata columns support in velox [incubator-gluten]
gaoyangxiaozhu commented on code in PR #3870: URL: https://github.com/apache/incubator-gluten/pull/3870#discussion_r1521209340 ## gluten-core/src/main/scala/io/glutenproject/execution/BasicScanExecTransformer.scala: ## @@ -37,12 +37,15 @@ import com.google.protobuf.StringValue import scala.collection.JavaConverters._ trait BasicScanExecTransformer extends LeafTransformSupport with BaseDataSource { + import org.apache.spark.sql.catalyst.util._ /** Returns the filters that can be pushed down to native file scan */ def filterExprs(): Seq[Expression] def outputAttributes(): Seq[Attribute] + def getMetadataColumns(): Seq[AttributeReference] Review Comment: this is for compatible spark33 purpose which already defineds `metadataColumns` fields https://github.com/apache/spark/blob/e98872fb5d07d570e6d0516b49a5d2e58876d1a6/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L208C2-L210C1 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For additional commands, e-mail: commits-h...@gluten.apache.org
Re: [PR] [VL] parquet file metadata columns support in velox [incubator-gluten]
yma11 commented on code in PR #3870: URL: https://github.com/apache/incubator-gluten/pull/3870#discussion_r1521186042 ## gluten-ut/spark33/src/test/scala/org/apache/spark/sql/execution/datasources/GlutenFileMetadataStructSuite.scala: ## @@ -16,6 +16,156 @@ */ package org.apache.spark.sql.execution.datasources +import io.glutenproject.execution.FilterExecTransformerBase +import io.glutenproject.utils.BackendTestUtils + +import org.apache.spark.sql.{Column, DataFrame, Row} import org.apache.spark.sql.GlutenSQLTestsBaseTrait +import org.apache.spark.sql.GlutenTestConstants.GLUTEN_TEST +import org.apache.spark.sql.execution.FileSourceScanExec +import org.apache.spark.sql.types.{IntegerType, LongType, StringType, StructField, StructType} + +import java.io.File +import java.sql.Timestamp + +class GlutenFileMetadataStructSuite extends FileMetadataStructSuite with GlutenSQLTestsBaseTrait { + + val schemaWithFilePathField: StructType = new StructType() +.add(StructField("file_path", StringType)) +.add(StructField("age", IntegerType)) +.add( + StructField( +"info", +new StructType() + .add(StructField("id", LongType)) + .add(StructField("university", StringType + + private val METADATA_FILE_PATH = "_metadata.file_path" + private val METADATA_FILE_NAME = "_metadata.file_name" + private val METADATA_FILE_SIZE = "_metadata.file_size" + private val METADATA_FILE_MODIFICATION_TIME = "_metadata.file_modification_time" + + private def getMetadataForFile(f: File): Map[String, Any] = { +Map( + METADATA_FILE_PATH -> f.toURI.toString, + METADATA_FILE_NAME -> f.getName, + METADATA_FILE_SIZE -> f.length(), + METADATA_FILE_MODIFICATION_TIME -> new Timestamp(f.lastModified()) +) + } + + private def metadataColumnsNativeTest(testName: String, fileSchema: StructType)( + f: (DataFrame, Map[String, Any], Map[String, Any]) => Unit): Unit = { +Seq("parquet").foreach { + testFileFormat => +test(s"$GLUTEN_TEST metadata struct ($testFileFormat): " + testName) { + withTempDir { +dir => + import scala.collection.JavaConverters._ + + // 1. create df0 and df1 and save under /data/f0 and /data/f1 + val df0 = spark.createDataFrame(data0.asJava, fileSchema) + val f0 = new File(dir, "data/f0").getCanonicalPath + df0.coalesce(1).write.format(testFileFormat).save(f0) + + val df1 = spark.createDataFrame(data1.asJava, fileSchema) + val f1 = new File(dir, "data/f1 gluten").getCanonicalPath + df1.coalesce(1).write.format(testFileFormat).save(f1) + + // 2. read both f0 and f1 + val df = spark.read +.format(testFileFormat) +.schema(fileSchema) +.load(new File(dir, "data").getCanonicalPath + "/*") + val realF0 = new File(dir, "data/f0") +.listFiles() +.filter(_.getName.endsWith(s".$testFileFormat")) +.head + val realF1 = new File(dir, "data/f1 gluten") +.listFiles() +.filter(_.getName.endsWith(s".$testFileFormat")) +.head + f(df, getMetadataForFile(realF0), getMetadataForFile(realF1)) + } +} +} + } + + metadataColumnsNativeTest( +"plan check with metadata and user data select", +schemaWithFilePathField) { +(df, f0, f1) => + var dfWithMetadata = df.select( +METADATA_FILE_NAME, +METADATA_FILE_PATH, +METADATA_FILE_SIZE, +METADATA_FILE_MODIFICATION_TIME, +"age") + dfWithMetadata.collect + var fileScan = dfWithMetadata.queryExecution.executedPlan.collect { +case f: FileSourceScanExec => f + } + assert(fileScan.size == 1) + if (BackendTestUtils.isVeloxBackendLoaded()) { +assert(fileScan(0).nodeNamePrefix == "NativeFile") + } else { +assert(fileScan(0).nodeNamePrefix == "File") + } + + // would fallback + dfWithMetadata = df.select(METADATA_FILE_PATH, "file_path") + checkAnswer( +dfWithMetadata, +Seq( + Row(f0(METADATA_FILE_PATH), "jack"), + Row(f1(METADATA_FILE_PATH), "lily") +) + ) + fileScan = dfWithMetadata.queryExecution.executedPlan.collect { +case f: FileSourceScanExec => f + } + assert(fileScan(0).nodeNamePrefix == "File") + } + + metadataColumnsNativeTest("plan check with metadata filter", schemaWithFilePathField) { +(df, f0, f1) => + var filterDF = df +.select("file_path", "age", METADATA_FILE_NAME) +.where(Column(METADATA_FILE_NAME) === f0((METADATA_FILE_NAME))) + val ret = filterDF.collect + assert(ret.size == 1) + var fileScan = filterDF.queryExecution.executedPlan.collect { +case f: FileSourceScanExec => f +
Re: [PR] [VL] parquet file metadata columns support in velox [incubator-gluten]
yma11 commented on code in PR #3870: URL: https://github.com/apache/incubator-gluten/pull/3870#discussion_r1521185608 ## cpp/velox/substrait/SubstraitParser.cc: ## @@ -104,31 +104,41 @@ std::vector SubstraitParser::parseNamedStruct(const ::substrait::NamedS return typeList; } -std::vector SubstraitParser::parsePartitionColumns(const ::substrait::NamedStruct& namedStruct) { +void SubstraitParser::parsePartitionAndMetadataColumns( +const ::substrait::NamedStruct& namedStruct, +std::vector& isPartitionColumns, +std::vector& isMetadataColumns) { const auto& columnsTypes = namedStruct.column_types(); - std::vector isPartitionColumns; if (columnsTypes.size() == 0) { -// Regard all columns as non-partitioned columns. +// Regard all columns as regular columns. Review Comment: @JkSelf will it happen for parquet write? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For additional commands, e-mail: commits-h...@gluten.apache.org
Re: [PR] [VL] parquet file metadata columns support in velox [incubator-gluten]
yma11 commented on code in PR #3870: URL: https://github.com/apache/incubator-gluten/pull/3870#discussion_r1521062380 ## cpp/velox/compute/WholeStageResultIterator.cc: ## @@ -145,15 +145,31 @@ WholeStageResultIterator::WholeStageResultIterator( const auto& lengths = scanInfo->lengths; const auto& format = scanInfo->format; const auto& partitionColumns = scanInfo->partitionColumns; +const auto& metadataColumns = scanInfo->metadataColumns; std::vector> connectorSplits; connectorSplits.reserve(paths.size()); for (int idx = 0; idx < paths.size(); idx++) { auto partitionColumn = partitionColumns[idx]; + auto metadataColumn = metadataColumns[idx]; std::unordered_map> partitionKeys; constructPartitionColumns(partitionKeys, partitionColumn); + const std::unordered_map& customSplitInfo = {}; + const std::shared_ptr& extraFileInfo = {}; + const std::unordered_map& serdeParameters = {}; Review Comment: Do you plan to fill these variables in late PR? if not, I think we can avoid creating them but just use `{}` in the constructor. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For additional commands, e-mail: commits-h...@gluten.apache.org
Re: [PR] [VL] parquet file metadata columns support in velox [incubator-gluten]
gaoyangxiaozhu commented on code in PR #3870: URL: https://github.com/apache/incubator-gluten/pull/3870#discussion_r1521142100 ## cpp/velox/substrait/SubstraitParser.cc: ## @@ -104,31 +104,41 @@ std::vector SubstraitParser::parseNamedStruct(const ::substrait::NamedS return typeList; } -std::vector SubstraitParser::parsePartitionColumns(const ::substrait::NamedStruct& namedStruct) { +void SubstraitParser::parsePartitionAndMetadataColumns( +const ::substrait::NamedStruct& namedStruct, +std::vector& isPartitionColumns, +std::vector& isMetadataColumns) { const auto& columnsTypes = namedStruct.column_types(); - std::vector isPartitionColumns; if (columnsTypes.size() == 0) { -// Regard all columns as non-partitioned columns. +// Regard all columns as regular columns. Review Comment: for read node it should be true, but not sure for write node if `namedStruct.names()` also is 0 size. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For additional commands, e-mail: commits-h...@gluten.apache.org
Re: [PR] [VL] parquet file metadata columns support in velox [incubator-gluten]
yma11 commented on code in PR #3870: URL: https://github.com/apache/incubator-gluten/pull/3870#discussion_r1521135199 ## shims/spark34/src/main/scala/io/glutenproject/sql/shims/spark34/Spark34Shims.scala: ## @@ -153,6 +157,34 @@ class Spark34Shims extends SparkShims { } } + override def generateMetadataColumns( + file: PartitionedFile, + metadataColumnNames: Seq[String]): JMap[String, String] = { +val metadataColumn = new JHashMap[String, String]() +val path = new Path(file.filePath.toString) +for (columnName <- metadataColumnNames) { + columnName match { +case FileFormat.FILE_PATH => metadataColumn.put(FileFormat.FILE_PATH, path.toString) +case FileFormat.FILE_NAME => metadataColumn.put(FileFormat.FILE_NAME, path.getName) +case FileFormat.FILE_SIZE => + metadataColumn.put(FileFormat.FILE_SIZE, file.fileSize.toString) +case FileFormat.FILE_MODIFICATION_TIME => + val fileModifyTime = TimestampFormatter +.getFractionFormatter(ZoneOffset.UTC) +.format(file.modificationTime * 1000L) + metadataColumn.put(FileFormat.FILE_MODIFICATION_TIME, fileModifyTime) +case FileFormat.FILE_BLOCK_START => + metadataColumn.put(FileFormat.FILE_BLOCK_START, file.start.toString) +case FileFormat.FILE_BLOCK_LENGTH => + metadataColumn.put(FileFormat.FILE_BLOCK_LENGTH, file.length.toString) +case _ => + } +} + +// TODO row_index metadata support Review Comment: `TODO: ` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For additional commands, e-mail: commits-h...@gluten.apache.org
Re: [PR] [VL] parquet file metadata columns support in velox [incubator-gluten]
zhouyuan commented on PR #3870: URL: https://github.com/apache/incubator-gluten/pull/3870#issuecomment-1991116962 Quick note: This patch adds the native impl for parquet metadata read support which is a requirement from delta lake connector. The general idea is to add a metadata list param for filescan/batchscan which will be filled in Velox native scan. Thanks, -yuan -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For additional commands, e-mail: commits-h...@gluten.apache.org
Re: [PR] [VL] parquet file metadata columns support in velox [incubator-gluten]
yma11 commented on code in PR #3870: URL: https://github.com/apache/incubator-gluten/pull/3870#discussion_r1521102914 ## gluten-core/src/main/scala/io/glutenproject/execution/BasicScanExecTransformer.scala: ## @@ -37,12 +37,15 @@ import com.google.protobuf.StringValue import scala.collection.JavaConverters._ trait BasicScanExecTransformer extends LeafTransformSupport with BaseDataSource { + import org.apache.spark.sql.catalyst.util._ /** Returns the filters that can be pushed down to native file scan */ def filterExprs(): Seq[Expression] def outputAttributes(): Seq[Attribute] + def getMetadataColumns(): Seq[AttributeReference] Review Comment: rename to `metadataColums()`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For additional commands, e-mail: commits-h...@gluten.apache.org
Re: [PR] [VL] parquet file metadata columns support in velox [incubator-gluten]
zhouyuan commented on code in PR #3870: URL: https://github.com/apache/incubator-gluten/pull/3870#discussion_r1521101767 ## gluten-ut/spark33/src/test/scala/org/apache/spark/sql/execution/datasources/GlutenFileMetadataStructSuite.scala: ## @@ -16,6 +16,156 @@ */ package org.apache.spark.sql.execution.datasources +import io.glutenproject.execution.FilterExecTransformerBase +import io.glutenproject.utils.BackendTestUtils + +import org.apache.spark.sql.{Column, DataFrame, Row} import org.apache.spark.sql.GlutenSQLTestsBaseTrait +import org.apache.spark.sql.GlutenTestConstants.GLUTEN_TEST +import org.apache.spark.sql.execution.FileSourceScanExec +import org.apache.spark.sql.types.{IntegerType, LongType, StringType, StructField, StructType} + +import java.io.File +import java.sql.Timestamp + +class GlutenFileMetadataStructSuite extends FileMetadataStructSuite with GlutenSQLTestsBaseTrait { + + val schemaWithFilePathField: StructType = new StructType() +.add(StructField("file_path", StringType)) +.add(StructField("age", IntegerType)) +.add( + StructField( +"info", +new StructType() + .add(StructField("id", LongType)) + .add(StructField("university", StringType + + private val METADATA_FILE_PATH = "_metadata.file_path" + private val METADATA_FILE_NAME = "_metadata.file_name" + private val METADATA_FILE_SIZE = "_metadata.file_size" + private val METADATA_FILE_MODIFICATION_TIME = "_metadata.file_modification_time" + + private def getMetadataForFile(f: File): Map[String, Any] = { +Map( + METADATA_FILE_PATH -> f.toURI.toString, + METADATA_FILE_NAME -> f.getName, + METADATA_FILE_SIZE -> f.length(), + METADATA_FILE_MODIFICATION_TIME -> new Timestamp(f.lastModified()) +) + } + + private def metadataColumnsNativeTest(testName: String, fileSchema: StructType)( + f: (DataFrame, Map[String, Any], Map[String, Any]) => Unit): Unit = { +Seq("parquet").foreach { + testFileFormat => +test(s"$GLUTEN_TEST metadata struct ($testFileFormat): " + testName) { + withTempDir { +dir => + import scala.collection.JavaConverters._ + + // 1. create df0 and df1 and save under /data/f0 and /data/f1 + val df0 = spark.createDataFrame(data0.asJava, fileSchema) + val f0 = new File(dir, "data/f0").getCanonicalPath + df0.coalesce(1).write.format(testFileFormat).save(f0) + + val df1 = spark.createDataFrame(data1.asJava, fileSchema) + val f1 = new File(dir, "data/f1 gluten").getCanonicalPath + df1.coalesce(1).write.format(testFileFormat).save(f1) + + // 2. read both f0 and f1 + val df = spark.read +.format(testFileFormat) +.schema(fileSchema) +.load(new File(dir, "data").getCanonicalPath + "/*") + val realF0 = new File(dir, "data/f0") +.listFiles() +.filter(_.getName.endsWith(s".$testFileFormat")) +.head + val realF1 = new File(dir, "data/f1 gluten") +.listFiles() +.filter(_.getName.endsWith(s".$testFileFormat")) +.head + f(df, getMetadataForFile(realF0), getMetadataForFile(realF1)) + } +} +} + } + + metadataColumnsNativeTest( +"plan check with metadata and user data select", +schemaWithFilePathField) { +(df, f0, f1) => + var dfWithMetadata = df.select( +METADATA_FILE_NAME, +METADATA_FILE_PATH, +METADATA_FILE_SIZE, +METADATA_FILE_MODIFICATION_TIME, +"age") + dfWithMetadata.collect + var fileScan = dfWithMetadata.queryExecution.executedPlan.collect { +case f: FileSourceScanExec => f + } + assert(fileScan.size == 1) + if (BackendTestUtils.isVeloxBackendLoaded()) { +assert(fileScan(0).nodeNamePrefix == "NativeFile") + } else { +assert(fileScan(0).nodeNamePrefix == "File") + } + + // would fallback + dfWithMetadata = df.select(METADATA_FILE_PATH, "file_path") + checkAnswer( +dfWithMetadata, +Seq( + Row(f0(METADATA_FILE_PATH), "jack"), + Row(f1(METADATA_FILE_PATH), "lily") +) + ) + fileScan = dfWithMetadata.queryExecution.executedPlan.collect { +case f: FileSourceScanExec => f + } + assert(fileScan(0).nodeNamePrefix == "File") + } + + metadataColumnsNativeTest("plan check with metadata filter", schemaWithFilePathField) { +(df, f0, f1) => + var filterDF = df +.select("file_path", "age", METADATA_FILE_NAME) +.where(Column(METADATA_FILE_NAME) === f0((METADATA_FILE_NAME))) + val ret = filterDF.collect + assert(ret.size == 1) + var fileScan = filterDF.queryExecution.executedPlan.collect { +case f: FileSourceScanExec => f +
Re: [PR] [VL] parquet file metadata columns support in velox [incubator-gluten]
yma11 commented on code in PR #3870: URL: https://github.com/apache/incubator-gluten/pull/3870#discussion_r1521090112 ## cpp/velox/substrait/SubstraitParser.cc: ## @@ -104,31 +104,41 @@ std::vector SubstraitParser::parseNamedStruct(const ::substrait::NamedS return typeList; } -std::vector SubstraitParser::parsePartitionColumns(const ::substrait::NamedStruct& namedStruct) { +void SubstraitParser::parsePartitionAndMetadataColumns( +const ::substrait::NamedStruct& namedStruct, +std::vector& isPartitionColumns, +std::vector& isMetadataColumns) { const auto& columnsTypes = namedStruct.column_types(); - std::vector isPartitionColumns; if (columnsTypes.size() == 0) { -// Regard all columns as non-partitioned columns. +// Regard all columns as regular columns. Review Comment: If `columnsTypes` has 0 as size, does it mean that there is no columns in this table? if so, maybe we should do a check `columnsTypes.size() > 0` instead? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For additional commands, e-mail: commits-h...@gluten.apache.org
Re: [PR] [VL] parquet file metadata columns support in velox [incubator-gluten]
yma11 commented on code in PR #3870: URL: https://github.com/apache/incubator-gluten/pull/3870#discussion_r1521062380 ## cpp/velox/compute/WholeStageResultIterator.cc: ## @@ -145,15 +145,31 @@ WholeStageResultIterator::WholeStageResultIterator( const auto& lengths = scanInfo->lengths; const auto& format = scanInfo->format; const auto& partitionColumns = scanInfo->partitionColumns; +const auto& metadataColumns = scanInfo->metadataColumns; std::vector> connectorSplits; connectorSplits.reserve(paths.size()); for (int idx = 0; idx < paths.size(); idx++) { auto partitionColumn = partitionColumns[idx]; + auto metadataColumn = metadataColumns[idx]; std::unordered_map> partitionKeys; constructPartitionColumns(partitionKeys, partitionColumn); + const std::unordered_map& customSplitInfo = {}; + const std::shared_ptr& extraFileInfo = {}; + const std::unordered_map& serdeParameters = {}; Review Comment: Do you plan to fill these variables in late PR? if not, I think it's avoid to create them but just use `{}` in the constructor. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For additional commands, e-mail: commits-h...@gluten.apache.org
Re: [PR] [VL] parquet file metadata columns support in velox [incubator-gluten]
zhouyuan commented on PR #3870: URL: https://github.com/apache/incubator-gluten/pull/3870#issuecomment-1991044843 CC: @yma11 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For additional commands, e-mail: commits-h...@gluten.apache.org
Re: [PR] [VL] parquet file metadata columns support in velox [incubator-gluten]
gaoyangxiaozhu commented on PR #3870: URL: https://github.com/apache/incubator-gluten/pull/3870#issuecomment-1991024316 @zhli1142015 and @zhouyuan could you help review ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For additional commands, e-mail: commits-h...@gluten.apache.org
Re: [PR] [VL] parquet file metadata columns support in velox [incubator-gluten]
github-actions[bot] commented on PR #3870: URL: https://github.com/apache/incubator-gluten/pull/3870#issuecomment-1990046048 Run Gluten Clickhouse CI -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For additional commands, e-mail: commits-h...@gluten.apache.org
Re: [PR] [VL] parquet file metadata columns support in velox [incubator-gluten]
github-actions[bot] commented on PR #3870: URL: https://github.com/apache/incubator-gluten/pull/3870#issuecomment-1990049174 Run Gluten Clickhouse CI -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For additional commands, e-mail: commits-h...@gluten.apache.org
Re: [PR] [VL] parquet file metadata columns support in velox [incubator-gluten]
github-actions[bot] commented on PR #3870: URL: https://github.com/apache/incubator-gluten/pull/3870#issuecomment-1988949586 Run Gluten Clickhouse CI -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For additional commands, e-mail: commits-h...@gluten.apache.org
Re: [PR] [VL] parquet file metadata columns support in velox [incubator-gluten]
github-actions[bot] commented on PR #3870: URL: https://github.com/apache/incubator-gluten/pull/3870#issuecomment-1988851240 Run Gluten Clickhouse CI -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For additional commands, e-mail: commits-h...@gluten.apache.org
Re: [PR] [VL] parquet file metadata columns support in velox [incubator-gluten]
github-actions[bot] commented on PR #3870: URL: https://github.com/apache/incubator-gluten/pull/3870#issuecomment-1988664197 Run Gluten Clickhouse CI -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For additional commands, e-mail: commits-h...@gluten.apache.org
Re: [PR] [VL] parquet file metadata columns support in velox [incubator-gluten]
github-actions[bot] commented on PR #3870: URL: https://github.com/apache/incubator-gluten/pull/3870#issuecomment-1988662251 Run Gluten Clickhouse CI -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For additional commands, e-mail: commits-h...@gluten.apache.org
Re: [PR] [VL] parquet file metadata columns support in velox [incubator-gluten]
github-actions[bot] commented on PR #3870: URL: https://github.com/apache/incubator-gluten/pull/3870#issuecomment-1988477007 Run Gluten Clickhouse CI -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For additional commands, e-mail: commits-h...@gluten.apache.org
Re: [PR] [VL] parquet file metadata columns support in velox [incubator-gluten]
zhli1142015 commented on PR #3870: URL: https://github.com/apache/incubator-gluten/pull/3870#issuecomment-1987856904 THe CI faulire looks related to your change @gaoyangxiaozhu, please check, thanks. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For additional commands, e-mail: commits-h...@gluten.apache.org
Re: [PR] [VL] parquet file metadata columns support in velox [incubator-gluten]
github-actions[bot] commented on PR #3870: URL: https://github.com/apache/incubator-gluten/pull/3870#issuecomment-1984028416 Run Gluten Clickhouse CI -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For additional commands, e-mail: commits-h...@gluten.apache.org
Re: [PR] [VL] parquet file metadata columns support in velox [incubator-gluten]
github-actions[bot] commented on PR #3870: URL: https://github.com/apache/incubator-gluten/pull/3870#issuecomment-1983982623 Run Gluten Clickhouse CI -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For additional commands, e-mail: commits-h...@gluten.apache.org