hudi-agent commented on code in PR #18583: URL: https://github.com/apache/hudi/pull/18583#discussion_r3200756030
########## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestBlobDataType.scala: ##########
@@ -0,0 +1,283 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.spark.sql.hudi.dml.schema
+
+import org.apache.hudi.blob.BlobTestHelpers
+import org.apache.hudi.common.schema.{HoodieSchema, HoodieSchemaType}
+import org.apache.hudi.testutils.DataSourceTestUtils
+import org.apache.hudi.testutils.HoodieClientTestUtils.createMetaClient
+
+import org.apache.spark.sql.hudi.common.HoodieSparkSqlTestBase
+
+import java.io.File
+
+class TestBlobDataType extends HoodieSparkSqlTestBase {
+
+  private val referenceStructType =
+    "struct<external_path:string, offset:bigint, length:bigint, managed:boolean>"
+
+  private def inlineBlobLiteral(hex: String): String =
+    s"""named_struct(
+       |  'type', 'INLINE',
+       |  'data', cast(X'$hex' as binary),
+       |  'reference', cast(null as $referenceStructType)
+       |)""".stripMargin
+
+  private def outOfLineBlobLiteral(path: String, offset: Long, length: Long): String =
+    s"""named_struct(
+       |  'type', 'OUT_OF_LINE',
+       |  'data', cast(null as binary),
+       |  'reference', named_struct(
+       |    'external_path', '$path',
+       |    'offset', cast($offset as bigint),
+       |    'length', cast($length as bigint),
+       |    'managed', false
+       |  )
+       |)""".stripMargin

Review Comment:
_⚠️ Potential issue_ | _🟡 Minor_

<details>
<summary>🧩 Analysis chain</summary>

🏁 Script executed:

```shell
cd hudi-spark-datasource/hudi-spark && find . -name "TestBlobDataType.scala" -type f
```

Repository: hudi-agent/hudi

Length of output: 136

---

🏁 Script executed:

```shell
cd hudi-spark-datasource/hudi-spark && cat -n src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestBlobDataType.scala | head -200
```

Repository: hudi-agent/hudi

Length of output: 10107

---

🏁 Script executed:

```shell
cd hudi-spark-datasource/hudi-spark && wc -l src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestBlobDataType.scala
```

Repository: hudi-agent/hudi

Length of output: 138

---

🏁 Script executed:

```shell
cd hudi-spark-datasource/hudi-spark && grep -n "'\$" src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestBlobDataType.scala
```

Repository: hudi-agent/hudi

Length of output: 243

---

🏁 Script executed:

```shell
cd hudi-spark-datasource/hudi-spark && sed -n '43,53p' src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestBlobDataType.scala
```

Repository: hudi-agent/hudi

Length of output: 476

---

🏁 Script executed:

```shell
cd hudi-spark-datasource/hudi-spark && sed -n '175,183p' src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestBlobDataType.scala
```

Repository: hudi-agent/hudi

Length of output: 285

---

🏁 Script executed:

```shell
cd hudi-spark-datasource/hudi-spark && sed -n '60,75p' src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestBlobDataType.scala
```

Repository: hudi-agent/hudi

Length of output: 548

---

</details>

**Escape single quotes in SQL string interpolations.** Paths interpolated at line 48 (`external_path`), line 66, and line 182 (`location`) can contain single quotes, breaking SQL parsing. For example, a path like `/tmp/test'file/hudi` would produce invalid SQL. Escape single quotes by doubling them (SQL standard).
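The quote-doubling fix is language-agnostic, so the idea can be sketched outside Scala as well. Below is a minimal illustration in Python; the helper name `sql_string_literal` simply mirrors the one proposed in this review and is not an existing API.

```python
def sql_string_literal(value: str) -> str:
    """Escape a value for embedding in a single-quoted SQL string literal
    by doubling each single quote (SQL-standard escaping)."""
    return value.replace("'", "''")

# A path containing a single quote would otherwise terminate the literal early.
path = "/tmp/test'file/hudi"
sql = f"select '{sql_string_literal(path)}' as external_path"
print(sql)  # select '/tmp/test''file/hudi' as external_path
```

Without the helper, the interpolated literal would read `'/tmp/test'file/hudi'`, which the SQL parser sees as the literal `/tmp/test` followed by stray tokens.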
Add a helper function and apply it to all path interpolations:

<details>
<summary>Proposed fix</summary>

```diff
+  private def sqlStringLiteral(value: String): String = value.replace("'", "''")
+
   private def outOfLineBlobLiteral(path: String, offset: Long, length: Long): String =
     s"""named_struct(
        |  'type', 'OUT_OF_LINE',
        |  'data', cast(null as binary),
        |  'reference', named_struct(
-       |    'external_path', '$path',
+       |    'external_path', '${sqlStringLiteral(path)}',
        |    'offset', cast($offset as bigint),
        |    'length', cast($length as bigint),
        |    'managed', false
        |  )
        |)""".stripMargin
```

```diff
-           | location '$tablePath'
+           | location '${sqlStringLiteral(tablePath)}'
```

</details>

Also applies to: 60-67, 176-183

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed. In
`@hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestBlobDataType.scala`
around lines 43 - 53, The SQL string interpolations insert raw paths that may
contain single quotes (e.g., in outOfLineBlobLiteral's 'external_path' and
other SQL builders that set 'external_path' or 'location'), which will break
SQL parsing; add a small helper (e.g., escapeSqlSingleQuotes(s: String): String
= s.replace("'", "''")) and use it when embedding paths into SQL string
literals so every occurrence (reference in outOfLineBlobLiteral and the other
SQL-building functions that interpolate external_path/location) wraps the path
with the helper before interpolation.
```

</details>

<!-- This is an auto-generated comment by CodeRabbit -->

— *CodeRabbit* ([original](https://github.com/hudi-agent/hudi/pull/18#discussion_r3143063277)) (source:comment#3143063277)

########## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestBlobDataType.scala: ##########
@@ -0,0 +1,283 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.spark.sql.hudi.dml.schema
+
+import org.apache.hudi.blob.BlobTestHelpers
+import org.apache.hudi.common.schema.{HoodieSchema, HoodieSchemaType}
+import org.apache.hudi.testutils.DataSourceTestUtils
+import org.apache.hudi.testutils.HoodieClientTestUtils.createMetaClient
+
+import org.apache.spark.sql.hudi.common.HoodieSparkSqlTestBase
+
+import java.io.File
+
+class TestBlobDataType extends HoodieSparkSqlTestBase {
+
+  private val referenceStructType =
+    "struct<external_path:string, offset:bigint, length:bigint, managed:boolean>"
+
+  private def inlineBlobLiteral(hex: String): String =
+    s"""named_struct(
+       |  'type', 'INLINE',
+       |  'data', cast(X'$hex' as binary),
+       |  'reference', cast(null as $referenceStructType)
+       |)""".stripMargin
+
+  private def outOfLineBlobLiteral(path: String, offset: Long, length: Long): String =
+    s"""named_struct(
+       |  'type', 'OUT_OF_LINE',
+       |  'data', cast(null as binary),
+       |  'reference', named_struct(
+       |    'external_path', '$path',
+       |    'offset', cast($offset as bigint),
+       |    'length', cast($length as bigint),
+       |    'managed', false
+       |  )
+       |)""".stripMargin
+
+  test("Test Query Log Only MOR Table With BLOB INLINE column triggers compaction") {
+    withRecordType()(withTempDir { tmp =>
+      val tablePath = new File(tmp, "hudi").getCanonicalPath
+      val tableName = generateTableName
+      spark.sql(
+        s"""
+           |create table $tableName (
+           |  id int,
+           |  data blob,
+           |  ts long
+           |) using hudi
+           | location '$tablePath'
+           | tblproperties (
+           |  primaryKey = 'id',
+           |  type = 'mor',
+           |  preCombineField = 'ts',
+           |  hoodie.index.type = 'INMEMORY',
+           |  hoodie.compact.inline = 'true',
+           |  hoodie.clean.commits.retained = '1'
+           | )

Review Comment:
_⚠️ Potential issue_ | _🟠 Major_

<details>
<summary>🧩 Analysis chain</summary>

🏁 Script executed:

```shell
#!/bin/bash
set -euo pipefail
# Verify whether the test pins compaction threshold and inspect config defaults/usage.
rg -n "hoodie\.compact\.inline(\.max\.delta\.commits)?|clean\.commits\.retained" --type=scala
rg -n "compact.*max.*delta|INLINE.*DELTA|hoodie\.compact\.inline\.max\.delta\.commits" --type=java --type=scala
```

Repository: hudi-agent/hudi

Length of output: 46665

---

🏁 Script executed:

```shell
cat -n hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestBlobDataType.scala | sed -n '50,120p'
```

Repository: hudi-agent/hudi

Length of output: 3425

---

🏁 Script executed:

```shell
cat -n hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestBlobDataType.scala | sed -n '210,240p'
```

Repository: hudi-agent/hudi

Length of output: 1662

---

🏁 Script executed:

```shell
rg "INLINE_COMPACT_NUM_DELTA_COMMITS.*defaultValue|default.*INLINE|defaultValue.*hoodie\.compact\.inline\.max" hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java -A 2 -B 2
```

Repository: hudi-agent/hudi

Length of output: 295

---

🏁 Script executed:

```shell
rg "INLINE_COMPACT_NUM_DELTA_COMMITS.*=" hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java -A 3 | head -15
```

Repository: hudi-agent/hudi

Length of output: 801

---

🏁 Script executed:

```shell
rg "\.defaultValue\(\)" hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java -B 5 | grep -A 5 "INLINE_COMPACT_NUM_DELTA_COMMITS"
```

Repository: hudi-agent/hudi

Length of output: 526

---

🏁 Script executed:

```shell
cat -n hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestBlobDataType.scala | sed -n '175,195p'
```

Repository: hudi-agent/hudi

Length of output: 841

---

</details>

**Explicitly pin the compaction trigger threshold to improve test reliability.** The tests assume compaction occurs at the 5th delta commit (lines 105 and 224), but rely on the implicit default value of `hoodie.compact.inline.max.delta.commits` (currently 5). While this happens to match, the implicit dependency makes tests brittle to future default changes. Many other tests in the codebase explicitly set this property; both test tables should follow that pattern.

<details>
<summary>Proposed fix</summary>

```diff
   | tblproperties (
   |  primaryKey = 'id',
   |  type = 'mor',
   |  preCombineField = 'ts',
   |  hoodie.index.type = 'INMEMORY',
   |  hoodie.compact.inline = 'true',
+  |  hoodie.compact.inline.max.delta.commits = '5',
   |  hoodie.clean.commits.retained = '1'
   | )
```

</details>

Also applies to: lines 183-190

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed. In
`@hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestBlobDataType.scala`
around lines 67 - 74, The test relies on compaction happening at the 5th delta
commit but currently depends on the implicit default; update the table creation
tblproperties in TestBlobDataType.scala to explicitly set
hoodie.compact.inline.max.delta.commits = '5' for both table definitions (the
existing tblproperties block that contains primaryKey, type='mor',
preCombineField='ts', hoodie.index.type, hoodie.compact.inline,
hoodie.clean.commits.retained) so compaction behavior is deterministic; apply
the same explicit property to the second table's tblproperties block referenced
around lines 183-190.
```

</details>

<!-- This is an auto-generated comment by CodeRabbit -->

— *CodeRabbit* ([original](https://github.com/hudi-agent/hudi/pull/18#discussion_r3143063278)) (source:comment#3143063278)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
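As a coda to the second finding, the brittleness of relying on the implicit default can be shown with a toy model. This is a hypothetical sketch, not Hudi's actual scheduling code (the real trigger logic lives in the compaction config and scheduler classes); the property names are real, and the default of 5 is the value cited in the review.

```python
# Simplified model of the inline-compaction trigger, to show why pinning
# hoodie.compact.inline.max.delta.commits in tblproperties matters.
DEFAULT_MAX_DELTA_COMMITS = 5  # current default cited in the review

def should_compact(num_delta_commits: int, props: dict) -> bool:
    # An unpinned table silently inherits whatever default the release ships.
    threshold = int(props.get("hoodie.compact.inline.max.delta.commits",
                              DEFAULT_MAX_DELTA_COMMITS))
    return props.get("hoodie.compact.inline") == "true" \
        and num_delta_commits >= threshold

pinned = {"hoodie.compact.inline": "true",
          "hoodie.compact.inline.max.delta.commits": "5"}
unpinned = {"hoodie.compact.inline": "true"}

print(should_compact(5, pinned))    # True, regardless of future default changes
print(should_compact(5, unpinned))  # True today only because the default is 5
```

If a future release raised the default to, say, 6, the unpinned table's test would stop observing compaction at the 5th delta commit while the pinned table's behavior would be unchanged — which is exactly why the review asks for the explicit property.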
