linliu-code opened a new issue, #18791:
URL: https://github.com/apache/hudi/issues/18791

   ## Describe the problem you faced
   
   When Hudi's data-skipping (column-stats index) is enabled, a `LIKE 
'prefix%'` predicate (Spark Catalyst `StartsWith`) silently drops rows in 
`1.1.x` and `master`. The same query worked correctly in `0.15.0` and 
`0.15.1-rc1`, so this is a regression introduced in the 1.x line.
   
   Root cause is in the predicate translation: for `StartsWith(col, 'X')` Hudi 
generates `colMin <= 'X' AND 'X' <= colMax`, which only matches files where the 
**single-character literal** `'X'` happens to fall lexicographically inside 
`[min, max]`. For any file that contains multi-character values starting with 
`'X'`, the min is *greater* than `'X'` (because `'X_anything'.compareTo('X') > 
0`), so the file is pruned even though it contains matching rows.
   
   This is silent data loss at query time — no error, no warning, just an empty 
result set.
   
   ## To Reproduce
   
   Single-file pyspark script — no Docker required.
   
   ```bash
   export HUDI_BUNDLE=/path/to/hudi-spark3.4-bundle_2.12-1.1.1.jar
   spark-submit \
     --master 'local[2]' \
     --jars "$HUDI_BUNDLE" \
     --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
     --conf spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar \
     --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension \
     --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog \
     repro.py
   ```
   
   ```python
   """Reproduce StartsWith translation bug in Hudi data-skipping.
   
   The table has NO NaN, NO null, NO truncated values, NO schema evolution.
   Three files, each with 10 string values starting with a single distinct character.
   Query: LIKE 'a%' — should match the 10 rows in file 0.
   """
   import os, tempfile
   from pyspark.sql import SparkSession
   from pyspark.sql.types import StructType, StructField, IntegerType, StringType
   
   ROOT = tempfile.mkdtemp(prefix="hudi_startswith_")
   spark = (SparkSession.builder.appName("repro")
       .config("spark.sql.shuffle.partitions","1").getOrCreate())
   spark.sparkContext.setLogLevel("WARN")
   
   schema = StructType([
       StructField("rk", IntegerType(), False),
       StructField("p",  StringType(),  False),
       StructField("s_str", StringType(), True),
   ])
   opts = {
       "hoodie.table.name": "startswith_repro",
       "hoodie.datasource.write.recordkey.field": "rk",
       "hoodie.datasource.write.partitionpath.field": "p",
       "hoodie.datasource.write.precombine.field": "rk",
       "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
       "hoodie.parquet.small.file.limit": "0",
       "hoodie.metadata.enable": "true",
       "hoodie.metadata.index.column.stats.enable": "true",
       "hoodie.metadata.index.column.stats.column.list": "s_str",
   }
   
   # 3 files in 1 partition. NO NaN / NO null. Every value starts with 'a' / 'b' / 'c'.
   files = [
       [(10000+k, "P", "a_" + format(k, "02d")) for k in range(10)],
       [(20000+k, "P", "b_" + format(k, "02d")) for k in range(10)],
       [(30000+k, "P", "c_" + format(k, "02d")) for k in range(10)],
   ]
   for i, rows in enumerate(files):
       (spark.createDataFrame(rows, schema)
           .write.format("hudi").options(**opts)
           .mode("overwrite" if i == 0 else "append")
           .save(ROOT))
   
   df_on  = spark.read.format("hudi").option("hoodie.enable.data.skipping","true").load(ROOT)
   df_off = spark.read.format("hudi").option("hoodie.enable.data.skipping","false").load(ROOT)
   on1  = df_on.where("s_str LIKE 'a%'").count()
   off1 = df_off.where("s_str LIKE 'a%'").count()
   on2  = df_on.where("s_str = 'a_00'").count()
   off2 = df_off.where("s_str = 'a_00'").count()
   print(f"\n  s_str LIKE 'a%'   ON={on1}  OFF={off1}  (expected 10)")
   print(f"  s_str = 'a_00'    ON={on2}  OFF={off2}  (expected 1)")
   spark.stop()
   ```
   
   ## Expected behavior
   
   With `hoodie.enable.data.skipping=true`, `LIKE 'a%'` should never return 
fewer rows than with `=false`. Data-skipping is a transparent performance 
optimization — it must never change query results.
   
   ```
   s_str LIKE 'a%'   ON=10  OFF=10  (expected 10)
   s_str = 'a_00'    ON=1   OFF=1   (expected 1)
   ```
   
   ## Actual behavior — silent zero-row result on 1.1.x and master
   
   Against `hudi-spark3.4-bundle_2.12-1.1.1.jar`:
   
   ```
   s_str LIKE 'a%'   ON=0   OFF=10  (expected 10)    <<< BUG (silent wrong result)
   s_str = 'a_00'    ON=1   OFF=1   (expected 1)
   ```
   
   Equality (`= 'a_00'`) works correctly. Only the prefix-match `LIKE 'a%'` is 
broken. The same script returns the correct `ON=10` against `0.15.0` and 
`0.15.1-rc1`; `1.1.0`-class bundles have not yet been verified. See the 
Cross-version Matrix.
   
   ## Cross-version Matrix
   
   Same script, same Spark 3.4.3, only swapping `--jars`:
   
   | Bundle | `LIKE 'a%'` ON | `LIKE 'a%'` OFF | Verdict |
   |---|---|---|---|
   | `hudi-spark3.4-bundle_2.12-0.15.0.jar` | **10** | 10 | works ✓ |
   | `hudi-spark3.4-bundle_2.12-0.15.1-rc1.jar` | **10** | 10 | works ✓ |
   | `hudi-spark3.4-bundle_2.12-1.1.1.jar` | **0** | 10 | **reproduces silent wrong result** |
   | `master HEAD` (1.3.0-SNAPSHOT) | **0** | 10 | **reproduces** |
   
   So this is a **regression in 1.x**, not a long-standing latent bug. Earlier 
0.x releases returned correct results for the same query.
   
   ## Root cause
   
   
   In `hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/DataSkippingUtils.scala`, 
around line 325 (master HEAD), the `StartsWith` case translates the predicate 
using `genColumnValuesEqualToExpression`, the same helper used for `EqualTo`:
   
   ```scala
   // Filter "colA like 'xxx%'"
   // Translates to "colA_minValue <= xxx AND xxx <= colA_maxValue" for index lookup
   //
   // NOTE: Since a) this operator matches strings by prefix and b) given that this column is going to be ordered
   //       lexicographically, we essentially need to check that provided literal falls w/in min/max bounds of the
   //       given column
   case StartsWith(sourceExpr @ AllowedTransformationExpression(attrRef), v @ Literal(_: UTF8String, _)) =>
     getTargetIndexedColumnName(attrRef, indexedCols)
       .map { colName =>
         val targetExprBuilder: Expression => Expression = swapAttributeRefInExpr(sourceExpr, attrRef, _)
         // produces colMin <= V AND V <= colMax
         genColumnValuesEqualToExpression(colName, v, targetExprBuilder)
       }.orElse(Option.empty)
   ```
   
   Which produces:
   
   ```
   colMin <= 'a' AND 'a' <= colMax
   ```
   
   That checks whether **the prefix literal itself** is in `[min, max]`. But 
for prefix matching, the file matches if **any value** in the column starts 
with the prefix. A file with `min='a_00'` and `max='a_09'` clearly matches 
`LIKE 'a%'`, but:
   
   - `min='a_00' <= 'a'` → `FALSE` (because `'a_00' > 'a'` lexicographically — 
sharing the `'a'` prefix and extending further)
   - The `AND` evaluates to false → file pruned
   
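   The translation keeps a matching file only when `colMin <= P`, for example 
when the prefix literal itself is one of the stored values (a single-character 
prefix matching single-character values). When every value in the file that 
starts with the prefix is longer than the prefix and nothing in the file sorts 
before it, `colMin > P`, the file is pruned, and the result is silently wrong.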
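
The pruning decision can be replayed outside Spark. The sketch below models the translation described above as plain Python string comparison over the min/max values of the three repro files. It assumes Python's code-point ordering agrees with Hudi's ordering for these short ASCII values, and `current_startswith_keeps` is a hypothetical name, not a Hudi function.

```python
# Min/max of s_str for the three files written by the repro script.
file_stats = {
    "file0": ("a_00", "a_09"),
    "file1": ("b_00", "b_09"),
    "file2": ("c_00", "c_09"),
}

def current_startswith_keeps(col_min: str, col_max: str, prefix: str) -> bool:
    """Model of the reported translation: colMin <= prefix AND prefix <= colMax."""
    return col_min <= prefix <= col_max

# Files that survive pruning for s_str LIKE 'a%'.
kept_like = [name for name, (lo, hi) in file_stats.items()
             if current_startswith_keeps(lo, hi, "a")]

# The same bracketing is sound for equality, e.g. s_str = 'a_00'.
kept_eq = [name for name, (lo, hi) in file_stats.items()
           if lo <= "a_00" <= hi]

print(kept_like)  # [] : every file is pruned, including file0
print(kept_eq)    # ['file0']
```

With no file surviving, the query scans nothing, which is consistent with the `ON=0` result reported for `LIKE 'a%'`.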
   
   ## Suggested fix
   
   For prefix `P`, a file with sorted range `[min, max]` may contain values 
starting with `P` iff its range overlaps `[P, successor(P))`. That is:
   
   ```
   max >= P AND min < successor(P)
   ```
   
   where `successor(P)` is `P` with its last code point incremented (with carry 
into preceding characters if the last is the max code point). For a single 
ASCII letter `'a'`, `successor('a') = 'b'`.
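
A minimal Python sketch of this overlap check follows. It assumes code-point ordering (which matches UTF-8 byte ordering), skips the surrogate range when incrementing, and treats a missing successor as "no upper bound". The names `successor` and `may_contain_prefix` are hypothetical and are not Hudi APIs.

```python
from typing import Optional

MAX_CODE_POINT = 0x10FFFF

def successor(prefix: str) -> Optional[str]:
    """Exclusive upper bound of all strings starting with `prefix`.

    Returns None when no such bound exists (empty prefix, or a prefix made
    only of the maximum code point).
    """
    chars = list(prefix)
    while chars:
        cp = ord(chars[-1])
        if cp < MAX_CODE_POINT:
            nxt = cp + 1
            if 0xD800 <= nxt <= 0xDFFF:
                nxt = 0xE000  # surrogates are not encodable in UTF-8; skip them
            chars[-1] = chr(nxt)
            return "".join(chars)
        chars.pop()  # last code point is the maximum: drop it and carry left
    return None

def may_contain_prefix(col_min: str, col_max: str, prefix: str) -> bool:
    """True if [col_min, col_max] overlaps [prefix, successor(prefix))."""
    if col_max < prefix:
        return False
    upper = successor(prefix)
    return upper is None or col_min < upper

print(successor("a"))                             # b
print(may_contain_prefix("a_00", "a_09", "a"))    # True
print(may_contain_prefix("b_00", "b_09", "a"))    # False
```

On the repro data this keeps only the file holding the `a_*` values and prunes the other two.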
   
   A conservative simplification that's always safe (no false pruning) but 
prunes less aggressively:
   
   ```
   max >= P
   ```
   
   This loses pruning on files whose `min` is at or beyond `successor(P)` (i.e. 
files containing only values lexicographically beyond the prefix range). But it 
never wrongly prunes.
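
To make the trade-off concrete, the sketch below applies both forms to the min/max values of the three repro files. The function names are hypothetical, and `upper` stands in for `successor(P)`, supplied by hand for single ASCII letters.

```python
# Min/max of s_str for the three files written by the repro script.
file_stats = {
    "file0": ("a_00", "a_09"),
    "file1": ("b_00", "b_09"),
    "file2": ("c_00", "c_09"),
}

def conservative_keeps(col_min: str, col_max: str, prefix: str) -> bool:
    # max >= P
    return col_max >= prefix

def full_keeps(col_min: str, col_max: str, prefix: str, upper: str) -> bool:
    # max >= P AND min < successor(P)
    return col_max >= prefix and col_min < upper

results = {}
for prefix, upper in [("a", "b"), ("c", "d")]:
    cons = sorted(n for n, (lo, hi) in file_stats.items()
                  if conservative_keeps(lo, hi, prefix))
    full = sorted(n for n, (lo, hi) in file_stats.items()
                  if full_keeps(lo, hi, prefix, upper))
    results[prefix] = (cons, full)
    print(prefix, cons, full)
```

For `LIKE 'a%'` the conservative form keeps all three files and the full form keeps only `file0`; for `LIKE 'c%'` both keep only `file2`. Neither form drops a file that holds matching rows.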
   
   A correct full implementation would compute `successor(P)` for arbitrary 
UTF-8 strings, handling the case where the last code point is `0x10FFFF` (the 
maximum Unicode code point) by carrying into the previous character — or fall 
back to the conservative `max >= P` form when overflow occurs.
   
   The same translation issue applies to `Not(StartsWith(...))` (around line 
338 of the same file): the corrected inversion should likewise reason about 
prefix ranges, not the literal as a value.
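
One way to reason about the negated case, offered as a sketch rather than as the Hudi implementation: a file can be skipped for `NOT LIKE 'P%'` only when every value in it starts with `P`, and that holds exactly when both `min` and `max` start with `P`, since everything between them then lies in `[P, successor(P))`. This assumes the stored min/max are exact (not truncated). The name `not_startswith_keeps` is hypothetical.

```python
def not_startswith_keeps(col_min: str, col_max: str, prefix: str) -> bool:
    """Keep the file unless all of its values must start with `prefix`."""
    all_values_match = col_min.startswith(prefix) and col_max.startswith(prefix)
    return not all_values_match

print(not_startswith_keeps("a_00", "a_09", "a"))  # False: every value starts with 'a'
print(not_startswith_keeps("a_00", "b_09", "a"))  # True: some value may not start with 'a'
print(not_startswith_keeps("b_00", "b_09", "a"))  # True
```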
   
   ## Environment Description
   
   - Hudi version: **1.1.1** (current GA from Maven Central; 
`hudi-spark3.4-bundle_2.12-1.1.1.jar`). Reproduces identically on master HEAD.
   - Spark version: 3.4.3
   - Hadoop version: 3 (bundled Spark distribution)
   - Storage: local FS — bug is in the predicate translator and is 
storage-independent
   - Running on Docker?: optional
   
   ## Additional context
   
   - This is a different bug from #18754 (NaN col-stats corruption). The two 
were initially observed together but are independent: this bug reproduces with 
**zero NaN** values, with **no nulls**, and with **no truncated strings**. The 
values are well-behaved, short, ASCII strings.
   - Equality predicates (`col = 'X'`) work correctly because for an exact 
match the literal IS expected to be in `[min, max]`.
   - Range predicates (`col > 'X'`, `col < 'X'`) work correctly because they 
use only one bound (`colMax > 'X'` or `colMin < 'X'`, respectively) without the 
bracketing.
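
These claims can be checked by brute force on a handful of strings. The sketch below enumerates small "files", derives their min/max, and counts cases where a translation prunes a file that holds a matching row. The helper names and the sample values are illustrative assumptions, and the predicates are simplified models of the translations described in this issue, not Hudi code.

```python
from itertools import combinations

values = ["a", "a_00", "a_09", "b", "b_00", "c_09"]

def eq_keeps(lo, hi, x):          # col = x     ->  colMin <= x AND x <= colMax
    return lo <= x <= hi

def gt_keeps(lo, hi, x):          # col > x     ->  colMax > x
    return hi > x

def startswith_keeps(lo, hi, p):  # col LIKE p% ->  colMin <= p AND p <= colMax
    return lo <= p <= hi

def unsound_cases(keeps, matches):
    """(file, literal) pairs where a file with a matching row is pruned."""
    bad = []
    for n in (1, 2, 3):
        for file in combinations(values, n):
            lo, hi = min(file), max(file)
            for lit in values:
                has_match = any(matches(v, lit) for v in file)
                if has_match and not keeps(lo, hi, lit):
                    bad.append((file, lit))
    return bad

eq_bad = unsound_cases(eq_keeps, lambda v, x: v == x)
gt_bad = unsound_cases(gt_keeps, lambda v, x: v > x)
sw_bad = unsound_cases(startswith_keeps, str.startswith)

print(len(eq_bad), len(gt_bad), len(sw_bad) > 0)  # 0 0 True
```

Equality and greater-than never prune a matching file in this enumeration, while the prefix translation does, for instance on a file containing only `a_00` queried with prefix `a`.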
   
   ## Workaround available today
   
   Disable data-skipping at query time:
   ```python
   spark.read.format("hudi").option("hoodie.enable.data.skipping","false").load(path)
   ```
   
   This defeats the purpose of the col-stats feature but guarantees correctness 
for `LIKE 'X%'` queries.
   
   ## Stacktrace
   
   n/a — silent wrong result, no exception, no warning, no log line.
   

