beliefer opened a new issue, #12120:
URL: https://github.com/apache/gluten/issues/12120
### Backend
VL (Velox)
### Bug description
## Reproduction
Iceberg table `opdshare_kefu.yw_question` has a `qbody` STRING column containing mixed Chinese characters and embedded `0x0A` (LF) bytes.
```sql
select length(qbody),
       length(regexp_replace(qbody, '\\n', '\\\\n'))
from opdshare_kefu.yw_question
where id='86648395' and dt='20260509';
```

| Engine | length(qbody) | length(regexp_replace(...)) |
|--------|---------------|-----------------------------|
| Spark  | 629           | 647 (correct, +18 chars: each LF → 2-char `\n`) |
| Gluten | 629           | 611 (wrong, -18 chars: each LF deleted) |

The `qbody` value contains 18 LF bytes interleaved with multibyte UTF-8 (Chinese) characters.
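
The length arithmetic in the table can be checked against a small reference model. The Python sketch below uses a synthetic stand-in for `qbody` (the real row is not reproduced here; the sample string and the helper name `lf_replacement_lengths` are assumptions for illustration) with 629 characters, 18 of them LF. It models only the expected string semantics, not Spark or Gluten internals.

```python
import re

def lf_replacement_lengths(text: str) -> tuple[int, int, int]:
    """Return character lengths of: the input, LF escaped, LF deleted."""
    # Expected behaviour of the query: each LF becomes backslash + "n" (2 chars).
    escaped = re.sub("\n", lambda m: "\\n", text)
    # Behaviour consistent with the Gluten result: each LF disappears.
    deleted = text.replace("\n", "")
    return len(text), len(escaped), len(deleted)

# Hypothetical sample: 18 blocks of 32 CJK chars + 1 LF, then 35 CJK chars.
# 18 * 33 + 35 = 629 characters, of which 18 are LF.
sample = ("客服" * 16 + "\n") * 18 + "问" * 35

print(lf_replacement_lengths(sample))  # (629, 647, 611)
```

The same arithmetic applies to any substring: with k LFs among n characters, the expected length is n + k, while deleting the LFs gives n - k.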
## Reduced repro with substr(50)

```sql
select length(regexp_replace(substr(qbody, 1, 50), '\\n', '\\\\n'))
from opdshare_kefu.yw_question where id='86648395' and dt='20260509';
-- Spark: 53, Gluten: 47
```
## Inline literal does NOT trigger

```sql
select length(regexp_replace(unhex('<the same 50-char hex bytes>'), '\\n', '\\\\n'));
-- Both Spark and Gluten: 53 (correct)
```
## NativeScan column input does trigger

The bug only appears when input flows from `IcebergBatchScanTransformer` (or likely any `*ScanTransformer` producing Velox `StringView`s referencing the original column buffer).
## Workaround

Use `replace()` (literal string replace, not regex):

```sql
replace(qbody, unhex('0A'), '\\n') -- works correctly on both engines
```

Or rebuild the string via `unhex(hex(col))`:

```sql
regexp_replace(unhex(hex(qbody)), '\\n', '\\\\n') -- bug avoided
```
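
Both workarounds should be semantically neutral: neither changes what the expression is supposed to return. The Python sketch below is a reference model under that assumption (the function names and the sample string are hypothetical, and nothing here exercises Spark, Gluten, or Velox). It checks that literal replacement and a hex round trip followed by the regex give the same string as the original regex expression.

```python
import re

def via_regex(text: str) -> str:
    # Reference behaviour of the original expression: LF -> backslash + "n".
    return re.sub("\n", lambda m: "\\n", text)

def via_literal_replace(text: str) -> str:
    # Models replace(qbody, unhex('0A'), ...): plain substring replacement,
    # with no regex engine involved.
    return text.replace("\n", "\\n")

def via_hex_round_trip(text: str) -> str:
    # Models unhex(hex(col)): the bytes are re-encoded and decoded into an
    # equal string held in a new buffer before the regex runs.
    rebuilt = bytes.fromhex(text.encode("utf-8").hex()).decode("utf-8")
    return via_regex(rebuilt)

sample = "您好\n请问\n谢谢"  # hypothetical sample: 6 CJK chars, 2 LFs

assert via_regex(sample) == via_literal_replace(sample) == via_hex_round_trip(sample)
print(len(sample), len(via_regex(sample)))  # 8 10
```

If the hex round trip really does avoid the bug, that would be consistent with the problem lying in how the input buffer is handled rather than in the string contents, since the round trip leaves the bytes unchanged.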
## Suspected root cause

When `regexp_replace` operates on a Velox `StringView` pointing into the original column buffer, an LF byte (`0x0A`, an ASCII control code) immediately preceded or followed by a multibyte UTF-8 lead byte (`0xE5` etc.) appears to confuse RE2's UTF-8 boundary handling. Inline literals likely work because they go through a different code path that materializes the string before the regex runs.
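
One property is worth keeping in mind when testing this hypothesis: in valid UTF-8, every byte of a multibyte sequence is `0x80` or above, so a `0x0A` byte is always a standalone LF and never part of a CJK character. The Python sketch below (sample string and helper names chosen for illustration, not taken from the affected table) prints the byte layout of an LF between two CJK characters and checks that byte-level and character-level replacement agree. It says nothing about what RE2 or Velox actually do; it only shows the result a correct implementation should produce.

```python
import re

def escape_lf_chars(text: str) -> str:
    # Character-level reference: each LF becomes backslash + "n".
    return re.sub("\n", lambda m: "\\n", text)

def escape_lf_bytes(raw: bytes) -> bytes:
    # Byte-level reference: operate on the UTF-8 bytes without decoding first.
    return re.sub(rb"\n", lambda m: b"\\n", raw)

sample = "客\n服"                 # hypothetical sample; "客" starts with lead byte 0xE5
encoded = sample.encode("utf-8")
print(encoded.hex(" "))          # e5 ae a2 0a e6 9c 8d

# No byte inside a multibyte sequence can equal 0x0A.
assert all(b >= 0x80 for b in "客服".encode("utf-8"))

# Both levels agree, so multibyte neighbours should not affect the LF match.
assert escape_lf_bytes(encoded).decode("utf-8") == escape_lf_chars(sample)
```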
## Impact

100% of rows containing both LF and CJK characters produce corrupt output. We discovered this when comparing 266k rows of customer service text data between Spark and Gluten: every single output row differs.
### Gluten version
Gluten-1.3
### Spark version
Spark-3.5.x
### Spark configurations
_No response_
### System information
_No response_
### Relevant logs
```bash
```