[I] Velox regexp_replace drops LF bytes when input is a NativeScan string column [gluten]

via GitHub Thu, 21 May 2026 00:10:51 -0700


beliefer opened a new issue, #12120:
URL: https://github.com/apache/gluten/issues/12120


   ### Backend
   
   VL (Velox)
   
   ### Bug description
   
   ## Reproduction
     
     Iceberg table `opdshare_kefu.yw_question` has a `qbody` STRING column containing
     mixed Chinese characters and embedded `0x0A` (LF) bytes.
   
     ```sql
     select length(qbody),
            length(regexp_replace(qbody, '\\n', '\\\\n'))
     from opdshare_kefu.yw_question
     where id='86648395' and dt='20260509';
     ```
   
     
     ┌────────┬───────────────┬─────────────────────────────────────────────────┐
     │ Engine │ length(qbody) │           length(regexp_replace(...))           │
     ├────────┼───────────────┼─────────────────────────────────────────────────┤
     │ Spark  │ 629           │ 647 (correct, +18 chars: each LF → 2-char "\n") │
     ├────────┼───────────────┼─────────────────────────────────────────────────┤
     │ Gluten │ 629           │ 611 (wrong, -18 chars: each LF deleted)         │
     └────────┴───────────────┴─────────────────────────────────────────────────┘
   
     The qbody contains 18 LF bytes interleaved with multibyte UTF-8 (Chinese) characters.
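
The length arithmetic in the table can be checked with a small sketch. Python's `re` module is used here only as a stand-in for the expected semantics (it is neither Spark's regex engine nor RE2), and the sample string is synthetic, not the actual `qbody` value. The helper names `escape_lf` and `drop_lf` are hypothetical.

```python
import re


def escape_lf(text: str) -> str:
    """Replace every LF with the two characters backslash + 'n'."""
    # A function replacement avoids escape processing of a replacement template.
    return re.sub(r"\n", lambda m: "\\n", text)


def drop_lf(text: str) -> str:
    """Model of the behaviour reported for Gluten: each LF is deleted."""
    return text.replace("\n", "")


# Synthetic sample: CJK characters interleaved with 3 LF characters.
sample = "客服\n问题\n回答\n结束"
lf_count = sample.count("\n")

expected = escape_lf(sample)
buggy = drop_lf(sample)

# Lengths are counted in characters, not bytes.
print(len(sample), len(expected), len(buggy))
```

Each LF adds one character in the correct result and removes one in the buggy result, which matches the reported 629 → 647 and 629 → 611 for 18 LFs.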
   
     Reduced repro using substr(qbody, 1, 50)
   
     select length(regexp_replace(substr(qbody, 1, 50), '\\n', '\\\\n'))
     from opdshare_kefu.yw_question where id='86648395' and dt='20260509';
     -- Spark: 53, Gluten: 47
   
     An inline literal does NOT trigger the bug
   
     select length(regexp_replace(unhex('<the same 50-char hex bytes>'), '\\n', '\\\\n'));
     -- Both Spark and Gluten: 53 (correct)
   
     A NativeScan column input does trigger the bug
   
     The bug only appears when input flows from IcebergBatchScanTransformer (or
     likely any *ScanTransformer producing Velox StringViews referencing the
     original column buffer).
   
     Workaround
   
     Use replace() (literal string replace, not regex):
     replace(qbody, unhex('0A'), '\\n')   -- works correctly on both engines
     
     Or rebuild the string via unhex(hex(col)):
     regexp_replace(unhex(hex(qbody)), '\\n', '\\\\n')   -- bug avoided
     
     Suspected root cause
   
     When regexp_replace operates on a Velox StringView pointing into the original
     column buffer, the LF byte (0x0A, an ASCII control character) immediately preceded
     or followed by a multibyte UTF-8 lead byte (0xE5 etc.) appears to confuse RE2's UTF-8
     boundary handling. Inline literals work because they go through a different
     code path that materializes the string before regex.
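
To make the suspected byte pattern concrete, the sketch below lists each LF byte in a UTF-8 buffer together with the class of its neighbouring bytes. The helper names `classify` and `lf_neighbours` and the sample text are hypothetical; the sketch only illustrates the byte layout and does not confirm the root cause.

```python
def classify(byte: int) -> str:
    """Classify a single UTF-8 byte by its high bits."""
    if byte < 0x80:
        return "ascii"
    if byte < 0xC0:
        return "continuation"  # 10xxxxxx
    return "lead"              # 11xxxxxx starts a multibyte sequence


def lf_neighbours(data: bytes):
    """Yield (offset, previous byte class, next byte class) for each LF byte."""
    for i, b in enumerate(data):
        if b != 0x0A:
            continue
        prev_cls = classify(data[i - 1]) if i > 0 else None
        next_cls = classify(data[i + 1]) if i + 1 < len(data) else None
        yield i, prev_cls, next_cls


raw = "客\n服".encode("utf-8")  # bytes: E5 AE A2 0A E6 9C 8D
print(raw.hex(" "))
for offset, prev_cls, next_cls in lf_neighbours(raw):
    print(offset, prev_cls, next_cls)
```

In valid UTF-8 an LF can be directly followed by a lead byte, while the byte directly before it is either ASCII or the last continuation byte of the preceding character. A minimal repro buffer therefore probably needs an LF sitting between two multibyte characters, as in this sample.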
   
     Impact
   
     100% of rows containing both LF and CJK characters produce corrupt output.
     We discovered this when comparing 266k rows between Spark and Gluten in
     customer service text data; every single output row differs.
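
A row-by-row comparison like the one described could be scripted roughly as follows. This is a sketch under assumptions: the two result sets are taken to be already exported as `{id: value}` dictionaries, and `diff_rows` is a hypothetical helper, not part of Spark or Gluten.

```python
_MISSING = object()  # sentinel so a missing row differs from a NULL (None) value


def diff_rows(spark_rows: dict, gluten_rows: dict) -> list:
    """Return the ids whose values differ or that exist on only one side."""
    mismatched = []
    for row_id in sorted(spark_rows.keys() | gluten_rows.keys()):
        left = spark_rows.get(row_id, _MISSING)
        right = gluten_rows.get(row_id, _MISSING)
        if left != right:
            mismatched.append(row_id)
    return mismatched


# Synthetic rows: in row "1" one engine escaped the LF and the other dropped it.
spark_rows = {"1": "客服\\n问题", "2": "no newline"}
gluten_rows = {"1": "客服问题", "2": "no newline"}

print(diff_rows(spark_rows, gluten_rows))
```

Rows without an LF compare equal, so the mismatch list isolates the rows affected by the dropped bytes.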
   
   ### Gluten version
   
   Gluten-1.3
   
   ### Spark version
   
   Spark-3.5.x
   
   ### Spark configurations
   
   _No response_
   
   ### System information
   
   _No response_
   
   ### Relevant logs
   
   ```bash
   
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

