[PR] Implement zero-copy tokenization for identifiers, strings, and comments [datafusion-sqlparser-rs]

via GitHub Thu, 18 Dec 2025 12:50:08 -0800


eyalleshem opened a new pull request, #2136:
URL: https://github.com/apache/datafusion-sqlparser-rs/pull/2136


    This PR implements zero-copy tokenization by using borrowed strings 
(`&str`) instead of owned strings (`String`) for identifiers, string literals, 
and comments. This eliminates unnecessary string allocations during the 
tokenization process.
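
As a rough illustration of the borrowed-token idea, the sketch below uses a hypothetical, heavily simplified `Token<'a>` enum and `tokenize` function. Neither is the crate's real API: the actual `Token` type has many more variants, and this sketch does not handle escaped quotes (`''`), comments, or punctuation. The point is only that each token holds a slice of the input rather than a freshly allocated `String`.

```rust
/// Hypothetical, simplified token type (not the real sqlparser `Token`).
/// Every variant borrows from the input, so no per-token `String` is allocated.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Token<'a> {
    Word(&'a str),
    SingleQuotedString(&'a str),
    Whitespace(&'a str),
}

/// Splits `input` into tokens whose payloads are slices of `input`.
/// Assumption: quotes are never escaped; an unterminated string runs to the end.
fn tokenize(input: &str) -> Vec<Token<'_>> {
    let bytes = input.as_bytes();
    let mut tokens = Vec::new();
    let mut i = 0;
    while i < bytes.len() {
        let start = i;
        if bytes[i].is_ascii_whitespace() {
            // Consume a run of ASCII whitespace.
            while i < bytes.len() && bytes[i].is_ascii_whitespace() {
                i += 1;
            }
            tokens.push(Token::Whitespace(&input[start..i]));
        } else if bytes[i] == b'\'' {
            // Consume up to the closing quote; the slice excludes both quotes.
            i += 1;
            while i < bytes.len() && bytes[i] != b'\'' {
                i += 1;
            }
            tokens.push(Token::SingleQuotedString(&input[start + 1..i]));
            if i < bytes.len() {
                i += 1; // skip the closing quote
            }
        } else {
            // Anything else is treated as a word until whitespace or a quote.
            // Splitting only at ASCII bytes keeps every slice on a UTF-8 boundary.
            while i < bytes.len() && !bytes[i].is_ascii_whitespace() && bytes[i] != b'\'' {
                i += 1;
            }
            tokens.push(Token::Word(&input[start..i]));
        }
    }
    tokens
}
```

Calling `tokenize("SELECT 'abc' FROM t")` yields tokens that point into the original string, so the caller must keep the SQL text alive for as long as the tokens are used. That lifetime constraint is the main trade-off of this design.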
   
     ## Changes
   
     - Modified `Token` variants to store `&'a str` instead of `String` for:
       - `Word` tokens (identifiers like table/column names)
       - `SingleQuotedString` literals
       - `Whitespace`
       - Comments (single-line and multi-line)
     - Implemented case-insensitive keyword lookup without `to_uppercase()` 
allocation
     - Added `tokenize_bench` criterion benchmark for performance measurement
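
One possible way to look up keywords case-insensitively without calling `to_uppercase()` is sketched below. This is an assumption about the general approach, not the PR's actual code: the `Keyword` enum, the `KEYWORDS` table, and `lookup_keyword` are hypothetical names, and the real keyword list is far larger. The comparison uppercases one byte at a time inside the iterator, so no temporary `String` is created.

```rust
/// Hypothetical keyword enum with only three entries, for illustration.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Keyword {
    From,
    Select,
    Where,
}

/// Keyword table sorted by its uppercase spelling, so binary search is valid.
const KEYWORDS: &[(&str, Keyword)] = &[
    ("FROM", Keyword::From),
    ("SELECT", Keyword::Select),
    ("WHERE", Keyword::Where),
];

/// Returns the keyword matching `word`, ignoring ASCII case.
/// Each comparison uppercases bytes lazily instead of allocating a new string.
fn lookup_keyword(word: &str) -> Option<Keyword> {
    KEYWORDS
        .binary_search_by(|(kw, _)| {
            // Order the table entry relative to the uppercased candidate.
            kw.bytes()
                .cmp(word.bytes().map(|b| b.to_ascii_uppercase()))
        })
        .ok()
        .map(|i| KEYWORDS[i].1)
}
```

With this sketch, `lookup_keyword("select")` and `lookup_keyword("SeLeCt")` both resolve to `Keyword::Select`, while a non-keyword such as `users` returns `None`. Only ASCII case folding is applied, which is assumed to be sufficient because SQL keywords are ASCII.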
   
     ## Performance Impact
   
     Benchmark results using a complex 27KB SQL query with CTEs, joins, window 
functions, and extensive comments:
   
     tokenization/tokenize_complex_sql
         time:   [254.68 µs 254.81 µs 254.97 µs]
         change: [−60.885% −60.682% −60.482%] (p = 0.00 < 0.05)
         Performance has improved.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

