iffyio commented on code in PR #2073:
URL: https://github.com/apache/datafusion-sqlparser-rs/pull/2073#discussion_r2473067695


##########
src/tokenizer.rs:
##########
@@ -769,6 +775,12 @@ impl State<'_> {
         self.peekable.peek()
     }
 
+    /// return the character after the next character (lookahead by 2) without advancing the stream
+    pub fn peek_next(&self) -> Option<char> {

Review Comment:
   ```suggestion
       /// Return the `nth` next character without advancing the stream.
       pub fn peek_nth(&self, n: usize) -> Option<char> {
   ```
   Thinking we can make this more generic so that it can be reused in other contexts? Similar to the `peek_nth*` parser methods.
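   To make the suggestion concrete, a standalone sketch of such a generic lookahead over a source string and byte offset (names and shape hypothetical, independent of the PR's `State` type):

   ```rust
   /// Peek the `n`th character at or after `byte_pos` without advancing.
   /// `str::get` returns `None` for out-of-range or non-boundary offsets,
   /// so a bad `byte_pos` cannot panic here.
   fn peek_nth(source: &str, byte_pos: usize, n: usize) -> Option<char> {
       source.get(byte_pos..)?.chars().nth(n)
   }

   fn main() {
       assert_eq!(peek_nth("SELECT", 0, 0), Some('S'));
       assert_eq!(peek_nth("SELECT", 0, 1), Some('E'));
       assert_eq!(peek_nth("SELECT", 6, 0), None); // at EOF: no panic
   }
   ```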



##########
src/tokenizer.rs:
##########
@@ -912,18 +926,21 @@ impl<'a> Tokenizer<'a> {
     fn tokenize_identifier_or_keyword(
         &self,
         ch: impl IntoIterator<Item = char>,
-        chars: &mut State,
+        chars: &mut State<'a>,
     ) -> Result<Option<Token>, TokenizerError> {
         chars.next(); // consume the first char
-        let ch: String = ch.into_iter().collect();
-        let word = self.tokenize_word(ch, chars);
+                      // Calculate total byte length without allocating a String
+        let consumed_byte_len: usize = ch.into_iter().map(|c| c.len_utf8()).sum();

Review Comment:
   I wonder if we can instead replace the `ch` parameter to this `tokenize_identifier_or_keyword` function with the actual `consumed_byte_len: usize`?
   
   If I'm understanding the intent/requirement of this change correctly, the caller of this function should have that value on hand, given that we seem to require that `ch` contains the preceding characters in the stream?
   
   The current flow was initially unclear to me: the caller passes in an iterator of items whose contents we don't use here, and it required digging further into `tokenize_word` to realize why that was the case.
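   For what it's worth, the bookkeeping the caller would need is cheap; a standalone sketch (hypothetical helper, not the PR's code) of how the consumed chars' `len_utf8()` sum lines up with byte offsets:

   ```rust
   /// Byte length of a sequence of already-consumed chars.
   fn consumed_byte_len(chars: impl IntoIterator<Item = char>) -> usize {
       chars.into_iter().map(|c| c.len_utf8()).sum()
   }

   fn main() {
       // ASCII chars are 1 byte each; 'é' is 2 bytes in UTF-8.
       assert_eq!(consumed_byte_len(['N']), 1);
       assert_eq!(consumed_byte_len(['é']), 2);
       // The sum matches the byte offset of the rest of the string.
       let s = "état";
       assert_eq!(consumed_byte_len(s.chars().take(1)), s.len() - s[2..].len());
   }
   ```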



##########
src/tokenizer.rs:
##########
@@ -769,6 +775,12 @@ impl State<'_> {
         self.peekable.peek()
     }
 
+    /// return the character after the next character (lookahead by 2) without advancing the stream
+    pub fn peek_next(&self) -> Option<char> {
+        // Use the source and byte_pos instead of cloning the peekable iterator

Review Comment:
   Can we add a guard here to check that the index is safe, so that we don't panic if we hit a bug or the API is being misused? e.g.
   ```rust
   if self.byte_pos >= self.source.len() {
       return None
   }
   ``` 



##########
src/tokenizer.rs:
##########
@@ -1876,13 +1893,26 @@ impl<'a> Tokenizer<'a> {
         comment
     }
 
-    /// Tokenize an identifier or keyword, after the first char is already consumed.
-    fn tokenize_word(&self, first_chars: impl Into<String>, chars: &mut State) -> String {
-        let mut s = first_chars.into();
-        s.push_str(&peeking_take_while(chars, |ch| {
-            self.dialect.is_identifier_part(ch)
-        }));
-        s
+    /// Tokenize an identifier or keyword, after the first char(s) have already been consumed.
+    /// `consumed_byte_len` is the byte length of the consumed character(s).
+    fn tokenize_word(&self, consumed_byte_len: usize, chars: &mut State<'a>) -> String {
+        // Calculate where the first character started
+        let first_char_byte_pos = chars.byte_pos - consumed_byte_len;
+
+        // Use the zero-copy version and convert to String
+        self.tokenize_word_borrowed(first_char_byte_pos, chars)
+            .to_string()
+    }
+
+    /// Tokenize an identifier or keyword, returning a borrowed slice when possible.
+    /// The first character position must be provided (before it was consumed).
+    /// Returns a slice with the same lifetime as the State's source.
+    fn tokenize_word_borrowed(&self, first_char_byte_pos: usize, chars: &mut State<'a>) -> &'a str {
+        // Consume the rest of the word
+        borrow_slice_until(chars, |ch| self.dialect.is_identifier_part(ch));
+
+        // Return a slice from the first char to the current position
+        &chars.source[first_char_byte_pos..chars.byte_pos]

Review Comment:
   Similarly here we can check that the indexing is safe and maybe return a literal `""` if not?
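   A standalone sketch of what such a guard could look like (helper name hypothetical); `str::get` makes the bounds check explicit and falls back to `""` instead of panicking:

   ```rust
   /// Slice `source[start..end]`, returning "" if the range is out of
   /// bounds, inverted, or not on char boundaries.
   fn slice_or_empty(source: &str, start: usize, end: usize) -> &str {
       source.get(start..end).unwrap_or("")
   }

   fn main() {
       assert_eq!(slice_or_empty("SELECT 1", 0, 6), "SELECT");
       assert_eq!(slice_or_empty("SELECT 1", 6, 99), ""); // out of bounds
       assert_eq!(slice_or_empty("SELECT 1", 6, 2), ""); // inverted range
   }
   ```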



##########
src/tokenizer.rs:
##########
@@ -912,18 +926,21 @@ impl<'a> Tokenizer<'a> {
     fn tokenize_identifier_or_keyword(
         &self,
         ch: impl IntoIterator<Item = char>,
-        chars: &mut State,
+        chars: &mut State<'a>,
     ) -> Result<Option<Token>, TokenizerError> {
         chars.next(); // consume the first char
-        let ch: String = ch.into_iter().collect();
-        let word = self.tokenize_word(ch, chars);
+                      // Calculate total byte length without allocating a String

Review Comment:
   The indentation looks a bit off on this line?



##########
src/tokenizer.rs:
##########
@@ -1876,13 +1893,26 @@ impl<'a> Tokenizer<'a> {
         comment
     }
 
-    /// Tokenize an identifier or keyword, after the first char is already consumed.
-    fn tokenize_word(&self, first_chars: impl Into<String>, chars: &mut State) -> String {
-        let mut s = first_chars.into();
-        s.push_str(&peeking_take_while(chars, |ch| {
-            self.dialect.is_identifier_part(ch)
-        }));
-        s
+    /// Tokenize an identifier or keyword, after the first char(s) have already been consumed.
+    /// `consumed_byte_len` is the byte length of the consumed character(s).
+    fn tokenize_word(&self, consumed_byte_len: usize, chars: &mut State<'a>) -> String {
+        // Calculate where the first character started
+        let first_char_byte_pos = chars.byte_pos - consumed_byte_len;

Review Comment:
   Can we add a check that the operation doesn't overflow? e.g.
   ```rust
   if consumed_byte_len >= chars.byte_pos {
       return "".to_string()
   }
   ```



##########
src/tokenizer.rs:
##########
@@ -2176,35 +2206,82 @@ impl<'a> Tokenizer<'a> {
 /// Read from `chars` until `predicate` returns `false` or EOF is hit.
 /// Return the characters read as String, and keep the first non-matching
 /// char available as `chars.next()`.
-fn peeking_take_while(chars: &mut State, mut predicate: impl FnMut(char) -> bool) -> String {
-    let mut s = String::new();
+fn peeking_take_while(chars: &mut State, predicate: impl FnMut(char) -> bool) -> String {
+    borrow_slice_until(chars, predicate).to_string()
+}
+
+/// Borrow a slice from the original string until `predicate` returns `false` or EOF is hit.
+///
+/// # Arguments
+/// * `chars` - The character iterator state (contains reference to original source)
+/// * `predicate` - Function that returns true while we should continue taking characters
+///
+/// # Returns
+/// A borrowed slice of the source string containing the matched characters
+fn borrow_slice_until<'a>(

Review Comment:
   ```suggestion
   fn peeking_take_while_ref<'a>(
   ```
   Thinking so that it's clear that the function is related to `peeking_take_while`.



##########
src/tokenizer.rs:
##########
@@ -2176,35 +2206,82 @@ impl<'a> Tokenizer<'a> {
 /// Read from `chars` until `predicate` returns `false` or EOF is hit.
 /// Return the characters read as String, and keep the first non-matching
 /// char available as `chars.next()`.
-fn peeking_take_while(chars: &mut State, mut predicate: impl FnMut(char) -> bool) -> String {
-    let mut s = String::new();
+fn peeking_take_while(chars: &mut State, predicate: impl FnMut(char) -> bool) -> String {
+    borrow_slice_until(chars, predicate).to_string()
+}
+
+/// Borrow a slice from the original string until `predicate` returns `false` or EOF is hit.
+///
+/// # Arguments
+/// * `chars` - The character iterator state (contains reference to original source)
+/// * `predicate` - Function that returns true while we should continue taking characters
+///
+/// # Returns
+/// A borrowed slice of the source string containing the matched characters

Review Comment:
   Doc-wise I think it's easier to only reference the `peeking_take_while` method and mention just the difference to this function; that way we describe the functionality once. Also, the project doesn't use the `# Arguments`, `# Returns`, etc. documentation format, so we can skip that for consistency I think.
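   For instance, the doc comment could shrink to something like (wording only a sketch):

   ```rust
   /// Zero-copy variant of [`peeking_take_while`]: identical behavior, but
   /// returns a borrowed slice of the source instead of an allocated `String`.
   ```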



##########
src/tokenizer.rs:
##########
@@ -743,8 +743,12 @@ impl std::error::Error for TokenizerError {}
 
 struct State<'a> {
     peekable: Peekable<Chars<'a>>,
+    /// Reference to the original source string being tokenized
+    source: &'a str,
     pub line: u64,
     pub col: u64,
+    /// Byte position in the source string
+    pub byte_pos: usize,

Review Comment:
   ```suggestion
       byte_pos: usize,
   ```
   



##########
src/tokenizer.rs:
##########
@@ -2176,35 +2206,82 @@ impl<'a> Tokenizer<'a> {
 /// Read from `chars` until `predicate` returns `false` or EOF is hit.
 /// Return the characters read as String, and keep the first non-matching
 /// char available as `chars.next()`.
-fn peeking_take_while(chars: &mut State, mut predicate: impl FnMut(char) -> bool) -> String {
-    let mut s = String::new();
+fn peeking_take_while(chars: &mut State, predicate: impl FnMut(char) -> bool) -> String {
+    borrow_slice_until(chars, predicate).to_string()
+}
+
+/// Borrow a slice from the original string until `predicate` returns `false` or EOF is hit.
+///
+/// # Arguments
+/// * `chars` - The character iterator state (contains reference to original source)
+/// * `predicate` - Function that returns true while we should continue taking characters
+///
+/// # Returns
+/// A borrowed slice of the source string containing the matched characters
+fn borrow_slice_until<'a>(
+    chars: &mut State<'a>,
+    mut predicate: impl FnMut(char) -> bool,
+) -> &'a str {
+    // Record the starting byte position
+    let start_pos = chars.byte_pos;
+
+    // Consume characters while predicate is true
     while let Some(&ch) = chars.peek() {
         if predicate(ch) {
-            chars.next(); // consume
-            s.push(ch);
+            chars.next(); // consume (this updates byte_pos)
         } else {
             break;
         }
     }
-    s
+
+    // Get the ending byte position
+    let end_pos = chars.byte_pos;
+
+    // Return the slice from the original source
+    &chars.source[start_pos..end_pos]
 }
 
-/// Same as peeking_take_while, but also passes the next character to the predicate.
-fn peeking_next_take_while(
-    chars: &mut State,
+/// Borrow a slice from the original string until `predicate` returns `false` or EOF is hit.
+/// This version also passes the next character to the predicate for lookahead.
+/// This is a zero-copy version of `peeking_next_take_while`.
+///
+/// # Arguments
+/// * `chars` - The character iterator state (contains reference to original source)
+/// * `predicate` - Function that returns true while we should continue taking characters.
+///   Takes current char and optional next char for lookahead.
+///
+/// # Returns
+/// A borrowed slice of the source string containing the matched characters
+fn borrow_slice_until_next<'a>(

Review Comment:
   wondering, given this function is quite similar, would it make sense to implement `borrow_slice_until` as follows instead, to reuse the same definition/impl?
   
   ```rust
   fn borrow_slice_until<'a>(
       chars: &mut State<'a>,
       mut predicate: impl FnMut(char) -> bool,
   ) -> &'a str {
       borrow_slice_until_next(chars, |ch, _| predicate(ch))
   }
   ```



##########
src/tokenizer.rs:
##########
@@ -2176,35 +2206,82 @@ impl<'a> Tokenizer<'a> {
 /// Read from `chars` until `predicate` returns `false` or EOF is hit.
 /// Return the characters read as String, and keep the first non-matching
 /// char available as `chars.next()`.
-fn peeking_take_while(chars: &mut State, mut predicate: impl FnMut(char) -> bool) -> String {
-    let mut s = String::new();
+fn peeking_take_while(chars: &mut State, predicate: impl FnMut(char) -> bool) -> String {
+    borrow_slice_until(chars, predicate).to_string()
+}
+
+/// Borrow a slice from the original string until `predicate` returns `false` or EOF is hit.
+///
+/// # Arguments
+/// * `chars` - The character iterator state (contains reference to original source)
+/// * `predicate` - Function that returns true while we should continue taking characters
+///
+/// # Returns
+/// A borrowed slice of the source string containing the matched characters
+fn borrow_slice_until<'a>(
+    chars: &mut State<'a>,
+    mut predicate: impl FnMut(char) -> bool,
+) -> &'a str {
+    // Record the starting byte position
+    let start_pos = chars.byte_pos;
+
+    // Consume characters while predicate is true
     while let Some(&ch) = chars.peek() {
         if predicate(ch) {
-            chars.next(); // consume
-            s.push(ch);
+            chars.next(); // consume (this updates byte_pos)
         } else {
             break;
         }
     }
-    s
+
+    // Get the ending byte position
+    let end_pos = chars.byte_pos;
+
+    // Return the slice from the original source
+    &chars.source[start_pos..end_pos]
 }
 
-/// Same as peeking_take_while, but also passes the next character to the predicate.
-fn peeking_next_take_while(
-    chars: &mut State,
+/// Borrow a slice from the original string until `predicate` returns `false` or EOF is hit.

Review Comment:
   just flagging: similar comments for `borrow_slice_until` apply to this function



##########
src/tokenizer.rs:
##########
@@ -2176,35 +2206,82 @@ impl<'a> Tokenizer<'a> {
 /// Read from `chars` until `predicate` returns `false` or EOF is hit.
 /// Return the characters read as String, and keep the first non-matching
 /// char available as `chars.next()`.
-fn peeking_take_while(chars: &mut State, mut predicate: impl FnMut(char) -> bool) -> String {
-    let mut s = String::new();
+fn peeking_take_while(chars: &mut State, predicate: impl FnMut(char) -> bool) -> String {
+    borrow_slice_until(chars, predicate).to_string()
+}
+
+/// Borrow a slice from the original string until `predicate` returns `false` or EOF is hit.
+///
+/// # Arguments
+/// * `chars` - The character iterator state (contains reference to original source)
+/// * `predicate` - Function that returns true while we should continue taking characters
+///
+/// # Returns
+/// A borrowed slice of the source string containing the matched characters
+fn borrow_slice_until<'a>(
+    chars: &mut State<'a>,
+    mut predicate: impl FnMut(char) -> bool,
+) -> &'a str {
+    // Record the starting byte position

Review Comment:
   We can sanity-check the index before using it here as well?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

