eyalsatori commented on issue #2036:
URL: 
https://github.com/apache/datafusion-sqlparser-rs/issues/2036#issuecomment-3406080563

   Some thoughts on this topic:  
   
   
   I've been thinking about the Arc suggestion from the previous discussion, 
and I see some potential issues:
   The tokenizer currently holds only a reference to the original query. If we 
used Arc, the library would need to create an owned copy of the query, which 
defeats the point of zero-copy tokenization. Additionally, since each token 
would need to reference a different slice of that shared string, we'd have to 
implement something similar to the `bytes` crate.
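To make that concrete, here is a rough sketch (the names and shapes are mine, not a proposal for the actual API) of the bookkeeping each token would need, since a plain `&str` slice can no longer be stored alongside a shared buffer:

```rust
use std::sync::Arc;

// Hypothetical sketch: with a shared Arc buffer, each token has to carry
// the buffer plus an offset/length pair instead of a plain &str slice,
// essentially re-implementing what the `bytes` crate does for byte buffers.
struct ArcToken {
    source: Arc<String>, // owned copy of the whole query, shared by all tokens
    start: usize,
    end: usize,
}

impl ArcToken {
    // Reconstruct the token's text from the shared buffer on demand.
    fn as_str(&self) -> &str {
        &self.source[self.start..self.end]
    }
}

fn main() {
    let query = Arc::new(String::from("SELECT 1"));
    let keyword = ArcToken { source: Arc::clone(&query), start: 0, end: 6 };
    assert_eq!(keyword.as_str(), "SELECT");
}
```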
   **Alternative approach using lifetimes**
   
   
   Another option is to introduce lifetimes directly into the Token enum:
   
   ```rust
   pub enum Token<'a> {
       /// An end-of-file marker, not a real token
       EOF,
       /// A keyword (like SELECT) or an optionally quoted SQL identifier
       Word(Word<'a>),
       /// An unsigned numeric literal
       Number(&'a str, bool),
       // ...
   }
   ```
   
   
   And the tokenize method would become:
   `pub fn tokenize(&'a mut self) -> Result<Vec<Token<'a>>, TokenizerError>` 
   
   
   However, this binds each Token's lifetime to the original query string. While 
that is exactly the goal of this task, it would break current users who keep 
tokens around after the original string has been dropped: code that compiles 
today would stop compiling. Such a change could be quite disruptive for 
existing code.
   
   
   I'd like to suggest a different approach that I think could work better:
   
   1. Rename the current `Token` enum to `BorrowedToken<'a>` and recursively 
replace every `String` field with `Cow<'a, str>`:
   ```rust
   pub enum BorrowedToken<'a> {
       /// An end-of-file marker, not a real token
       EOF,
       /// A keyword (like SELECT) or an optionally quoted SQL identifier
       Word(Word<'a>),
       /// An unsigned numeric literal
       Number(Cow<'a, str>, bool),
       // ...
   }
   ```
   
   2. Define a type alias:
   `type Token = BorrowedToken<'static>;`
   
   3. Create a new `borrow_tokenize` method that contains all the tokenization 
logic without copying strings:
   `pub fn borrow_tokenize(&'a mut self) -> Result<Vec<BorrowedToken<'a>>, 
TokenizerError>` 
   
   4. Keep the existing tokenize method signature but change its implementation:
   ```rust
   pub fn tokenize(&'a mut self) -> Result<Vec<Token>, TokenizerError> {
       let borrow_tokens = self.borrow_tokenize()?;
       // Convert each borrowed token to its owned form, e.g. via a helper
       // that calls to_owned() on every Cow field inside the enum:
       let owned_tokens: Vec<Token> = borrow_tokens
           .into_iter()
           .map(|t| t.into_owned()) // hypothetical helper, to be written
           .collect();
       Ok(owned_tokens)
   }
   ```
   
   5. Update the parser to work with `BorrowedToken`.
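Steps 1–4 can be sketched end to end on a single variant (a toy shape, and `into_owned` is a hypothetical helper, not part of the current API):

```rust
use std::borrow::Cow;

// Toy model of the proposed enum, reduced to one data-carrying variant.
#[derive(Debug, PartialEq)]
enum BorrowedToken<'a> {
    Number(Cow<'a, str>, bool),
    Eof,
}

// Step 2: the alias works because a token whose Cow fields are all
// Cow::Owned no longer borrows anything, so it satisfies 'static.
type Token = BorrowedToken<'static>;

impl<'a> BorrowedToken<'a> {
    // Hypothetical helper for step 4: detach the token from the source
    // string by converting every Cow field to its owned form.
    fn into_owned(self) -> Token {
        match self {
            BorrowedToken::Number(s, long) => {
                BorrowedToken::Number(Cow::Owned(s.into_owned()), long)
            }
            BorrowedToken::Eof => BorrowedToken::Eof,
        }
    }
}

fn main() {
    let sql = String::from("42");
    // Zero-copy path: the Cow borrows straight from `sql`.
    let borrowed = BorrowedToken::Number(Cow::Borrowed(&sql), false);
    // Owned path: independent of `sql`, usable after it is dropped.
    let owned: Token = borrowed.into_owned();
    drop(sql);
    assert_eq!(owned, BorrowedToken::Number(Cow::Borrowed("42"), false));
}
```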
   
   Benefits of this approach:
   I think we can achieve the zero-copy goal without introducing a massive 
breaking change to the API. Existing code would continue to work as before, 
while new users could opt into the borrowed version for better performance.
   
   I'm happy to work on a prototype if this direction looks promising. What are 
your thoughts? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

