eyalsatori commented on issue #2036:
URL:
https://github.com/apache/datafusion-sqlparser-rs/issues/2036#issuecomment-3406080563
Some thoughts on this topic:
I've been thinking about the Arc suggestion from the previous discussion,
and I see some potential issues:
The tokenizer currently holds only a reference to the original query. If we
use Arc, the library would need to create an owned copy of the query, which
seems to miss the point of zero-copy tokenization. Additionally, since each
copy would need to handle a different slice of the string, we'd need to
implement something similar to the Bytes library.
Alternative approach using lifetimes
Another option is to introduce lifetimes directly into the Token enum:
```
pub enum Token<'a> {
/// An end-of-file marker, not a real token
EOF,
/// A keyword (like SELECT) or an optionally quoted SQL identifier
Word(Word<'a>),
/// An unsigned numeric literal
Number(&'a str, bool),
// ...
}
```
And the tokenize method would become:
`pub fn tokenize(&'a mut self) -> Result<Vec<Token<'a>>, TokenizerError>`
However, this would bind the Token lifetimes to the original string. While
this is the goal of this task, it could cause problems for current users who
might not account for this and could be using the tokens after the original
string is freed. Such a change could be quite disruptive for existing code.
I'd like to suggest a different approach that I think could work better:
1. Rename the current Token enum to BorrowedToken<'a> and recursively
replace all String elements with Cow<'a, str>:
```
pub enum BorrowedToken<'a> {
/// An end-of-file marker, not a real token
EOF,
/// A keyword (like SELECT) or an optionally quoted SQL identifier
Word(Word<'a>),
/// An unsigned numeric literal
Number(Cow<'a, str>, bool),
// ...
}
```
2. Define a type alias:
`type Token = BorrowedToken<'static>;`
3. Create a new borrow_tokenize method that contains all the tokenization
logic without copying strings:
`pub fn borrow_tokenize(&'a mut self) -> Result<Vec<BorrowedToken<'a>>,
TokenizerError>`
4. Keep the existing tokenize method signature but change its implementation:
```
pub fn tokenize(&'a mut self) -> Result<Vec<Token>, TokenizerError> {
let borrow_tokens = self.borrow_tokenize()?;
let owned_tokens: Vec<BorrowedToken<'static>> =
// some code that calls to_owned() on all Cow variables inside the
enum
Ok(owned_tokens)
}
```
5. Update the parser to work with BorrowedToken
Benefits of this approach:
This way, I think we can achieve our zero-copy goal without introducing a
massive breaking change to the API. Existing users would continue to work as
before, while new users could opt into the borrowed version for better
performance.
I'm happy to work on a prototype if this direction looks promising. What are
your thoughts?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]