[PR] Remove Whitespace Tokens from Parser [datafusion-sqlparser-rs]

via GitHub Wed, 29 Oct 2025 03:08:58 -0700


LucaCappelletti94 opened a new pull request, #2077:
URL: https://github.com/apache/datafusion-sqlparser-rs/pull/2077


   This PR implements a significant architectural refactoring by moving 
whitespace filtering from the parser to the tokenizer. Instead of emitting 
whitespace tokens (spaces, tabs, newlines, comments) and filtering them 
throughout the parser logic, the tokenizer now consumes whitespace during 
tokenization and never emits these tokens.
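
To make the idea concrete, here is a minimal sketch of a tokenizer that consumes whitespace and `--` line comments instead of emitting tokens for them. `ToyToken` and `tokenize_skipping_whitespace` are hypothetical names invented for this illustration; they are not the crate's `Token` enum or `Tokenizer` API, and block comments, string literals and multi-character operators are deliberately left out.

```rust
/// Toy token type for illustration only; NOT the crate's `Token` enum.
#[derive(Debug, Clone, PartialEq)]
enum ToyToken {
    Word(String),
    Number(String),
    Symbol(char),
}

/// Tokenize `input`, consuming whitespace and `--` line comments
/// without ever pushing a token for them.
fn tokenize_skipping_whitespace(input: &str) -> Vec<ToyToken> {
    let mut tokens = Vec::new();
    let mut chars = input.chars().peekable();

    while let Some(&c) = chars.peek() {
        if c.is_whitespace() {
            // Whitespace is consumed here and never reaches the parser.
            chars.next();
        } else if c == '-' {
            chars.next();
            if chars.peek() == Some(&'-') {
                // Line comment: discard everything up to the end of the line.
                while let Some(&n) = chars.peek() {
                    if n == '\n' {
                        break;
                    }
                    chars.next();
                }
            } else {
                tokens.push(ToyToken::Symbol('-'));
            }
        } else if c.is_alphabetic() || c == '_' {
            let mut word = String::new();
            while let Some(&n) = chars.peek() {
                if n.is_alphanumeric() || n == '_' {
                    word.push(n);
                    chars.next();
                } else {
                    break;
                }
            }
            tokens.push(ToyToken::Word(word));
        } else if c.is_ascii_digit() {
            let mut number = String::new();
            while let Some(&n) = chars.peek() {
                if n.is_ascii_digit() {
                    number.push(n);
                    chars.next();
                } else {
                    break;
                }
            }
            tokens.push(ToyToken::Number(number));
        } else {
            // Anything else becomes a single-character symbol in this sketch.
            chars.next();
            tokens.push(ToyToken::Symbol(c));
        }
    }
    tokens
}
```

With this shape, the parser only ever sees meaningful tokens, so it needs no skipping logic of its own.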
   
   While some duplicated logic still remains in the parser (to be addressed in 
future PRs), this change eliminates a substantial amount of looping overhead. 
This PR lays the groundwork for a cleaner streaming version, in which tokens 
are parsed into statements as they are produced, with no parser memory and only 
local context passed between parser function calls.
   
   Fixes #2076
   
   ## Motivation
   
   As discussed in #2076, whitespace tokens were being filtered at numerous 
points throughout the parser. This approach had several drawbacks:
   
   - **Poor separation of concerns**: Whitespace handling was scattered across 
both tokenizer and parser
   - **Memory overhead**: Whitespace tokens were stored in memory unnecessarily
   - **Code duplication**: Multiple loops throughout the parser to skip 
whitespace tokens, looking ahead or backwards for non-whitespace tokens
   - **Performance**: Each token access required checking and skipping 
whitespace tokens
   
   The parser had extensive whitespace-handling logic scattered throughout:
   
   **Functions with whitespace-skipping loops:**
   - [`peek_tokens_with_location`](https://github.com/apache/datafusion-sqlparser-rs/blob/67684c84d4c2589356c411ea4917dcf1defcd77c/src/parser/mod.rs#L4028-L4049) - loops to skip whitespace
   - [`peek_tokens_ref`](https://github.com/apache/datafusion-sqlparser-rs/blob/67684c84d4c2589356c411ea4917dcf1defcd77c/src/parser/mod.rs#L4055-L4069) - loops to skip whitespace
   - [`peek_nth_token_ref`](https://github.com/apache/datafusion-sqlparser-rs/blob/67684c84d4c2589356c411ea4917dcf1defcd77c/src/parser/mod.rs#L4077-L4094) - loops to skip whitespace
   - [`advance_token`](https://github.com/apache/datafusion-sqlparser-rs/blob/67684c84d4c2589356c411ea4917dcf1defcd77c/src/parser/mod.rs#L4149-L4160) - loops to skip whitespace
   - [`prev_token`](https://github.com/apache/datafusion-sqlparser-rs/blob/67684c84d4c2589356c411ea4917dcf1defcd77c/src/parser/mod.rs#L4183-L4202) - loops backward to skip whitespace
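
As a rough illustration of what these loops cost, the sketch below contrasts a lookahead over a stream that still contains whitespace with one over a pre-filtered stream. `Tok`, `peek_nth_skipping` and `peek_nth_direct` are hypothetical names for this example only; the real functions linked above operate on the crate's own token types and differ in detail.

```rust
/// Toy token type for illustration only; NOT the crate's token type.
#[derive(Debug, Clone, PartialEq)]
enum Tok {
    Word(String),
    Whitespace,
}

/// Lookahead over a stream that still contains whitespace:
/// every call has to loop past the whitespace tokens.
fn peek_nth_skipping(tokens: &[Tok], start: usize, mut n: usize) -> Option<&Tok> {
    let mut index = start;
    loop {
        match tokens.get(index) {
            // Whitespace does not count toward `n`; just step over it.
            Some(Tok::Whitespace) => index += 1,
            Some(tok) => {
                if n == 0 {
                    return Some(tok);
                }
                n -= 1;
                index += 1;
            }
            // Ran off the end of the stream.
            None => return None,
        }
    }
}

/// Lookahead over a stream with no whitespace tokens:
/// a plain index computation, no loop required.
fn peek_nth_direct(tokens: &[Tok], start: usize, n: usize) -> Option<&Tok> {
    tokens.get(start + n)
}
```

Both functions return the same token for equivalent input, but the second does so in constant time and without any branching on token kind.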
   
   **Special variant functions that are now obsolete:**
   - [`peek_token_no_skip`](https://github.com/apache/datafusion-sqlparser-rs/blob/67684c84d4c2589356c411ea4917dcf1defcd77c/src/parser/mod.rs#L4096-L4100) - **removed entirely** (no longer needed)
   - [`peek_nth_token_no_skip`](https://github.com/apache/datafusion-sqlparser-rs/blob/67684c84d4c2589356c411ea4917dcf1defcd77c/src/parser/mod.rs#L4102-L4111) - **removed entirely** (no longer needed)
   - [`next_token_no_skip`](https://github.com/apache/datafusion-sqlparser-rs/blob/67684c84d4c2589356c411ea4917dcf1defcd77c/src/parser/mod.rs#L4140-L4144) - **removed entirely** (no longer needed)
   
   Since SQL is not a whitespace-sensitive language (unlike Python), it 
*should be* safe to remove whitespace tokens entirely after tokenization.
   
   ## Handling Edge Cases
   
   While SQL is generally not whitespace-sensitive, there are specific edge 
cases that require careful consideration:
   
   ### 1. PostgreSQL COPY FROM STDIN
   
   The `COPY FROM STDIN` statement requires preserving the actual data content, 
which may include meaningful whitespace and newlines. The data section is 
treated as raw input that should be parsed according to the specified format 
(tab-delimited, CSV, etc.).
   
   **Solution**: The tokenizer now properly handles this by consuming the data 
as a single token. The parser then parses the body of the CSV-like string, 
which was not handled correctly before this refactoring. I have extended the 
associated tests appropriately.
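
For readers unfamiliar with the data section, the following is a minimal sketch of parsing such a body once it has been captured as one raw string. It assumes PostgreSQL's default text format (tab-separated columns, `\N` for NULL, `\.` as the end-of-data marker) and ignores backslash escapes and the CSV format. `parse_copy_text_body` is a hypothetical helper written for this example and is not the function added in this PR.

```rust
/// Parse the raw body of a `COPY ... FROM STDIN` in the default text format.
/// Hypothetical helper for illustration; NOT the parser code from this PR.
/// Assumptions: tab-delimited columns, `\N` means NULL, `\.` ends the data,
/// and backslash escape sequences are not decoded.
fn parse_copy_text_body(body: &str) -> Vec<Vec<Option<String>>> {
    let mut rows = Vec::new();
    for line in body.lines() {
        // `\.` on a line by itself marks the end of the data section.
        if line == "\\." {
            break;
        }
        let row: Vec<Option<String>> = line
            .split('\t')
            .map(|field| {
                if field == "\\N" {
                    None // NULL marker in the text format
                } else {
                    // Spaces inside a field are data and must be preserved.
                    Some(field.to_string())
                }
            })
            .collect();
        rows.push(row);
    }
    rows
}
```

The point of the sketch is that spaces and tabs inside the body carry meaning, which is why the body has to bypass ordinary whitespace skipping and reach the parser as a single token.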
   
   ### 2. Hyphenated and path identifiers
   
   The tokenizer now includes enhanced logic for hyphenated identifier parsing 
with proper validation:
   
   - Detects when hyphens/paths/tildes are part of identifiers vs. operators
   - Validates that identifiers don't start with digits after hyphens
   - Ensures identifiers don't end with trailing hyphens
   - Handles the whitespace-dependent context correctly
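
One possible reading of the validation rules above can be sketched as follows. This is an interpretation, not the tokenizer code from this PR: it assumes every hyphen-separated segment must itself be a plain identifier (so no segment starts with a digit and none is empty), and it does not cover path or tilde handling. `is_valid_hyphenated_identifier` and `is_plain_identifier` are hypothetical names.

```rust
/// Returns true if `segment` looks like a plain identifier: a letter or
/// underscore followed by letters, digits or underscores.
/// Hypothetical helper for illustration only.
fn is_plain_identifier(segment: &str) -> bool {
    let mut chars = segment.chars();
    match chars.next() {
        Some(c) if c.is_alphabetic() || c == '_' => {
            chars.all(|c| c.is_alphanumeric() || c == '_')
        }
        // Empty segment, or a segment starting with a digit or symbol.
        _ => false,
    }
}

/// Assumed rule set (an interpretation of the bullets above):
/// - each hyphen-separated segment is a plain identifier, so no segment
///   starts with a digit;
/// - no empty segment, so no leading, trailing or doubled hyphen.
fn is_valid_hyphenated_identifier(candidate: &str) -> bool {
    candidate.split('-').all(is_plain_identifier)
}
```

Under these assumptions `my-project` is accepted, while `my-` (trailing hyphen) and `my-1project` (digit after a hyphen) are rejected and would be tokenized as an identifier followed by an operator instead.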
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

