Re: Allowing Unicode Whitespace in Lexer

2024-03-27 Thread Mich Talebzadeh
Looks fine, except that processing all Unicode whitespace characters might
add overhead to the parsing process, potentially impacting performance.
Although I think this is a moot point.

+1

Mich Talebzadeh,
Technologist | Solutions Architect | Data Engineer  | Generative AI
London
United Kingdom



On Wed, 27 Mar 2024 at 22:57, Gengliang Wang  wrote:

> +1, this is a reasonable change.
>
> Gengliang


Re: Allowing Unicode Whitespace in Lexer

2024-03-27 Thread Gengliang Wang
+1, this is a reasonable change.

Gengliang

On Wed, Mar 27, 2024 at 9:54 AM serge rielau.com  wrote:

> Going once, going twice, …. last call for objections


Re: Allowing Unicode Whitespace in Lexer

2024-03-27 Thread serge rielau . com
Going once, going twice, …. last call for objections
On Mar 23, 2024 at 5:29 PM -0700, serge rielau.com wrote:
> Hello,
>
> I have a PR https://github.com/apache/spark/pull/45620 ready to go that
> will extend the definition of whitespace (what separates tokens) from the
> small set of ASCII characters space, tab, linefeed to those defined in
> Unicode.
> While this is a small and safe change, it is one where we would have a
> hard time changing our minds about later.
> It is also a change that, AFAIK, cannot be controlled under a config.
>
> What does the community think?
>
> Cheers
> Serge
> SQL Architect at Databricks



Re: Allowing Unicode Whitespace in Lexer

2024-03-27 Thread serge rielau . com
Yeah, I heard about that. This IMHO is a bit more worrying, and we do not have
the "excuse" that it is transparent.
Also, which of these would be STRING and which IDENTIFIER?
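
To make the concern concrete, here is a minimal Python sketch (not Spark code) that lists four typographic quote characters with their Unicode names and general categories. The choice of these four code points is an assumption made for illustration. The sketch only shows that they are visible punctuation rather than whitespace; it does not answer which of them should open a STRING and which an IDENTIFIER.

```python
import unicodedata

# Typographic ("smart") quotes; the set chosen here is an assumption.
SMART_QUOTES = "\u2018\u2019\u201c\u201d"


def describe(ch):
    """Return (code point, Unicode name, general category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))


for ch in SMART_QUOTES:
    # None of these are whitespace, so accepting them changes what a
    # visible character in the query text means.
    print(describe(ch), "whitespace" if ch.isspace() else "visible")
```

Unlike extra whitespace, which only separates tokens, each of these characters would need an explicit rule in the grammar.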

On Mar 25, 2024 at 1:06 PM -0700, Alex Cruise wrote:
> While we're at it, maybe consider allowing "smart quotes" too :)
>
> -0xe1a




Re: Allowing Unicode Whitespace in Lexer

2024-03-25 Thread Alex Cruise
While we're at it, maybe consider allowing "smart quotes" too :)

-0xe1a

On Sat, Mar 23, 2024 at 5:29 PM serge rielau.com  wrote:

> Hello,
>
> I have a PR https://github.com/apache/spark/pull/45620  ready to go that
> will extend the definition of whitespace (what separates tokens) from the
> small set of ASCII characters space, tab, linefeed to those defined in
> Unicode.
> While this is a small and safe change, it is one where we would have a
> hard time changing our minds about later.
> It is also a change that, AFAIK, cannot be controlled under a config.
>
> What does the community think?
>
> Cheers
> Serge
> SQL Architect at Databricks
>
>


Allowing Unicode Whitespace in Lexer

2024-03-23 Thread serge rielau . com
Hello,

I have a PR https://github.com/apache/spark/pull/45620 ready to go that will
extend the definition of whitespace (what separates tokens) from the small set
of ASCII characters space, tab, linefeed to those defined in Unicode.
While this is a small and safe change, it is one where we would have a hard
time changing our minds about later.
It is also a change that, AFAIK, cannot be controlled under a config.

What does the community think?

Cheers
Serge
SQL Architect at Databricks
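
The effect of the proposal can be illustrated with a small Python sketch. This is not the Spark lexer and not the code in the PR. It assumes that "those defined in Unicode" means the 25 code points carrying the Unicode White_Space property, which the thread does not state. The names ASCII_WS, UNICODE_WS and split_tokens are hypothetical.

```python
# Hypothetical illustration only; not the Spark lexer or the PR's grammar.

# The "small set" named in the message: space, tab, linefeed.
ASCII_WS = {" ", "\t", "\n"}

# Assumed target set: code points with the Unicode White_Space property.
UNICODE_WS = (
    {chr(c) for c in range(0x0009, 0x000E)}    # tab, LF, VT, FF, CR
    | {chr(c) for c in range(0x2000, 0x200B)}  # en quad .. hair space
    | {chr(c) for c in (0x0020, 0x0085, 0x00A0, 0x1680, 0x2028,
                        0x2029, 0x202F, 0x205F, 0x3000)}
)


def split_tokens(text, whitespace):
    """Split text into tokens on any character found in `whitespace`."""
    tokens, current = [], []
    for ch in text:
        if ch in whitespace:
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


# A query whose separators are a no-break space, an ideographic space
# and an em space, as can happen after pasting from a document.
query = "SELECT\u00a01\u3000AS\u2003x"

print(split_tokens(query, ASCII_WS))    # the whole string stays one token
print(split_tokens(query, UNICODE_WS))  # ['SELECT', '1', 'AS', 'x']
```

With the ASCII set, the pasted query is not separated into tokens at all. With the wider set, it splits the same way as if plain spaces had been typed.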