Looks fine, except that processing all Unicode whitespace characters might add some overhead to parsing and so affect performance, although I think this is a moot point.
+1

Mich Talebzadeh,
Technologist | Solutions Architect | Data Engineer | Generative AI
London
United Kingdom

view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
https://en.everybodywiki.com/Mich_Talebzadeh

*Disclaimer:* The information provided is correct to the best of my knowledge but of course cannot be guaranteed. It is essential to note that, as with any advice, quote "one test result is worth one-thousand expert opinions (Wernher von Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".

On Wed, 27 Mar 2024 at 22:57, Gengliang Wang <ltn...@gmail.com> wrote:

> +1, this is a reasonable change.
>
> Gengliang
>
> On Wed, Mar 27, 2024 at 9:54 AM serge rielau.com <se...@rielau.com> wrote:
>
>> Going once, going twice, …. last call for objections
>> On Mar 23, 2024 at 5:29 PM -0700, serge rielau.com <se...@rielau.com>,
>> wrote:
>>
>> Hello,
>>
>> I have a PR https://github.com/apache/spark/pull/45620 ready to go that
>> will extend the definition of whitespace (what separates token) from the
>> small set of ASCII characters space, tab, linefeed to those defined in
>> Unicode.
>> While this is a small and safe change, it is one where we would have a
>> hard time changing our minds about later.
>> It is also a change that, AFAIK, cannot be controlled under a config.
>>
>> What does the community think?
>>
>> Cheers
>> Serge
>> SQL Architect at Databricks
>>
>>
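
To make the proposal in the quoted message concrete, here is a minimal Python sketch of the difference between splitting tokens on the three ASCII separators it names (space, tab, linefeed) and splitting on the code points that carry the Unicode White_Space property. This is not Spark's lexer and is not taken from the PR; the names `ASCII_WS`, `UNICODE_WS`, and `split_tokens` are hypothetical and exist only for illustration.

```python
# Illustrative sketch only: NOT Spark's lexer, and not code from the PR.

# The three ASCII separators named in the proposal: space, tab, linefeed.
ASCII_WS = frozenset(" \t\n")

# Code points with the Unicode White_Space property (25 in total).
UNICODE_WS = frozenset(
    [chr(c) for c in range(0x0009, 0x000E)]       # tab, LF, VT, FF, CR
    + [" ", "\u0085", "\u00a0", "\u1680"]         # space, NEL, NBSP, ogham space mark
    + [chr(c) for c in range(0x2000, 0x200B)]     # en quad through hair space
    + ["\u2028", "\u2029", "\u202f", "\u205f", "\u3000"]
)


def split_tokens(text, separators):
    """Split text into tokens on any character found in separators."""
    tokens, current = [], []
    for ch in text:
        if ch in separators:
            # A separator ends the current token, if there is one.
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


# A query whose separators are NBSP, ideographic space, and em space.
query = "SELECT\u00a01\u3000AS\u2003x"

ascii_tokens = split_tokens(query, ASCII_WS)      # stays one unsplit token
unicode_tokens = split_tokens(query, UNICODE_WS)  # splits into four tokens
```

In this sketch the per-character check is a set lookup, so it costs about the same whether the set holds 3 or 25 entries. That only illustrates why the overhead concern may be minor; it says nothing measured about Spark's actual parser.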