Re: [julia-users] Re: split a utf-8 string

Pontus Stenetorp Sun, 22 Nov 2015 03:30:07 -0800

On 22 November 2015 at 01:46,  <ele...@gmail.com> wrote:
>
> On Sunday, November 22, 2015 at 10:02:03 AM UTC+10, James Gilbert wrote:
>>
>> The spaces in your string are '\u3000' the ideographic space.
>> isspace('\u3000') returns true, and split(s) is supposed to split on all
>> space characters, so I think this might be a julia bug.
>
> Or a documentation bug, the actual default is only the ASCII spaces
> https://github.com/JuliaLang/julia/blob/master/base/strings/util.jl#L62


It is probably worth pointing out that Python 3, at least (but not
Python 2), gets this "right":

    > python3
    Python 3.4.3+ (default, Oct 14 2015, 16:03:50)
    [GCC 5.2.1 20151010] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> "Ｔｉｍｅ　ｆｌｉｅｓ　ｌｉｋｅ　ａｎ　ａｒｒｏｗ".split()
    ['Ｔｉｍｅ', 'ｆｌｉｅｓ', 'ｌｉｋｅ', 'ａｎ', 'ａｒｒｏｗ']
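
As far as I can tell, Python 3 splits here because it treats U+3000 as
whitespace. A quick check using only the standard library (the variable
names are mine, purely for illustration):

```python
import unicodedata

# U+3000 is the separator used in the example string above.
ideographic_space = "\u3000"

name = unicodedata.name(ideographic_space)          # official Unicode name
category = unicodedata.category(ideographic_space)  # general category
is_space = ideographic_space.isspace()              # what str.isspace reports

print(name, category, is_space)
```

The category "Zs" (space separator) is what marks the character as
whitespace in the Unicode data.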

I would argue that Unicode is a first-class citizen and that Julia
should also get this "right".  Doing so would take some fairly
straightforward, though not trivial, tinkering, and it would make an
excellent first contribution if someone wants to take a stab at it.
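
To make the difference concrete, here is a rough sketch in Python (not
Julia, and not the actual Base implementation) of a split that takes the
separator test as a predicate. The names split_on and ASCII_SPACES are
hypothetical, and the ASCII set is my assumption of what an ASCII-only
default looks like; the exact set in util.jl may differ.

```python
# Assumed ASCII-only separator set; the real default in Julia may differ.
ASCII_SPACES = " \t\n\v\f\r"


def split_on(s, is_sep):
    """Split s on runs of characters for which is_sep(c) is true.

    Empty fields are dropped, mirroring a no-argument split().
    """
    fields, current = [], []
    for c in s:
        if is_sep(c):
            if current:  # close the field in progress, if any
                fields.append("".join(current))
                current = []
        else:
            current.append(c)
    if current:  # trailing field
        fields.append("".join(current))
    return fields


s = "Ｔｉｍｅ\u3000ｆｌｉｅｓ\u3000ｌｉｋｅ\u3000ａｎ\u3000ａｒｒｏｗ"

# ASCII-only test: U+3000 is not in the set, so nothing is split.
ascii_fields = split_on(s, lambda c: c in ASCII_SPACES)

# Unicode-aware test: str.isspace recognises U+3000.
unicode_fields = split_on(s, str.isspace)

print(len(ascii_fields), len(unicode_fields))
```

With the ASCII-only predicate the string comes back as a single field;
with the Unicode-aware predicate it comes back as five, matching the
Python 3 session above.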

    Pontus
