[issue30717] Add unicode grapheme cluster break algorithm

2017-08-07 Thread Guillaume Sanchez

Guillaume Sanchez added the comment:

> I don't think unicodedata is the right place

I do agree with that. A new module sounds good. Would it be a problem if that 
module contained very few functions at first?

> Can we mark this as having a Provisional API to give us time to decide on the 
> best API before locking it in permanently?

I'm not sure it's my call to make, but I would gladly consider that option.

> we should go through a PEP.

Why not. I may need a bit of guidance though.

> If you want state keeping for iterating over multiple  parts of 
> the string, you can use an iterator.

Sure thing. It just wasn't specified like this in the proto-PEP.

> The APIs were inspired by the standard string.find() APIs, that's why they 
> work on indexes and don't return Unicode strings. As such, they serve a 
> different use case than an iterator.

I personally like having a generator returning slice objects, as suggested 
above. What would be some good objections to this?

> Wouldn't this be a typical case where we'd expect a module to evolve and gain 
> usage on PyPI first, before adding it to the stdlib? [...] they might give 
> inspiration for a suitable API design

I'll give it a look.

> The well known library for Unicode support in C++ and Java is ICU

Yes. I clearly don't want this PR to be interpreted as "we need ICU". ICU 
provides much, much more than what I'm willing to provide. My goal here is just 
to "fix" how Python's standard library iterates over characters. Nothing 
more, nothing less.

One might think that splitlines() should be "fixed" too, and there is clearly 
something to discuss there. Same for word splitting. However, I do not intend to 
bring normalization (which you already have), collation, locale-dependent 
uppercasing or lowercasing, etc. We might need a wheel, but we don't have to take 
the whole truck.

How do we discuss all of this? Who's in charge of making decisions? How long 
should we debate? This is my first time contributing to Python and I'm new to 
all of this.

Thanks for your time.

--

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue30717>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue30717] Add unicode grapheme cluster break algorithm

2017-08-03 Thread Guillaume Sanchez

Guillaume Sanchez added the comment:

I have a few criticisms of that proto-PEP:

http://mail.python.org/pipermail/python-dev/2001-July/015938.html

In particular, the fact that all those functions return an index prevents any 
state keeping.

That's a problem because:

> next_(u, index) -> integer

As you've seen, in grapheme clustering (as well as word and line breaking), 
we need an automaton to decide on the breaking point, which means that 
starting at an arbitrary index is not possible.

> prev_(u, index) -> integer

Is it really necessary? It means implementing the same logic to go backward. In 
our current case, we'd need a backward grapheme cluster break automaton too.

> _start(u, index) -> integer
> _end(u, index) -> integer

Not doable in O(1) for the same reason as next_(). We need a 
context, and the code point itself cannot give enough information to know if 
it's the start/end of a given indextype.
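
To illustrate the context problem, here is a minimal sketch that models only the 
regional-indicator pairing rules of TR29 (GB12/GB13) and ignores every other 
rule. The names is_regional_indicator and break_before are made up for this 
example; they are not part of the proto-PEP or of the PR. The sketch assumes 
text[index] is itself a regional indicator.

```python
def is_regional_indicator(ch):
    # Regional indicator symbols occupy U+1F1E6..U+1F1FF.
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF


def break_before(text, index):
    """Return True if a cluster boundary precedes text[index].

    Hypothetical helper: only regional-indicator pairing is modelled,
    and text[index] is assumed to be a regional indicator.
    """
    # Count the regional indicators immediately before `index`.
    run = 0
    j = index - 1
    while j >= 0 and is_regional_indicator(text[j]):
        run += 1
        j -= 1
    # An even run means all earlier pairs are complete: a new cluster starts.
    return run % 2 == 0


flags = "\U0001F1EB\U0001F1F7\U0001F1E9\U0001F1EA"  # two flags: FR, DE
print([break_before(flags, i) for i in range(len(flags))])
```

The code points at indexes 1 and 2 are both preceded by a regional indicator, 
yet only the one at index 2 starts a cluster. Answering the question requires 
scanning backward over the whole run, so the cost grows with the run length.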




[issue30717] Add unicode grapheme cluster break algorithm

2017-08-03 Thread Guillaume Sanchez

Guillaume Sanchez added the comment:

Thanks for your consideration. I'm currently fixing what's been asked in the 
reviews.

> But it would be useful to provide also word and sentence iterators.

I'll gladly do that as well!

> I think emitting a pair (pos, substring) would be more useful.

That means emitting a pair like ((start, end), substr)? Is it Pythonic to 
return a structure like this?

For what it's worth, I don't like it, but I definitely understand the value of 
it. I'd prefer having two versions. One returning indexes, the other returning 
substrings.

But...

> Alternatively an iterator could emit slice objects.

I really like that. Do we have a clear common agreement or preference on any 
option?
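
For comparison, here is a rough sketch of the slice-based variant. It groups a 
base code point with the combining marks that follow it, which is a deliberate 
simplification and not the TR29 algorithm; iter_cluster_slices is a made-up 
name. Slice objects give both shapes at once: the indexes are available as 
.start and .stop, and the substring comes from indexing the string with the 
slice.

```python
import unicodedata


def iter_cluster_slices(text):
    """Yield one slice per simplified cluster (base + combining marks)."""
    start = 0
    for i in range(1, len(text)):
        # A code point with combining class 0 starts a new cluster here.
        if unicodedata.combining(text[i]) == 0:
            yield slice(start, i)
            start = i
    if text:
        yield slice(start, len(text))


s = "lo\u0309l"
slices = list(iter_cluster_slices(s))
positions = [(sl.start, sl.stop) for sl in slices]  # index view
substrings = [s[sl] for sl in slices]               # substring view
```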




[issue30717] str.center() is not unicode aware

2017-08-02 Thread Guillaume Sanchez

Guillaume Sanchez added the comment:

Hi,

Are you guys still interested? I haven't heard from you in a while.




[issue30717] str.center() is not unicode aware

2017-07-13 Thread Guillaume Sanchez

Guillaume Sanchez added the comment:

Hello Steven!

Thanks for the quick response!

unicodedata.grapheme_cluster_break() takes a Unicode code point as an argument 
and returns its GraphemeBreakProperty as a string. Possible values are listed 
here: http://www.unicode.org/reports/tr29/#CR

help(unicodedata.grapheme_cluster_break) says:
grapheme_cluster_break(chr, /)
Returns the GraphemeBreakProperty assigned to the character chr as string.
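
To give a rough idea of the kind of value returned, here is a hand-written 
lookup covering only a few code point ranges whose property is easy to compute. 
It is not the implementation in the PR, and the exact strings the PR returns may 
differ. gcb_subset is a made-up name, and it returns None for anything outside 
the ranges it knows.

```python
def gcb_subset(ch):
    """Return the Grapheme_Cluster_Break value for a few easy cases, else None."""
    cp = ord(ch)
    if cp == 0x0D:
        return "CR"
    if cp == 0x0A:
        return "LF"
    if cp == 0x200D:
        return "ZWJ"
    if 0x1F1E6 <= cp <= 0x1F1FF:
        return "Regional_Indicator"
    if 0xAC00 <= cp <= 0xD7A3:
        # Precomposed Hangul syllables: no trailing consonant means LV.
        return "LV" if (cp - 0xAC00) % 28 == 0 else "LVT"
    return None  # not covered by this sketch
```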



unicodedata.break_graphemes() takes a Unicode string as an argument and returns a 
GraphemeClusterIterator that yields consecutive grapheme clusters.

help(unicodedata.break_graphemes) says:

break_graphemes(unistr, /)
Returns an iterator to iterate over grapheme clusters in unistr.

It uses extended grapheme cluster rules from TR29.


Is there anything else you would like to know? Don't hesitate to ask :)

Thank you for your time!

--
assignee:  -> christian.heimes
components: +SSL, Tests, Tkinter -Library (Lib)
nosy: +christian.heimes




[issue12568] Add functions to get the width in columns of a character

2017-07-13 Thread Guillaume Sanchez

Guillaume Sanchez added the comment:

Hello,

I'm coming from bugs.python.org/issue30717. I have a pending PR that needs review 
( https://github.com/python/cpython/pull/2673 ) adding a function that breaks 
Unicode strings into grapheme clusters (i.e. what one would intuitively call "a 
character"). It's based on the grapheme cluster breaking algorithm from TR29.

Let me know if this is of any relevance.

Quick demo:
>>> a=unicodedata.break_graphemes("lol")
>>> list(a)
['l', 'o', 'l']
>>> list(unicodedata.break_graphemes("lo\u0309l"))
['l', 'ỏ', 'l']
>>> list(unicodedata.break_graphemes("lo\u0309\u0301l"))
['l', 'ỏ́', 'l']
>>> list(unicodedata.break_graphemes("lo\u0301l"))
['l', 'ó', 'l']
>>> list(unicodedata.break_graphemes(""))
[]

--
nosy: +Guillaume Sanchez

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue12568>
___



[issue30717] str.center() is not unicode aware

2017-07-13 Thread Guillaume Sanchez

Guillaume Sanchez added the comment:

Hello,

I implemented unicodedata.break_graphemes(), which returns an iterator that 
yields consecutive grapheme clusters.

This is a "test" implementation meant to see what doesn't fit Python's style 
and design, and to discuss naming and implementation details.

https://github.com/python/cpython/pull/2673

Thanks for your time and interest




[issue30717] str.center() is not unicode aware

2017-07-11 Thread Guillaume Sanchez

Guillaume Sanchez added the comment:

Hello to all of you, sorry for the delay. Been busy.

I added the base code needed to build the grapheme cluster break algorithm. We 
now have the GraphemeBreakProperty available via 
unicodedata.grapheme_cluster_break()

Can you check that the implementation correctly fits the design? I was not sure 
about adding that prop to unicodedata_db or unicodectype_db, tbh.

If it's all correct, I'll move forward with the automaton and the grapheme 
cluster breaking algorithm.

Thanks!




[issue30717] str.center() is not unicode aware

2017-06-20 Thread Guillaume Sanchez

Guillaume Sanchez added the comment:

Thanks for all those interesting cases you brought here! I didn't think of that 
at all!

I'm using the word "grapheme" as per the definition given in UAX TR29, which is 
*not* language/locale dependent [1].

This annex is very specific and precise about where to break "grapheme clusters", 
i.e. where a character starts and ends. Sadly, it's a bit more complex 
than just accumulating based on the Combining property. The annex gives a set 
of rules to implement, based on the Grapheme_Cluster_Break property. It is 
tempting to implement those rules by comparing adjacent pairs of code points, 
but that is wrong: some rules depend on earlier context. They can be implemented 
correctly and efficiently as an automaton. My code [2] passes all tests from 
GraphemeBreakTest.txt (provided by Unicode).
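
To show what "automaton" means here, a small state-machine sketch covering two 
rules only: attach combining marks to the preceding cluster (an approximation of 
GB9 that uses unicodedata.combining instead of the real Extend property) and 
pair up regional indicators (GB12/GB13). It is far smaller than the code linked 
in [2] and is not derived from it; simple_clusters is a made-up name.

```python
import unicodedata


def is_regional_indicator(ch):
    # Regional indicator symbols occupy U+1F1E6..U+1F1FF.
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF


def simple_clusters(text):
    """Split text into simplified clusters using one bit of state."""
    clusters = []
    # State: True when the previous code point is the first half of a flag.
    odd_ri = False
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch  # combining mark joins the current cluster
            odd_ri = False
        elif clusters and odd_ri and is_regional_indicator(ch):
            clusters[-1] += ch  # second half of a regional-indicator pair
            odd_ri = False
        else:
            clusters.append(ch)  # boundary: start a new cluster
            odd_ri = is_regional_indicator(ch)
    return clusters
```

With three consecutive regional indicators, the second and third are adjacent 
regional indicators just like the first and second, but only the first pair is 
joined. The pair alone does not decide the outcome; the state does.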

We can definitely do a generator like you propose, or rather do it in the C 
layer to gain more efficiency and coherence, since the other string / Unicode 
operations are in the C layer (upper, lower, casefold, etc.).

Let me know what you guys think, what (and if) I should contribute :)

[1] http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
[2] 
https://github.com/Vermeille/batriz/blob/master/src/str/grapheme_iterator.h#L31




[issue30717] str.center() is not unicode aware

2017-06-20 Thread Guillaume Sanchez

Guillaume Sanchez added the comment:

Obviously, I'm talking about str.center() here, but every function that needs a 
count of graphemes is similarly affected.

I can fix that and add the corresponding function, or an iterator over 
graphemes, or whatever seems right :)




[issue30717] str.center() is not unicode aware

2017-06-20 Thread Guillaume Sanchez

New submission from Guillaume Sanchez:

"a⃑".center(5, ".")
produces
'..a⃑.' instead of '..a⃑..'

The reason is that "a⃑" is composed of two code points (2 UCS4 chars): one 'a' 
and one combining code point "above arrow". str.center() takes the length of the 
string in code points and pads it on both sides with `fillchar` until the length 
reaches `width`. However, the width is almost certainly intended to be counted in 
characters, not in code points.

The correct way to count characters is to use the grapheme clustering algorithm 
from UAX TR29.
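
As a rough illustration of the intended behaviour, here is a sketch that centers 
a string while counting a base code point and its combining marks as one 
character. The counting is a simplification based on unicodedata.combining, not 
the TR29 algorithm, and center_clusters is a made-up name. The example string 
uses U+20D7 COMBINING RIGHT ARROW ABOVE, chosen for illustration.

```python
import unicodedata


def center_clusters(text, width, fillchar=" "):
    """Center text, counting a base code point plus its marks as one."""
    # Simplified count: code points with a non-zero combining class are skipped.
    visible = sum(1 for ch in text if unicodedata.combining(ch) == 0)
    # Widen the target by the number of skipped code points, then reuse
    # str.center(), which measures in code points.
    return text.center(width + len(text) - visible, fillchar)


arrow_a = "a\u20d7"  # 'a' followed by a combining arrow above
print(repr(arrow_a.center(5, ".")))
print(repr(center_clusters(arrow_a, 5, ".")))
```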

It turns out I have already implemented this myself, and I can open a PR if 
asked, with a little help on the C <-> Python glue.

Thanks for your time.

--
components: Library (Lib)
messages: 296478
nosy: Guillaume Sanchez
priority: normal
severity: normal
status: open
title: str.center() is not unicode aware
versions: Python 3.7
