[issue26917] unicodedata.normalize(): bug in Hangul Composition

2018-06-17 Thread Benjamin Peterson


Change by Benjamin Peterson :


--
resolution:  -> fixed
stage:  -> resolved
status: open -> closed

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue26917] unicodedata.normalize(): bug in Hangul Composition

2018-06-16 Thread Ma Lin


Ma Lin added the comment:

This issue can be closed; it was already fixed in issue29456.

Also, PyPy's current code is correct.

--




[issue26917] unicodedata.normalize(): bug in Hangul Composition

2018-03-18 Thread Ma Lin

Ma Lin added the comment:

> Victor's patch is correct.

I'm afraid you are wrong.
Please see PR 1958 in issue29456; IMO that PR can be merged.

--
nosy: +Ma Lin




[issue26917] unicodedata.normalize(): bug in Hangul Composition

2018-03-18 Thread Ronan Lamy

Ronan Lamy added the comment:

Victor's patch is correct. I implemented the same fix in PyPy in 
https://bitbucket.org/pypy/pypy/commits/92b4fb5b9e58

--
nosy: +Ronan.Lamy




[issue26917] unicodedata.normalize(): bug in Hangul Composition

2016-05-03 Thread Armin Rigo

Armin Rigo added the comment:

See also 
https://bitbucket.org/pypy/pypy/issues/2289/incorrect-unicode-normalization .  
It seems that you reached the same conclusion as the OP in that issue: the 
real problem is that normalizing "\uafb8\u11a7" should not drop the 
second character.  Both Python and PyPy drop it, but Python adds the 
"is_normalized()" check, so in some cases it returns the correct unmodified 
result.
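
To make the expected behaviour concrete, here is a small check using the 
standard unicodedata module. It is only a sketch and assumes an interpreter 
that already includes the fix from issue29456; on an unfixed build the 
decomposed input would be expected to lose its last character.

```python
import unicodedata

# U+AFB8 is a precomposed LV syllable. U+11A7 equals TBase, which the
# composition step treats as "no trailing consonant", so NFC should keep it.
already_composed = "\uafb8\u11a7"
decomposed = "\u1101\u116e\u11a7"

nfc_composed = unicodedata.normalize("NFC", already_composed)
nfc_decomposed = unicodedata.normalize("NFC", decomposed)

# Print the code points so the second character is visible.
print([hex(ord(ch)) for ch in nfc_composed])
print([hex(ord(ch)) for ch in nfc_decomposed])
```

Both results should contain two code points, U+AFB8 followed by U+11A7.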

--




[issue26917] unicodedata.normalize(): bug in Hangul Composition

2016-05-03 Thread STINNER Victor

STINNER Victor added the comment:

The attached patch changes the Hangul Composition step. I'm not sure that it is correct.

--
keywords: +patch
Added file: http://bugs.python.org/file42691/hangul_composition.patch




[issue26917] unicodedata.normalize(): bug in Hangul Composition

2016-05-03 Thread STINNER Victor

STINNER Victor added the comment:

Extract of nfc_nfkc():

      /* Hangul Composition. We don't need to check for <LV,T>
         pairs, since we always have decomposed data. */
      code = PyUnicode_READ(kind, data, i);
      if (LBase <= code && code < (LBase+LCount) &&
          i + 1 < len &&
          VBase <= PyUnicode_READ(kind, data, i+1) &&
          PyUnicode_READ(kind, data, i+1) <= (VBase+VCount)) {
          int LIndex, VIndex;
          LIndex = code - LBase;
          VIndex = PyUnicode_READ(kind, data, i+1) - VBase;
          code = SBase + (LIndex*VCount+VIndex)*TCount;
          i+=2;
          if (i < len &&
              TBase <= PyUnicode_READ(kind, data, i) &&
              PyUnicode_READ(kind, data, i) <= (TBase+TCount)) {
              code += PyUnicode_READ(kind, data, i)-TBase;
              i++;
          }
          output[o++] = code;
          continue;
      }

With the input string (U+1101, U+116E, U+11A7), we get:

* LIndex = 0x1101 - LBase = 1
* VIndex = 0x116E - VBase = 13

code = SBase + (LIndex*VCount+VIndex)*TCount + (ch3 - TBase)
     = 0xAC00 + (1 * 21 + 13) * 28 + 0
     = 0xAFB8

Constants:

* LBase = 0x1100, LCount = 19
* VBase = 0x1161, VCount = 21
* TBase = 0x11A7, TCount = 28
* SBase = 0xAC00

The problem may be that we consume the 3rd character even though 
(ch3 - TBase) is equal to 0.
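
The arithmetic above can be replayed with a short Python sketch. This is not 
the CPython code and not the attached patch: compose_hangul and the two 
range predicates are hypothetical helpers written for illustration. The 
inclusive predicate mimics the trailing-consonant test in the quoted C code; 
the exclusive one follows the Unicode rule that 0 < TIndex < TCount. For 
simplicity the sketch uses the exclusive upper bound for the vowel range in 
both cases, and, like the C excerpt, it assumes fully decomposed input.

```python
# Constants quoted in the message above.
L_BASE, L_COUNT = 0x1100, 19
V_BASE, V_COUNT = 0x1161, 21
T_BASE, T_COUNT = 0x11A7, 28
S_BASE = 0xAC00


def is_trailing_inclusive(cp):
    """Trailing test with inclusive bounds, as in the quoted C excerpt."""
    return T_BASE <= cp <= T_BASE + T_COUNT


def is_trailing_exclusive(cp):
    """Trailing test with exclusive bounds (0 < TIndex < TCount)."""
    return T_BASE < cp < T_BASE + T_COUNT


def compose_hangul(text, is_trailing):
    """Compose L+V(+T) jamo runs in decomposed text (hypothetical helper)."""
    out = []
    i = 0
    while i < len(text):
        code = ord(text[i])
        if (L_BASE <= code < L_BASE + L_COUNT
                and i + 1 < len(text)
                and V_BASE <= ord(text[i + 1]) < V_BASE + V_COUNT):
            l_index = code - L_BASE
            v_index = ord(text[i + 1]) - V_BASE
            code = S_BASE + (l_index * V_COUNT + v_index) * T_COUNT
            i += 2
            # Only consume a third character if it passes the trailing test.
            if i < len(text) and is_trailing(ord(text[i])):
                code += ord(text[i]) - T_BASE
                i += 1
            out.append(chr(code))
            continue
        out.append(chr(code))
        i += 1
    return "".join(out)


sample = "\u1101\u116e\u11a7"
with_inclusive = compose_hangul(sample, is_trailing_inclusive)
with_exclusive = compose_hangul(sample, is_trailing_exclusive)
print([hex(ord(ch)) for ch in with_inclusive])
print([hex(ord(ch)) for ch in with_exclusive])
```

With the inclusive test, U+11A7 is consumed while adding 0 to the code 
point, so it disappears from the output; with the exclusive test it is left 
in place after U+AFB8.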

--




[issue26917] unicodedata.normalize(): bug in Hangul Composition

2016-05-03 Thread STINNER Victor

Changes by STINNER Victor :


--
title: Inconsistency in unicodedata.normalize()? -> unicodedata.normalize(): bug in Hangul Composition
