On Wed, Jun 7, 2017 at 1:06 AM, Man Trieu <[email protected]> wrote:
> 2017-06-07 0:31 GMT+09:00 Bruce Momjian <[email protected]>:
>>
>> On Wed, Jun 7, 2017 at 12:10:25AM +0900, Dang Minh Huong wrote:
>> > > On Jun 4, 29 Heisei, at 00:48, Bruce Momjian <[email protected]> wrote:
>> > >>>> Shouldn't you use "or is_letter_with_marks()", instead of "or
>> > >>>> len(...)
>> > >>>>> 1"? Your test might catch something that isn't based on a
>> > >>>>> 'letter'
>> > >>>> (according to is_plain_letter). Otherwise this looks pretty good
>> > >>>> to
>> > >>>> me. Please add it to the next commitfest.
>> > >>>
>> > >>> Thanks for confirm, sir.
>> > >>> I will add it to the next CF soon.
>> > >>
>> > >> Sorry for lately response. I attach the update patch.
>> > >
>> > > Uh, there is no patch attached.
>> > >
>> >
>> > Sorry sir, reattach the patch.
>> > I also added it to the next CF and set reviewers to Thomas Munro. Could
>> > you confirm for me.
>>
>> There seems to be a problem. I can't see a patch dated 2017-06-07 on
>> the commitfest page:
>>
>> https://commitfest.postgresql.org/14/1161/
>>
>> I added the thread but there was no change. (I think the thread was
>> already present.) It appears it is not seeing this patch as the latest
>> patch.
>>
>> Does anyone know why this is happening?
>
> May be due to my Mac's mailer? Sorry but I try one more time to attach the
> patch by webmail.
I have finally been able to look at this patch.
(Surprised to see that generate_unaccent_rules.py is inconsistent on
MacOS, runs fine on Linux).
def get_plain_letter(codepoint, table):
"""Return the base codepoint without marks."""
if is_letter_with_marks(codepoint, table):
- return table[codepoint.combining_ids[0]]
+ if len(table[codepoint.combining_ids[0]].combining_ids) > 1:
+ # Recursive to find the plain letter
+ return get_plain_letter(table[codepoint.combining_ids[0]],table)
+ elif is_plain_letter(table[codepoint.combining_ids[0]]):
+ return table[codepoint.combining_ids[0]]
+ else:
+ return None
elif is_plain_letter(codepoint):
return codepoint
else:
- raise "mu"
+ return None
The code paths returning None should not be reached, so I would
suggest adding an assertion instead. Callers of get_plain_letter would
blow up on None, still that would make future debugging harder.
def is_letter_with_marks(codepoint, table):
- """Returns true for plain letters combined with one or more marks."""
+ """Returns true for letters combined with one or more marks."""
# See
http://www.unicode.org/reports/tr44/tr44-14.html#General_Category_Values
return len(codepoint.combining_ids) > 1 and \
- is_plain_letter(table[codepoint.combining_ids[0]]) and \
+ (is_plain_letter(table[codepoint.combining_ids[0]]) or\
+ is_letter_with_marks(table[codepoint.combining_ids[0]],table))
and \
all(is_mark(table[i]) for i in codepoint.combining_ids[1:]
This was already hard to follow, and this patch makes its harder. I
think that the thing should be refactored with multiple conditions.
if is_letter_with_marks(codepoint, table):
- charactersSet.add((codepoint.id,
+ if get_plain_letter(codepoint, table) <> None:
+ charactersSet.add((codepoint.id,
This change is not necessary as a letter with marks is not a plain
character anyway.
Testing with characters having two accents, the results are produced
as wanted. I am attaching an updated patch with all those
simplifications. Thoughts?
--
Michael
diff --git a/contrib/unaccent/generate_unaccent_rules.py b/contrib/unaccent/generate_unaccent_rules.py
index a5eb42f0b1..9135ec23ce 100644
--- a/contrib/unaccent/generate_unaccent_rules.py
+++ b/contrib/unaccent/generate_unaccent_rules.py
@@ -48,24 +48,47 @@ def is_mark(codepoint):
return codepoint.general_category in ("Mn", "Me", "Mc")
def is_letter_with_marks(codepoint, table):
- """Returns true for plain letters combined with one or more marks."""
+ """Returns true for letters combined with one or more marks."""
# See http://www.unicode.org/reports/tr44/tr44-14.html#General_Category_Values
- return len(codepoint.combining_ids) > 1 and \
- is_plain_letter(table[codepoint.combining_ids[0]]) and \
- all(is_mark(table[i]) for i in codepoint.combining_ids[1:])
+
+ # Letter may have no combining characters, in which case it has
+ # no marks.
+ if len(codepoint.combining_ids) == 1:
+ return False
+
+ # A letter without diatritical marks has none of them.
+ if any(is_mark(table[i]) for i in codepoint.combining_ids[1:]) is False:
+ return False
+
+ # Check if the base letter of this letter has marks.
+ codepoint_base = codepoint.combining_ids[0]
+ if (is_plain_letter(table[codepoint_base]) is False and \
+ is_letter_with_marks(table[codepoint_base], table) is False):
+ return False
+
+ return True
def is_letter(codepoint, table):
"""Return true for letter with or without diacritical marks."""
return is_plain_letter(codepoint) or is_letter_with_marks(codepoint, table)
def get_plain_letter(codepoint, table):
- """Return the base codepoint without marks."""
+ """Return the base codepoint without marks. If this codepoint has more
+ than one combining character, do a recursive lookup on the table to
+ find out its plain base letter."""
if is_letter_with_marks(codepoint, table):
- return table[codepoint.combining_ids[0]]
+ if len(table[codepoint.combining_ids[0]].combining_ids) > 1:
+ return get_plain_letter(table[codepoint.combining_ids[0]], table)
+ elif is_plain_letter(table[codepoint.combining_ids[0]]):
+ return table[codepoint.combining_ids[0]]
+
+ # Should not come here
+ assert(False)
elif is_plain_letter(codepoint):
return codepoint
- else:
- raise "mu"
+
+ # Should not come here
+ assert(False)
def is_ligature(codepoint, table):
"""Return true for letters combined with letters."""
diff --git a/contrib/unaccent/unaccent.rules b/contrib/unaccent/unaccent.rules
index 84886da587..97f9ed47cf 100644
--- a/contrib/unaccent/unaccent.rules
+++ b/contrib/unaccent/unaccent.rules
@@ -254,6 +254,18 @@
ǒ o
Ǔ U
ǔ u
+Ǖ U
+ǖ u
+Ǘ U
+ǘ u
+Ǚ U
+ǚ u
+Ǜ U
+ǜ u
+Ǟ A
+ǟ a
+Ǡ A
+ǡ a
Ǥ G
ǥ g
Ǧ G
@@ -262,6 +274,8 @@
ǩ k
Ǫ O
ǫ o
+Ǭ O
+ǭ o
ǰ j
DZ DZ
Dz Dz
@@ -270,6 +284,8 @@
ǵ g
Ǹ N
ǹ n
+Ǻ A
+ǻ a
Ȁ A
ȁ a
Ȃ A
@@ -307,8 +323,14 @@
ȧ a
Ȩ E
ȩ e
+Ȫ O
+ȫ o
+Ȭ O
+ȭ o
Ȯ O
ȯ o
+Ȱ O
+ȱ o
Ȳ Y
ȳ y
ȴ l
@@ -441,6 +463,8 @@
ḅ b
Ḇ B
ḇ b
+Ḉ C
+ḉ c
Ḋ D
ḋ d
Ḍ D
@@ -451,10 +475,16 @@
ḑ d
Ḓ D
ḓ d
+Ḕ E
+ḕ e
+Ḗ E
+ḗ e
Ḙ E
ḙ e
Ḛ E
ḛ e
+Ḝ E
+ḝ e
Ḟ F
ḟ f
Ḡ G
@@ -471,6 +501,8 @@
ḫ h
Ḭ I
ḭ i
+Ḯ I
+ḯ i
Ḱ K
ḱ k
Ḳ K
@@ -479,6 +511,8 @@
ḵ k
Ḷ L
ḷ l
+Ḹ L
+ḹ l
Ḻ L
ḻ l
Ḽ L
@@ -497,6 +531,14 @@
ṉ n
Ṋ N
ṋ n
+Ṍ O
+ṍ o
+Ṏ O
+ṏ o
+Ṑ O
+ṑ o
+Ṓ O
+ṓ o
Ṕ P
ṕ p
Ṗ P
@@ -505,12 +547,20 @@
ṙ r
Ṛ R
ṛ r
+Ṝ R
+ṝ r
Ṟ R
ṟ r
Ṡ S
ṡ s
Ṣ S
ṣ s
+Ṥ S
+ṥ s
+Ṧ S
+ṧ s
+Ṩ S
+ṩ s
Ṫ T
ṫ t
Ṭ T
@@ -525,6 +575,10 @@
ṵ u
Ṷ U
ṷ u
+Ṹ U
+ṹ u
+Ṻ U
+ṻ u
Ṽ V
ṽ v
Ṿ V
@@ -563,12 +617,42 @@
ạ a
Ả A
ả a
+Ấ A
+ấ a
+Ầ A
+ầ a
+Ẩ A
+ẩ a
+Ẫ A
+ẫ a
+Ậ A
+ậ a
+Ắ A
+ắ a
+Ằ A
+ằ a
+Ẳ A
+ẳ a
+Ẵ A
+ẵ a
+Ặ A
+ặ a
Ẹ E
ẹ e
Ẻ E
ẻ e
Ẽ E
ẽ e
+Ế E
+ế e
+Ề E
+ề e
+Ể E
+ể e
+Ễ E
+ễ e
+Ệ E
+ệ e
Ỉ I
ỉ i
Ị I
@@ -577,10 +661,40 @@
ọ o
Ỏ O
ỏ o
+Ố O
+ố o
+Ồ O
+ồ o
+Ổ O
+ổ o
+Ỗ O
+ỗ o
+Ộ O
+ộ o
+Ớ O
+ớ o
+Ờ O
+ờ o
+Ở O
+ở o
+Ỡ O
+ỡ o
+Ợ O
+ợ o
Ụ U
ụ u
Ủ U
ủ u
+Ứ U
+ứ u
+Ừ U
+ừ u
+Ử U
+ử u
+Ữ U
+ữ u
+Ự U
+ự u
Ỳ Y
ỳ y
Ỵ Y
--
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers