subject:"\[HACKERS\] Extra Vietnamese unaccent rules"

Re: [HACKERS] Extra Vietnamese unaccent rules

2017-08-17 Thread Dang Minh Huong


Thanks!


On 2017/08/17 11:56, Tom Lane wrote:

Michael Paquier  writes:

On Thu, Aug 17, 2017 at 6:01 AM, Tom Lane  wrote:

I'm not really qualified to review the Python coding
style, but I did fix a typo in a comment.

No pythonist here, but a large confusing "if" condition without any
comments is better if split up and explained with comments if that can
help in clarifying what the code is doing in any language, so thanks
for keeping the code intact.

Certainly agreed on splitting up the logic into multiple statements.
I just meant that I don't know enough Python to know if there are
better ways to do these tests.  (It probably doesn't matter, since
performance of this script is not an issue, and it's not likely to
undergo a lot of further development either.)

regards, tom lane




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Extra Vietnamese unaccent rules

2017-08-16 Thread Tom Lane

Michael Paquier  writes:
> On Thu, Aug 17, 2017 at 6:01 AM, Tom Lane  wrote:
>> I'm not really qualified to review the Python coding
>> style, but I did fix a typo in a comment.

> No pythonist here, but a large confusing "if" condition without any
> comments is better if split up and explained with comments if that can
> help in clarifying what the code is doing in any language, so thanks
> for keeping the code intact.

Certainly agreed on splitting up the logic into multiple statements.
I just meant that I don't know enough Python to know if there are
better ways to do these tests.  (It probably doesn't matter, since
performance of this script is not an issue, and it's not likely to
undergo a lot of further development either.)

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Extra Vietnamese unaccent rules

2017-08-16 Thread Michael Paquier

On Thu, Aug 17, 2017 at 6:01 AM, Tom Lane  wrote:
> Pushed into v11.

Thanks.

> I'm not really qualified to review the Python coding
> style, but I did fix a typo in a comment.

No pythonist here, but a large confusing "if" condition without any
comments is better if split up and explained with comments if that can
help in clarifying what the code is doing in any language, so thanks
for keeping the code intact.

> BTW, while this isn't a reason to delay this patch, I wonder whether
> the regression test for unaccent is really adequate.  According to
> https://coverage.postgresql.org/contrib/unaccent/unaccent.c.gcov.html
> it isn't doing anything to check multicharacter source strings, and
> what's considerably more disturbing, it isn't exercising the PG_CATCH
> code that's meant to deal with characters outside the current database's
> encoding.

Yeah, that could be improved a bit.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Extra Vietnamese unaccent rules

2017-08-16 Thread Tom Lane

Dang Minh Huong  writes:
> On 2017/07/05 15:28, Michael Paquier wrote:
>> (Surprised to see that generate_unaccent_rules.py is inconsistent on
>> MacOS, runs fine on Linux).

FWIW, I got identical results from running the script on current macOS
(Sierra) and Linux (RHEL6).

>> Testing with characters having two accents, the results are produced
>> as wanted. I am attaching an updated patch with all those
>> simplifications. Thoughts?

> Thanks, so pretty. The patch is fine to me.

Pushed into v11.  I'm not really qualified to review the Python coding
style, but I did fix a typo in a comment.

BTW, while this isn't a reason to delay this patch, I wonder whether
the regression test for unaccent is really adequate.  According to
https://coverage.postgresql.org/contrib/unaccent/unaccent.c.gcov.html
it isn't doing anything to check multicharacter source strings, and
what's considerably more disturbing, it isn't exercising the PG_CATCH
code that's meant to deal with characters outside the current database's
encoding.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Extra Vietnamese unaccent rules

2017-07-05 Thread Dang Minh Huong


On 2017/07/05 15:28, Michael Paquier wrote:

I have finally been able to look at this patch.


Thanks for reviewing and the new version of the patch.

(Surprised to see that generate_unaccent_rules.py is inconsistent on
MacOS, runs fine on Linux).

  def get_plain_letter(codepoint, table):
  """Return the base codepoint without marks."""
  if is_letter_with_marks(codepoint, table):
-return table[codepoint.combining_ids[0]]
+if len(table[codepoint.combining_ids[0]].combining_ids) > 1:
+# Recursive to find the plain letter
+return get_plain_letter(table[codepoint.combining_ids[0]],table)
+elif is_plain_letter(table[codepoint.combining_ids[0]]):
+return table[codepoint.combining_ids[0]]
+else:
+return None
  elif is_plain_letter(codepoint):
  return codepoint
  else:
-raise "mu"
+return None
The code paths returning None should not be reached, so I would
suggest adding an assertion instead. Callers of get_plain_letter would
blow up on None, still that would make future debugging harder.

  def is_letter_with_marks(codepoint, table):
-"""Returns true for plain letters combined with one or more marks."""
+"""Returns true for letters combined with one or more marks."""
  # See 
http://www.unicode.org/reports/tr44/tr44-14.html#General_Category_Values
  return len(codepoint.combining_ids) > 1 and \
-   is_plain_letter(table[codepoint.combining_ids[0]]) and \
+   (is_plain_letter(table[codepoint.combining_ids[0]]) or\
+is_letter_with_marks(table[codepoint.combining_ids[0]],table))
and \
 all(is_mark(table[i]) for i in codepoint.combining_ids[1:]
This was already hard to follow, and this patch makes its harder. I
think that the thing should be refactored with multiple conditions.

  if is_letter_with_marks(codepoint, table):
-charactersSet.add((codepoint.id,
+if get_plain_letter(codepoint, table) <> None:
+charactersSet.add((codepoint.id,
This change is not necessary as a letter with marks is not a plain
character anyway.

Testing with characters having two accents, the results are produced
as wanted. I am attaching an updated patch with all those
simplifications. Thoughts?


Thanks, so pretty. The patch is fine to me.

---
Thanks and best regards,
Dang Minh Huong

---
このEメールはアバスト アンチウイルスによりウイルススキャンされています。
https://www.avast.com/antivirus



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Extra Vietnamese unaccent rules

2017-07-04 Thread Michael Paquier

On Wed, Jun 7, 2017 at 1:06 AM, Man Trieu  wrote:
> 2017-06-07 0:31 GMT+09:00 Bruce Momjian :
>>
>> On Wed, Jun  7, 2017 at 12:10:25AM +0900, Dang Minh Huong wrote:
>> > > On Jun 4, 29 Heisei, at 00:48, Bruce Momjian  wrote:
>> >  Shouldn't you use "or is_letter_with_marks()", instead of "or
>> >  len(...)
>> > > 1"?  Your test might catch something that isn't based on a
>> > > 'letter'
>> >  (according to is_plain_letter).  Otherwise this looks pretty good
>> >  to
>> >  me.  Please add it to the next commitfest.
>> > >>>
>> > >>> Thanks for confirm, sir.
>> > >>> I will add it to the next CF soon.
>> > >>
>> > >> Sorry for lately response. I attach the update patch.
>> > >
>> > > Uh, there is no patch attached.
>> > >
>> >
>> > Sorry sir, reattach the patch.
>> > I also added it to the next CF and set reviewers to Thomas Munro. Could
>> > you confirm for me.
>>
>> There seems to be a problem.  I can't see a patch dated 2017-06-07 on
>> the commitfest page:
>>
>> https://commitfest.postgresql.org/14/1161/
>>
>> I added the thread but there was no change.  (I think the thread was
>> already present.)  It appears it is not seeing this patch as the latest
>> patch.
>>
>> Does anyone know why this is happening?
>
> May be due to my Mac's mailer? Sorry but I try one more time to attach the
> patch by webmail.

I have finally been able to look at this patch.
(Surprised to see that generate_unaccent_rules.py is inconsistent on
MacOS, runs fine on Linux).

 def get_plain_letter(codepoint, table):
 """Return the base codepoint without marks."""
 if is_letter_with_marks(codepoint, table):
-return table[codepoint.combining_ids[0]]
+if len(table[codepoint.combining_ids[0]].combining_ids) > 1:
+# Recursive to find the plain letter
+return get_plain_letter(table[codepoint.combining_ids[0]],table)
+elif is_plain_letter(table[codepoint.combining_ids[0]]):
+return table[codepoint.combining_ids[0]]
+else:
+return None
 elif is_plain_letter(codepoint):
 return codepoint
 else:
-raise "mu"
+return None
The code paths returning None should not be reached, so I would
suggest adding an assertion instead. Callers of get_plain_letter would
blow up on None, still that would make future debugging harder.

 def is_letter_with_marks(codepoint, table):
-"""Returns true for plain letters combined with one or more marks."""
+"""Returns true for letters combined with one or more marks."""
 # See 
http://www.unicode.org/reports/tr44/tr44-14.html#General_Category_Values
 return len(codepoint.combining_ids) > 1 and \
-   is_plain_letter(table[codepoint.combining_ids[0]]) and \
+   (is_plain_letter(table[codepoint.combining_ids[0]]) or\
+is_letter_with_marks(table[codepoint.combining_ids[0]],table))
and \
all(is_mark(table[i]) for i in codepoint.combining_ids[1:]
This was already hard to follow, and this patch makes its harder. I
think that the thing should be refactored with multiple conditions.

 if is_letter_with_marks(codepoint, table):
-charactersSet.add((codepoint.id,
+if get_plain_letter(codepoint, table) <> None:
+charactersSet.add((codepoint.id,
This change is not necessary as a letter with marks is not a plain
character anyway.

Testing with characters having two accents, the results are produced
as wanted. I am attaching an updated patch with all those
simplifications. Thoughts?
-- 
Michael
diff --git a/contrib/unaccent/generate_unaccent_rules.py b/contrib/unaccent/generate_unaccent_rules.py
index a5eb42f0b1..9135ec23ce 100644
--- a/contrib/unaccent/generate_unaccent_rules.py
+++ b/contrib/unaccent/generate_unaccent_rules.py
@@ -48,24 +48,47 @@ def is_mark(codepoint):
 return codepoint.general_category in ("Mn", "Me", "Mc")
 
 def is_letter_with_marks(codepoint, table):
-"""Returns true for plain letters combined with one or more marks."""
+"""Returns true for letters combined with one or more marks."""
 # See http://www.unicode.org/reports/tr44/tr44-14.html#General_Category_Values
-return len(codepoint.combining_ids) > 1 and \
-   is_plain_letter(table[codepoint.combining_ids[0]]) and \
-   all(is_mark(table[i]) for i in codepoint.combining_ids[1:])
+
+# Letter may have no combining characters, in which case it has
+# no marks.
+if len(codepoint.combining_ids) == 1:
+return False
+
+# A letter without diatritical marks has none of them.
+if any(is_mark(table[i]) for i in codepoint.combining_ids[1:]) is False:
+return False
+
+# Check if the base letter of this letter has marks.
+codepoint_base = codepoint.combining_ids[0]
+if (is_plain_letter(table[codepoint_base]) is False and \
+is_letter_with_marks(table[codepoint_base], table) is False):
+return False
+
+

Re: [HACKERS] Extra Vietnamese unaccent rules

2017-06-06 Thread Bruce Momjian

On Tue, Jun  6, 2017 at 12:15:13PM -0400, Tom Lane wrote:
> Bruce Momjian  writes:
> > There seems to be a problem.  I can't see a patch dated 2017-06-07 on
> > the commitfest page:
> > https://commitfest.postgresql.org/14/1161/
> 
> It looks to me like the patch is buried inside a multipart/alternative
> MIME section.  That's evidently causing our mail archives to miss its
> presence.  The latest message does show as having an attachment in the
> archives, but I think there's some delay before the CF app will notice.

OK, I see had picked up my email as the lastest, not the latest patch. 
I see now the second patch email appears properly on the webpage, so we
are good:

https://commitfest.postgresql.org/14/1161/

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Extra Vietnamese unaccent rules

2017-06-06 Thread Tom Lane

Bruce Momjian  writes:
> There seems to be a problem.  I can't see a patch dated 2017-06-07 on
> the commitfest page:
>   https://commitfest.postgresql.org/14/1161/

It looks to me like the patch is buried inside a multipart/alternative
MIME section.  That's evidently causing our mail archives to miss its
presence.  The latest message does show as having an attachment in the
archives, but I think there's some delay before the CF app will notice.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Extra Vietnamese unaccent rules

2017-06-06 Thread Bruce Momjian

On Wed, Jun  7, 2017 at 01:06:22AM +0900, Man Trieu wrote:
> 2017-06-07 0:31 GMT+09:00 Bruce Momjian :
> I added the thread but there was no change.  (I think the thread was
> already present.)  It appears it is not seeing this patch as the latest
> patch.
> 
> Does anyone know why this is happening?
> 
> 
> 
> May be due to my Mac's mailer? Sorry but I try one more time to attach the
> patch by webmail.

It is getting weirder.  It has picked up my email report of a commitfest
problem as the latest patch (even though there was no patch), and your
second posting is not listed:

https://commitfest.postgresql.org/14/1161/

I think we need someone who knows the rules of how the commitfest finds
patches.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Extra Vietnamese unaccent rules

2017-06-06 Thread Man Trieu

2017-06-07 0:31 GMT+09:00 Bruce Momjian :

> On Wed, Jun  7, 2017 at 12:10:25AM +0900, Dang Minh Huong wrote:
> > > On Jun 4, 29 Heisei, at 00:48, Bruce Momjian  wrote:
> >  Shouldn't you use "or is_letter_with_marks()", instead of "or
> len(...)
> > > 1"?  Your test might catch something that isn't based on a 'letter'
> >  (according to is_plain_letter).  Otherwise this looks pretty good to
> >  me.  Please add it to the next commitfest.
> > >>>
> > >>> Thanks for confirm, sir.
> > >>> I will add it to the next CF soon.
> > >>
> > >> Sorry for lately response. I attach the update patch.
> > >
> > > Uh, there is no patch attached.
> > >
> >
> > Sorry sir, reattach the patch.
> > I also added it to the next CF and set reviewers to Thomas Munro. Could
> you confirm for me.
>
> There seems to be a problem.  I can't see a patch dated 2017-06-07 on
> the commitfest page:
>
> https://commitfest.postgresql.org/14/1161/
>
> I added the thread but there was no change.  (I think the thread was
> already present.)  It appears it is not seeing this patch as the latest
> patch.
>
> Does anyone know why this is happening?
>


May be due to my Mac's mailer? Sorry but I try one more time to attach the
patch by webmail.

---
Thanks and best regards,
Dang Minh Huong


unaccent.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Extra Vietnamese unaccent rules

2017-06-06 Thread Bruce Momjian

On Wed, Jun  7, 2017 at 12:10:25AM +0900, Dang Minh Huong wrote:
> > On Jun 4, 29 Heisei, at 00:48, Bruce Momjian  wrote:
>  Shouldn't you use "or is_letter_with_marks()", instead of "or len(...)
> > 1"?  Your test might catch something that isn't based on a 'letter'
>  (according to is_plain_letter).  Otherwise this looks pretty good to
>  me.  Please add it to the next commitfest.
> >>> 
> >>> Thanks for confirm, sir.
> >>> I will add it to the next CF soon.
> >> 
> >> Sorry for lately response. I attach the update patch.
> > 
> > Uh, there is no patch attached.
> > 
> 
> Sorry sir, reattach the patch.
> I also added it to the next CF and set reviewers to Thomas Munro. Could you 
> confirm for me.

There seems to be a problem.  I can't see a patch dated 2017-06-07 on
the commitfest page:

https://commitfest.postgresql.org/14/1161/

I added the thread but there was no change.  (I think the thread was
already present.)  It appears it is not seeing this patch as the latest
patch.

Does anyone know why this is happening?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Extra Vietnamese unaccent rules

2017-06-06 Thread Dang Minh Huong

On Jun 4, 29 Heisei, at 00:48, Bruce Momjian  wrote:On Sun, Jun  4, 2017 at 12:43:17AM +0900, Dang Minh Huong wrote:On May 30, 29 Heisei, at 00:22, Dang Minh Huong  wrote:On May 29, 29 Heisei, at 10:47, Thomas Munro > wrote:On Sun, May 28, 2017 at 7:55 PM, Dang Minh Huong > wrote:Thanks for reporting and lecture about unicode.I attached a patch as the instruction from Thomas. Could you confirm it.-   is_plain_letter(table[codepoint.combining_ids[0]]) and \+   (is_plain_letter(table[codepoint.combining_ids[0]]) or\+    len(table[codepoint.combining_ids[0]].combining_ids) > 1) and \Shouldn't you use "or is_letter_with_marks()", instead of "or len(...)1"?  Your test might catch something that isn't based on a 'letter'(according to is_plain_letter).  Otherwise this looks pretty good tome.  Please add it to the next commitfest.Thanks for confirm, sir.I will add it to the next CF soon.Sorry for lately response. I attach the update patch.Uh, there is no patch attached.Sorry sir, reattach the patch.I also added it to the next CF and set reviewers to Thomas Munro. Could you confirm for me.

unaccent.patch
Description: Binary data
---Thanks and best regards,Dang Minh Huong

Re: [HACKERS] Extra Vietnamese unaccent rules

2017-06-03 Thread Michael Paquier

On Mon, May 29, 2017 at 10:47 AM, Thomas Munro
 wrote:
>> [Quoting Michael]
>>> Actually, with the recent work that has been done with
>>> unicode_norm_table.h which has been to transpose UnicodeData.txt into
>>> user-friendly tables, shouldn't the python script of unaccent/ be
>>> replaced by something that works on this table? This does a canonical
>>> decomposition but just keeps the first characters with a class
>>> ordering of 0. So we have basic APIs able to look at UnicodeData.txt
>>> and let caller do decision making with the result returned.
>>
>> Thanks, i will learning about it.
>
> It seems like that could be useful for runtime use (I'm sure there is
> a whole world of Unicode support we could add), but here we're only
> trying to generate a mapping file to add to the source tree, so I'm
> not sure how it's relevant.

Yes, that's what I am coming at, but that would be really dictionnary
specific and that would be roughly to provide a fast-path equivalent
to the tsearch_readline* routines working on files. The addition of
new infrastructure may perhaps not be worth the performance gains.
Definitely for this fix there is no need to do anything more
complicated than tweaking the script to generate the rules.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Extra Vietnamese unaccent rules

2017-06-03 Thread Dang Minh Huong

On May 30, 29 Heisei, at 00:22, Dang Minh Huong  wrote:

unaccent.patch
Description: Binary data
On May 29, 29 Heisei, at 10:47, Thomas Munro  wrote:On Sun, May 28, 2017 at 7:55 PM, Dang Minh Huong  wrote:Thanks for reporting and lecture about unicode.I attached a patch as the instruction from Thomas. Could you confirm it.-   is_plain_letter(table[codepoint.combining_ids[0]]) and \+   (is_plain_letter(table[codepoint.combining_ids[0]]) or\+    len(table[codepoint.combining_ids[0]].combining_ids) > 1) and \Shouldn't you use "or is_letter_with_marks()", instead of "or len(...)1"?  Your test might catch something that isn't based on a 'letter'(according to is_plain_letter).  Otherwise this looks pretty good tome.  Please add it to the next commitfest.Thanks for confirm, sir.I will add it to the next CF soon.Sorry for lately response. I attach the update patch.---Thanks and best regards,Dang Minh Huong

Re: [HACKERS] Extra Vietnamese unaccent rules

2017-05-29 Thread Dang Minh Huong


> On May 29, 29 Heisei, at 10:47, Thomas Munro  
> wrote:
> 
> On Sun, May 28, 2017 at 7:55 PM, Dang Minh Huong  wrote:
>> Thanks for reporting and lecture about unicode.
>> I attached a patch as the instruction from Thomas. Could you confirm it.
> 
> -   is_plain_letter(table[codepoint.combining_ids[0]]) and \
> +   (is_plain_letter(table[codepoint.combining_ids[0]]) or\
> +len(table[codepoint.combining_ids[0]].combining_ids) > 1) and \
> 
> Shouldn't you use "or is_letter_with_marks()", instead of "or len(...)
>> 1"?  Your test might catch something that isn't based on a 'letter'
> (according to is_plain_letter).  Otherwise this looks pretty good to
> me.  Please add it to the next commitfest.

Thanks for confirm, sir.
I will add it to the next CF soon.

> I expect that some users in Vietnam will consider this to be a bugfix,
> which raises the question of whether to backpatch it.  Perhaps we
> could consider fixing it for 10.  Then users of older versions could
> grab the rules file from 10 to use with 9.whatever if they want to do
> that and reindex their data as appropriate.

I am also inclined to the fixing it for 10, because it will not affect to 
current users.
But do you want to back-patch to all supported versions Kha Nguyen?
# I would also want to note that, not only Vietnamese characters were missed to 
add from the rule list.


---
Thanks and best regards,
Dang Minh Huong

Re: [HACKERS] Extra Vietnamese unaccent rules

2017-05-28 Thread Thomas Munro

On Sun, May 28, 2017 at 7:55 PM, Dang Minh Huong  wrote:
> [Quoting Thomas]
>> You don't have to worry about decoding that line, it's all done in
>> that Python script.  The problem is just in the function
>> is_letter_with_marks().  Instead of just checking if combining_ids[0]
>> is a plain letter, it looks like it should also check if
>> combining_ids[0] itself is a letter with marks.  Also get_plain_letter
>> would need to be able to recurse to extract the "a".
>
> Thanks for reporting and lecture about unicode.
> I attached a patch as the instruction from Thomas. Could you confirm it.

-   is_plain_letter(table[codepoint.combining_ids[0]]) and \
+   (is_plain_letter(table[codepoint.combining_ids[0]]) or\
+len(table[codepoint.combining_ids[0]].combining_ids) > 1) and \

Shouldn't you use "or is_letter_with_marks()", instead of "or len(...)
> 1"?  Your test might catch something that isn't based on a 'letter'
(according to is_plain_letter).  Otherwise this looks pretty good to
me.  Please add it to the next commitfest.

I expect that some users in Vietnam will consider this to be a bugfix,
which raises the question of whether to backpatch it.  Perhaps we
could consider fixing it for 10.  Then users of older versions could
grab the rules file from 10 to use with 9.whatever if they want to do
that and reindex their data as appropriate.

> [Quoting Michael]
>> Actually, with the recent work that has been done with
>> unicode_norm_table.h which has been to transpose UnicodeData.txt into
>> user-friendly tables, shouldn't the python script of unaccent/ be
>> replaced by something that works on this table? This does a canonical
>> decomposition but just keeps the first characters with a class
>> ordering of 0. So we have basic APIs able to look at UnicodeData.txt
>> and let caller do decision making with the result returned.
>
> Thanks, i will learning about it.

It seems like that could be useful for runtime use (I'm sure there is
a whole world of Unicode support we could add), but here we're only
trying to generate a mapping file to add to the source tree, so I'm
not sure how it's relevant.

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Extra Vietnamese unaccent rules

2017-05-28 Thread Dang Minh Huong

Hi,

unaccent.patch
Description: Binary data
I am interested in this thread.On May 27, 29 Heisei, at 10:41, Michael Paquier wrote:On Fri, May 26, 2017 at 5:48 PM, Thomas Munro wrote:Unicode has two ways to represent characters with accents: either withcomposed codepoints like "é" or decomposed codepoints where you say"e" and then "´". The field "00E2 0301" is the decomposed form ofthat character above. Our job here is to identify the basic letterthat each composed character contains, by analysing the decomposedfield that you see in that line. I failed to realise that characterswith TWO accents are described as a composed character with ONE accentplus another accent.Doesn't that depend on the NF operation you are working on? With acanonical decomposition it seems to me that a character with twoaccents can as well be decomposed with one character and two composingcharacter accents (NFKC does a canonical decomposition in one of itssteps).You don't have to worry about decoding that line, it's all done inthat Python script. The problem is just in the functionis_letter_with_marks(). Instead of just checking if combining_ids[0]is a plain letter, it looks like it should also check ifcombining_ids[0] itself is a letter with marks. Also get_plain_letterwould need to be able to recurse to extract the "a".Thanks for reporting and lecture about unicode.I attached a patch as the instruction from Thomas. Could you confirm it.Actually, with the recent work that has been done withunicode_norm_table.h which has been to transpose UnicodeData.txt intouser-friendly tables, shouldn't the python script of unaccent/ bereplaced by something that works on this table? This does a canonicaldecomposition but just keeps the first characters with a classordering of 0. So we have basic APIs able to look at UnicodeData.txtand let caller do decision making with the result returned.-- MichaelThanks, i will learning about it.---Dang Minh Huong

Re: [HACKERS] Extra Vietnamese unaccent rules

2017-05-27 Thread Kha Nguyen

Does this mean that the python script has to be updated to be recursive too?

> On 27 May 2017, at 0.48, Thomas Munro  wrote:
> 
> On Sat, May 27, 2017 at 9:09 AM, Kha Nguyen  wrote:
>> Could you explain to me what this line means:
>> “
>> 1EA5;LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE;Ll;0;L;00E2
>> 0301N;;;1EA4;;1EA4
>> “
>> 
>> If you could give me an example of adding a rule for “recursive” case, I can 
>> do the rest. I am not familiar with this unaccent format generation yet.
> 
> So contrib/unaccent/generate_unaccent_rules.py is a Python script that
> takes UnicodeData.txt, a list of information about all Unicode
> codepoints available at a URL that is shown in a comment, and
> generates unaccent.rules.  The idea was to avoid having to change it
> manually every time someone finds characters that should be in there
> (as you have just done!) by doing it systematically.
> 
> Unicode has two ways to represent characters with accents: either with
> composed codepoints like "é" or decomposed codepoints where you say
> "e" and then "´".  The field "00E2 0301" is the decomposed form of
> that character above.  Our job here is to identify the basic letter
> that each composed character contains, by analysing the decomposed
> field that you see in that line.  I failed to realise that characters
> with TWO accents are described as a composed character with ONE accent
> plus another accent.
> 
> You don't have to worry about decoding that line, it's all done in
> that Python script.  The problem is just in the function
> is_letter_with_marks().  Instead of just checking if combining_ids[0]
> is a plain letter, it looks like it should also check if
> combining_ids[0] itself is a letter with marks.  Also get_plain_letter
> would need to be able to recurse to extract the "a".
> 
> I hope that helps!
> 
> -- 
> Thomas Munro
> http://www.enterprisedb.com



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Extra Vietnamese unaccent rules

2017-05-27 Thread Kha Nguyen

Could you explain to me what this line means:
“
1EA5;LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE;Ll;0;L;00E2
0301N;;;1EA4;;1EA4
“

If you could give me an example of adding a rule for “recursive” case, I can do 
the rest. I am not familiar with this unaccent format generation yet.

Thanks
Kha

> On 26 May 2017, at 21.19, Thomas Munro  wrote:
> 
> On Sat, May 27, 2017 at 5:13 AM, Tom Lane  wrote:
>> I wrote:
>>> Nguyen Le Hoang Kha  writes:
 Most of the time in Vietnamese language, there are up to 2 accents in a
 character. These unaccent rules are added to handle such cases (which are
 very common).
>> 
>>> I can't see any reason not to add these --- any objections out there?
>> 
>> Oh, wait a minute.  Patching unaccent.rules directly isn't the way
>> to do this; that file is supposed to be generated by
>> generate_unaccent_rules.py.  Can you see how to modify that script
>> to produce these rules?
> 
> Looking at one example from this patch:
> 
> UTF8: 
> Codepoint: 1EA5
> Name: LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE
> 
> In UnicodData.txt it's this line:
> 
> 1EA5;LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE;Ll;0;L;00E2
> 0301N;;;1EA4;;1EA4
> 
> The problem is that generate_unaccent_rules.py assumes that the
> composing data is a plain letter followed by some number of
> diacritical modifiers.  That's true for the characters with a single
> accent, but in this multi-accent case it's *composed* character 00E2
> (LATIN SMALL LETTER A WITH CIRCUMFLEX) and a diacritical marker 0301
> (COMBINING ACCENT ACUTE).  So we need to teach it to be recursive.
> 
> -- 
> Thomas Munro
> http://www.enterprisedb.com



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Extra Vietnamese unaccent rules

2017-05-26 Thread Michael Paquier

On Fri, May 26, 2017 at 5:48 PM, Thomas Munro
 wrote:
> Unicode has two ways to represent characters with accents: either with
> composed codepoints like "é" or decomposed codepoints where you say
> "e" and then "´".  The field "00E2 0301" is the decomposed form of
> that character above.  Our job here is to identify the basic letter
> that each composed character contains, by analysing the decomposed
> field that you see in that line.  I failed to realise that characters
> with TWO accents are described as a composed character with ONE accent
> plus another accent.

Doesn't that depend on the NF operation you are working on? With a
canonical decomposition it seems to me that a character with two
accents can as well be decomposed with one character and two composing
character accents (NFKC does a canonical decomposition in one of its
steps).

> You don't have to worry about decoding that line, it's all done in
> that Python script.  The problem is just in the function
> is_letter_with_marks().  Instead of just checking if combining_ids[0]
> is a plain letter, it looks like it should also check if
> combining_ids[0] itself is a letter with marks.  Also get_plain_letter
> would need to be able to recurse to extract the "a".

Actually, with the recent work that has been done with
unicode_norm_table.h which has been to transpose UnicodeData.txt into
user-friendly tables, shouldn't the python script of unaccent/ be
replaced by something that works on this table? This does a canonical
decomposition but just keeps the first characters with a class
ordering of 0. So we have basic APIs able to look at UnicodeData.txt
and let caller do decision making with the result returned.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Extra Vietnamese unaccent rules

2017-05-26 Thread Thomas Munro

On Sat, May 27, 2017 at 9:09 AM, Kha Nguyen  wrote:
> Could you explain to me what this line means:
> “
> 1EA5;LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE;Ll;0;L;00E2
> 0301N;;;1EA4;;1EA4
> “
>
> If you could give me an example of adding a rule for “recursive” case, I can 
> do the rest. I am not familiar with this unaccent format generation yet.

So contrib/unaccent/generate_unaccent_rules.py is a Python script that
takes UnicodeData.txt, a list of information about all Unicode
codepoints available at a URL that is shown in a comment, and
generates unaccent.rules.  The idea was to avoid having to change it
manually every time someone finds characters that should be in there
(as you have just done!) by doing it systematically.

Unicode has two ways to represent characters with accents: either with
composed codepoints like "é" or decomposed codepoints where you say
"e" and then "´".  The field "00E2 0301" is the decomposed form of
that character above.  Our job here is to identify the basic letter
that each composed character contains, by analysing the decomposed
field that you see in that line.  I failed to realise that characters
with TWO accents are described as a composed character with ONE accent
plus another accent.

You don't have to worry about decoding that line, it's all done in
that Python script.  The problem is just in the function
is_letter_with_marks().  Instead of just checking if combining_ids[0]
is a plain letter, it looks like it should also check if
combining_ids[0] itself is a letter with marks.  Also get_plain_letter
would need to be able to recurse to extract the "a".

I hope that helps!

-- 
Thomas Munro
http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Extra Vietnamese unaccent rules

2017-05-26 Thread Thomas Munro

On Sat, May 27, 2017 at 5:13 AM, Tom Lane  wrote:
> I wrote:
>> Nguyen Le Hoang Kha  writes:
>>> Most of the time in Vietnamese language, there are up to 2 accents in a
>>> character. These unaccent rules are added to handle such cases (which are
>>> very common).
>
>> I can't see any reason not to add these --- any objections out there?
>
> Oh, wait a minute.  Patching unaccent.rules directly isn't the way
> to do this; that file is supposed to be generated by
> generate_unaccent_rules.py.  Can you see how to modify that script
> to produce these rules?

Looking at one example from this patch:

UTF8: 
Codepoint: 1EA5
Name: LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE

In UnicodData.txt it's this line:

1EA5;LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE;Ll;0;L;00E2
0301N;;;1EA4;;1EA4

The problem is that generate_unaccent_rules.py assumes that the
composing data is a plain letter followed by some number of
diacritical modifiers.  That's true for the characters with a single
accent, but in this multi-accent case it's *composed* character 00E2
(LATIN SMALL LETTER A WITH CIRCUMFLEX) and a diacritical marker 0301
(COMBINING ACCENT ACUTE).  So we need to teach it to be recursive.

-- 
Thomas Munro
http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Extra Vietnamese unaccent rules

2017-05-26 Thread Tom Lane

I wrote:
> Nguyen Le Hoang Kha  writes:
>> Most of the time in Vietnamese language, there are up to 2 accents in a
>> character. These unaccent rules are added to handle such cases (which are
>> very common).

> I can't see any reason not to add these --- any objections out there?

Oh, wait a minute.  Patching unaccent.rules directly isn't the way
to do this; that file is supposed to be generated by
generate_unaccent_rules.py.  Can you see how to modify that script
to produce these rules?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Extra Vietnamese unaccent rules

2017-05-26 Thread Tom Lane

Nguyen Le Hoang Kha  writes:
> Most of the time in Vietnamese language, there are up to 2 accents in a
> character. These unaccent rules are added to handle such cases (which are
> very common).

I can't see any reason not to add these --- any objections out there?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Extra Vietnamese unaccent rules

2017-05-26 Thread Nguyen Le Hoang Kha

Most of the time in Vietnamese language, there are up to 2 accents in a
character. These unaccent rules are added to handle such cases (which are
very common).

Kha Nguyen | nlhkh@github


vietnamese-unaccent-rules.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Extra Vietnamese unaccent rules

Re: [HACKERS] Extra Vietnamese unaccent rules

Re: [HACKERS] Extra Vietnamese unaccent rules

Re: [HACKERS] Extra Vietnamese unaccent rules

Re: [HACKERS] Extra Vietnamese unaccent rules

Re: [HACKERS] Extra Vietnamese unaccent rules

Re: [HACKERS] Extra Vietnamese unaccent rules

Re: [HACKERS] Extra Vietnamese unaccent rules

Re: [HACKERS] Extra Vietnamese unaccent rules

Re: [HACKERS] Extra Vietnamese unaccent rules

Re: [HACKERS] Extra Vietnamese unaccent rules

Re: [HACKERS] Extra Vietnamese unaccent rules

Re: [HACKERS] Extra Vietnamese unaccent rules

Re: [HACKERS] Extra Vietnamese unaccent rules

Re: [HACKERS] Extra Vietnamese unaccent rules

Re: [HACKERS] Extra Vietnamese unaccent rules

Re: [HACKERS] Extra Vietnamese unaccent rules

Re: [HACKERS] Extra Vietnamese unaccent rules

Re: [HACKERS] Extra Vietnamese unaccent rules

Re: [HACKERS] Extra Vietnamese unaccent rules

Re: [HACKERS] Extra Vietnamese unaccent rules

Re: [HACKERS] Extra Vietnamese unaccent rules

Re: [HACKERS] Extra Vietnamese unaccent rules

Re: [HACKERS] Extra Vietnamese unaccent rules

[HACKERS] Extra Vietnamese unaccent rules

25 matches

Site Navigation

Mail list logo

Footer information