Re: [HACKERS] Extra Vietnamese unaccent rules
Thanks! On 2017/08/17 11:56, Tom Lane wrote: Michael Paquier writes: On Thu, Aug 17, 2017 at 6:01 AM, Tom Lane wrote: I'm not really qualified to review the Python coding style, but I did fix a typo in a comment. No pythonist here, but a large confusing "if" condition without any comments is better if split up and explained with comments if that can help in clarifying what the code is doing in any language, so thanks for keeping the code intact. Certainly agreed on splitting up the logic into multiple statements. I just meant that I don't know enough Python to know if there are better ways to do these tests. (It probably doesn't matter, since performance of this script is not an issue, and it's not likely to undergo a lot of further development either.) regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Extra Vietnamese unaccent rules
Michael Paquier writes: > On Thu, Aug 17, 2017 at 6:01 AM, Tom Lane wrote: >> I'm not really qualified to review the Python coding >> style, but I did fix a typo in a comment. > No pythonist here, but a large confusing "if" condition without any > comments is better if split up and explained with comments if that can > help in clarifying what the code is doing in any language, so thanks > for keeping the code intact. Certainly agreed on splitting up the logic into multiple statements. I just meant that I don't know enough Python to know if there are better ways to do these tests. (It probably doesn't matter, since performance of this script is not an issue, and it's not likely to undergo a lot of further development either.) regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Extra Vietnamese unaccent rules
On Thu, Aug 17, 2017 at 6:01 AM, Tom Lane wrote: > Pushed into v11. Thanks. > I'm not really qualified to review the Python coding > style, but I did fix a typo in a comment. No pythonist here, but a large confusing "if" condition without any comments is better if split up and explained with comments if that can help in clarifying what the code is doing in any language, so thanks for keeping the code intact. > BTW, while this isn't a reason to delay this patch, I wonder whether > the regression test for unaccent is really adequate. According to > https://coverage.postgresql.org/contrib/unaccent/unaccent.c.gcov.html > it isn't doing anything to check multicharacter source strings, and > what's considerably more disturbing, it isn't exercising the PG_CATCH > code that's meant to deal with characters outside the current database's > encoding. Yeah, that could be improved a bit. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Extra Vietnamese unaccent rules
Dang Minh Huong writes:
> On 2017/07/05 15:28, Michael Paquier wrote:
>> (Surprised to see that generate_unaccent_rules.py is inconsistent on
>> MacOS, runs fine on Linux).

FWIW, I got identical results from running the script on current
macOS (Sierra) and Linux (RHEL6).

>> Testing with characters having two accents, the results are produced
>> as wanted. I am attaching an updated patch with all those
>> simplifications. Thoughts?

> Thanks, so pretty. The patch is fine to me.

Pushed into v11. I'm not really qualified to review the Python coding
style, but I did fix a typo in a comment.

BTW, while this isn't a reason to delay this patch, I wonder whether
the regression test for unaccent is really adequate. According to
https://coverage.postgresql.org/contrib/unaccent/unaccent.c.gcov.html
it isn't doing anything to check multicharacter source strings, and
what's considerably more disturbing, it isn't exercising the PG_CATCH
code that's meant to deal with characters outside the current database's
encoding.

regards, tom lane
Re: [HACKERS] Extra Vietnamese unaccent rules
On 2017/07/05 15:28, Michael Paquier wrote:
> I have finally been able to look at this patch.

Thanks for reviewing and for the new version of the patch.

> (Surprised to see that generate_unaccent_rules.py is inconsistent on
> MacOS, runs fine on Linux).
>
>  def get_plain_letter(codepoint, table):
>      """Return the base codepoint without marks."""
>      if is_letter_with_marks(codepoint, table):
> -        return table[codepoint.combining_ids[0]]
> +        if len(table[codepoint.combining_ids[0]].combining_ids) > 1:
> +            # Recursive to find the plain letter
> +            return get_plain_letter(table[codepoint.combining_ids[0]],table)
> +        elif is_plain_letter(table[codepoint.combining_ids[0]]):
> +            return table[codepoint.combining_ids[0]]
> +        else:
> +            return None
>      elif is_plain_letter(codepoint):
>          return codepoint
>      else:
> -        raise "mu"
> +        return None
>
> The code paths returning None should not be reached, so I would
> suggest adding an assertion instead. Callers of get_plain_letter would
> blow up on None, still that would make future debugging harder.
>
>  def is_letter_with_marks(codepoint, table):
> -    """Returns true for plain letters combined with one or more marks."""
> +    """Returns true for letters combined with one or more marks."""
>      # See http://www.unicode.org/reports/tr44/tr44-14.html#General_Category_Values
>      return len(codepoint.combining_ids) > 1 and \
> -        is_plain_letter(table[codepoint.combining_ids[0]]) and \
> +        (is_plain_letter(table[codepoint.combining_ids[0]]) or\
> +         is_letter_with_marks(table[codepoint.combining_ids[0]],table)) and \
>          all(is_mark(table[i]) for i in codepoint.combining_ids[1:])
>
> This was already hard to follow, and this patch makes it harder. I
> think that the thing should be refactored with multiple conditions.
>
>      if is_letter_with_marks(codepoint, table):
> -        charactersSet.add((codepoint.id,
> +        if get_plain_letter(codepoint, table) <> None:
> +            charactersSet.add((codepoint.id,
>
> This change is not necessary as a letter with marks is not a plain
> character anyway.
>
> Testing with characters having two accents, the results are produced
> as wanted. I am attaching an updated patch with all those
> simplifications. Thoughts?

Thanks, so pretty. The patch is fine to me.

---
Thanks and best regards,
Dang Minh Huong
Re: [HACKERS] Extra Vietnamese unaccent rules
On Wed, Jun 7, 2017 at 1:06 AM, Man Trieu wrote:
> 2017-06-07 0:31 GMT+09:00 Bruce Momjian:
>> On Wed, Jun 7, 2017 at 12:10:25AM +0900, Dang Minh Huong wrote:
>> > > On Jun 4, 2017, at 00:48, Bruce Momjian wrote:
>> > >>>> Shouldn't you use "or is_letter_with_marks()", instead of "or
>> > >>>> len(...) > 1"? Your test might catch something that isn't based
>> > >>>> on a 'letter' (according to is_plain_letter). Otherwise this
>> > >>>> looks pretty good to me. Please add it to the next commitfest.
>> > >>>
>> > >>> Thanks for confirm, sir.
>> > >>> I will add it to the next CF soon.
>> > >>
>> > >> Sorry for lately response. I attach the update patch.
>> > >
>> > > Uh, there is no patch attached.
>> >
>> > Sorry sir, reattach the patch.
>> > I also added it to the next CF and set reviewers to Thomas Munro. Could
>> > you confirm for me.
>>
>> There seems to be a problem. I can't see a patch dated 2017-06-07 on
>> the commitfest page:
>>
>> https://commitfest.postgresql.org/14/1161/
>>
>> I added the thread but there was no change. (I think the thread was
>> already present.) It appears it is not seeing this patch as the latest
>> patch.
>>
>> Does anyone know why this is happening?
>
> May be due to my Mac's mailer? Sorry but I try one more time to attach the
> patch by webmail.

I have finally been able to look at this patch. (Surprised to see that
generate_unaccent_rules.py is inconsistent on MacOS, runs fine on Linux).

 def get_plain_letter(codepoint, table):
     """Return the base codepoint without marks."""
     if is_letter_with_marks(codepoint, table):
-        return table[codepoint.combining_ids[0]]
+        if len(table[codepoint.combining_ids[0]].combining_ids) > 1:
+            # Recursive to find the plain letter
+            return get_plain_letter(table[codepoint.combining_ids[0]],table)
+        elif is_plain_letter(table[codepoint.combining_ids[0]]):
+            return table[codepoint.combining_ids[0]]
+        else:
+            return None
     elif is_plain_letter(codepoint):
         return codepoint
     else:
-        raise "mu"
+        return None

The code paths returning None should not be reached, so I would suggest
adding an assertion instead. Callers of get_plain_letter would blow up
on None anyway, but returning it silently would make future debugging
harder.

 def is_letter_with_marks(codepoint, table):
-    """Returns true for plain letters combined with one or more marks."""
+    """Returns true for letters combined with one or more marks."""
     # See http://www.unicode.org/reports/tr44/tr44-14.html#General_Category_Values
     return len(codepoint.combining_ids) > 1 and \
-        is_plain_letter(table[codepoint.combining_ids[0]]) and \
+        (is_plain_letter(table[codepoint.combining_ids[0]]) or\
+         is_letter_with_marks(table[codepoint.combining_ids[0]],table)) and \
         all(is_mark(table[i]) for i in codepoint.combining_ids[1:])

This was already hard to follow, and this patch makes it harder. I
think that the thing should be refactored with multiple conditions.

     if is_letter_with_marks(codepoint, table):
-        charactersSet.add((codepoint.id,
+        if get_plain_letter(codepoint, table) <> None:
+            charactersSet.add((codepoint.id,

This change is not necessary as a letter with marks is not a plain
character anyway.

Testing with characters having two accents, the results are produced
as wanted. I am attaching an updated patch with all those
simplifications. Thoughts?
--
Michael

diff --git a/contrib/unaccent/generate_unaccent_rules.py b/contrib/unaccent/generate_unaccent_rules.py
index a5eb42f0b1..9135ec23ce 100644
--- a/contrib/unaccent/generate_unaccent_rules.py
+++ b/contrib/unaccent/generate_unaccent_rules.py
@@ -48,24 +48,47 @@ def is_mark(codepoint):
     return codepoint.general_category in ("Mn", "Me", "Mc")

 def is_letter_with_marks(codepoint, table):
-    """Returns true for plain letters combined with one or more marks."""
+    """Returns true for letters combined with one or more marks."""
     # See http://www.unicode.org/reports/tr44/tr44-14.html#General_Category_Values
-    return len(codepoint.combining_ids) > 1 and \
-        is_plain_letter(table[codepoint.combining_ids[0]]) and \
-        all(is_mark(table[i]) for i in codepoint.combining_ids[1:])
+
+    # Letter may have no combining characters, in which case it has
+    # no marks.
+    if len(codepoint.combining_ids) == 1:
+        return False
+
+    # A letter without diatritical marks has none of them.
+    if any(is_mark(table[i]) for i in codepoint.combining_ids[1:]) is False:
+        return False
+
+    # Check if the base letter of this letter has marks.
+    codepoint_base = codepoint.combining_ids[0]
+    if (is_plain_letter(table[codepoint_base]) is False and \
+        is_letter_with_marks(table[codepoint_base], table) is False):
+        return False
+
+
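The two review points in this message (split the compound condition into separate tests, and assert instead of returning None) could be sketched roughly as below. This is only an illustration in Python 3, not the committed patch: the `Codepoint` class, the sample `table`, and the ASCII-only definition of `is_plain_letter` are assumptions made so the example runs on its own, and an empty `combining_ids` list is assumed to mean "no decomposition".

```python
class Codepoint:
    """Hypothetical stand-in for the script's per-codepoint record."""
    def __init__(self, id, general_category, combining_ids):
        self.id = id
        self.general_category = general_category
        self.combining_ids = combining_ids


def is_mark(codepoint):
    return codepoint.general_category in ("Mn", "Me", "Mc")


def is_plain_letter(codepoint):
    # Assumption for this sketch: "plain" means an ASCII letter.
    return (ord("a") <= codepoint.id <= ord("z")
            or ord("A") <= codepoint.id <= ord("Z"))


def is_letter_with_marks(codepoint, table):
    """True for a letter combined with one or more marks."""
    # Fewer than two elements: there is nothing combined with a base.
    if len(codepoint.combining_ids) < 2:
        return False
    # Everything after the first element must be a mark.
    if not all(is_mark(table[i]) for i in codepoint.combining_ids[1:]):
        return False
    # The first element is a plain letter or, recursively, a letter
    # that itself carries marks.
    base = table[codepoint.combining_ids[0]]
    return is_plain_letter(base) or is_letter_with_marks(base, table)


def get_plain_letter(codepoint, table):
    """Return the base codepoint without marks."""
    if is_letter_with_marks(codepoint, table):
        return get_plain_letter(table[codepoint.combining_ids[0]], table)
    # Any other input is treated as a caller bug, so fail loudly.
    assert is_plain_letter(codepoint), "not a letter: U+%04X" % codepoint.id
    return codepoint


# Tiny hand-written table covering U+1EA5 and what it decomposes into.
table = {
    0x0061: Codepoint(0x0061, "Ll", []),
    0x0301: Codepoint(0x0301, "Mn", []),
    0x0302: Codepoint(0x0302, "Mn", []),
    0x00E2: Codepoint(0x00E2, "Ll", [0x0061, 0x0302]),
    0x1EA5: Codepoint(0x1EA5, "Ll", [0x00E2, 0x0301]),
}
print(chr(get_plain_letter(table[0x1EA5], table).id))  # a
```

With the assertion in place, a codepoint that is neither a plain letter nor a letter with marks stops the script at the point of the mistake instead of propagating None to the caller.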
Re: [HACKERS] Extra Vietnamese unaccent rules
On Tue, Jun 6, 2017 at 12:15:13PM -0400, Tom Lane wrote: > Bruce Momjian writes: > > There seems to be a problem. I can't see a patch dated 2017-06-07 on > > the commitfest page: > > https://commitfest.postgresql.org/14/1161/ > > It looks to me like the patch is buried inside a multipart/alternative > MIME section. That's evidently causing our mail archives to miss its > presence. The latest message does show as having an attachment in the > archives, but I think there's some delay before the CF app will notice. OK, I see had picked up my email as the lastest, not the latest patch. I see now the second patch email appears properly on the webpage, so we are good: https://commitfest.postgresql.org/14/1161/ -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Extra Vietnamese unaccent rules
Bruce Momjian writes: > There seems to be a problem. I can't see a patch dated 2017-06-07 on > the commitfest page: > https://commitfest.postgresql.org/14/1161/ It looks to me like the patch is buried inside a multipart/alternative MIME section. That's evidently causing our mail archives to miss its presence. The latest message does show as having an attachment in the archives, but I think there's some delay before the CF app will notice. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Extra Vietnamese unaccent rules
On Wed, Jun 7, 2017 at 01:06:22AM +0900, Man Trieu wrote: > 2017-06-07 0:31 GMT+09:00 Bruce Momjian : > I added the thread but there was no change. (I think the thread was > already present.) It appears it is not seeing this patch as the latest > patch. > > Does anyone know why this is happening? > > > > May be due to my Mac's mailer? Sorry but I try one more time to attach the > patch by webmail. It is getting weirder. It has picked up my email report of a commitfest problem as the latest patch (even though there was no patch), and your second posting is not listed: https://commitfest.postgresql.org/14/1161/ I think we need someone who knows the rules of how the commitfest finds patches. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Extra Vietnamese unaccent rules
2017-06-07 0:31 GMT+09:00 Bruce Momjian : > On Wed, Jun 7, 2017 at 12:10:25AM +0900, Dang Minh Huong wrote: > > > On Jun 4, 29 Heisei, at 00:48, Bruce Momjian wrote: > > Shouldn't you use "or is_letter_with_marks()", instead of "or > len(...) > > > 1"? Your test might catch something that isn't based on a 'letter' > > (according to is_plain_letter). Otherwise this looks pretty good to > > me. Please add it to the next commitfest. > > >>> > > >>> Thanks for confirm, sir. > > >>> I will add it to the next CF soon. > > >> > > >> Sorry for lately response. I attach the update patch. > > > > > > Uh, there is no patch attached. > > > > > > > Sorry sir, reattach the patch. > > I also added it to the next CF and set reviewers to Thomas Munro. Could > you confirm for me. > > There seems to be a problem. I can't see a patch dated 2017-06-07 on > the commitfest page: > > https://commitfest.postgresql.org/14/1161/ > > I added the thread but there was no change. (I think the thread was > already present.) It appears it is not seeing this patch as the latest > patch. > > Does anyone know why this is happening? > May be due to my Mac's mailer? Sorry but I try one more time to attach the patch by webmail. --- Thanks and best regards, Dang Minh Huong unaccent.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Extra Vietnamese unaccent rules
On Wed, Jun 7, 2017 at 12:10:25AM +0900, Dang Minh Huong wrote: > > On Jun 4, 29 Heisei, at 00:48, Bruce Momjian wrote: > Shouldn't you use "or is_letter_with_marks()", instead of "or len(...) > > 1"? Your test might catch something that isn't based on a 'letter' > (according to is_plain_letter). Otherwise this looks pretty good to > me. Please add it to the next commitfest. > >>> > >>> Thanks for confirm, sir. > >>> I will add it to the next CF soon. > >> > >> Sorry for lately response. I attach the update patch. > > > > Uh, there is no patch attached. > > > > Sorry sir, reattach the patch. > I also added it to the next CF and set reviewers to Thomas Munro. Could you > confirm for me. There seems to be a problem. I can't see a patch dated 2017-06-07 on the commitfest page: https://commitfest.postgresql.org/14/1161/ I added the thread but there was no change. (I think the thread was already present.) It appears it is not seeing this patch as the latest patch. Does anyone know why this is happening? -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Extra Vietnamese unaccent rules
On Jun 4, 2017, at 00:48, Bruce Momjian wrote:
> On Sun, Jun 4, 2017 at 12:43:17AM +0900, Dang Minh Huong wrote:
>> On May 30, 2017, at 00:22, Dang Minh Huong wrote:
>>> On May 29, 2017, at 10:47, Thomas Munro wrote:
>>>> On Sun, May 28, 2017 at 7:55 PM, Dang Minh Huong wrote:
>>>>> Thanks for reporting and lecture about unicode.
>>>>> I attached a patch as the instruction from Thomas. Could you confirm it.
>>>>
>>>> -    is_plain_letter(table[codepoint.combining_ids[0]]) and \
>>>> +    (is_plain_letter(table[codepoint.combining_ids[0]]) or\
>>>> +     len(table[codepoint.combining_ids[0]].combining_ids) > 1) and \
>>>>
>>>> Shouldn't you use "or is_letter_with_marks()", instead of "or
>>>> len(...) > 1"? Your test might catch something that isn't based on
>>>> a 'letter' (according to is_plain_letter). Otherwise this looks
>>>> pretty good to me. Please add it to the next commitfest.
>>>
>>> Thanks for confirm, sir.
>>> I will add it to the next CF soon.
>>
>> Sorry for lately response. I attach the update patch.
>
> Uh, there is no patch attached.

Sorry, sir. I have reattached the patch.
I also added it to the next CF and set the reviewer to Thomas Munro.
Could you confirm it for me?

unaccent.patch
Description: Binary data

---
Thanks and best regards,
Dang Minh Huong
Re: [HACKERS] Extra Vietnamese unaccent rules
On Mon, May 29, 2017 at 10:47 AM, Thomas Munro wrote: >> [Quoting Michael] >>> Actually, with the recent work that has been done with >>> unicode_norm_table.h which has been to transpose UnicodeData.txt into >>> user-friendly tables, shouldn't the python script of unaccent/ be >>> replaced by something that works on this table? This does a canonical >>> decomposition but just keeps the first characters with a class >>> ordering of 0. So we have basic APIs able to look at UnicodeData.txt >>> and let caller do decision making with the result returned. >> >> Thanks, i will learning about it. > > It seems like that could be useful for runtime use (I'm sure there is > a whole world of Unicode support we could add), but here we're only > trying to generate a mapping file to add to the source tree, so I'm > not sure how it's relevant. Yes, that's what I am coming at, but that would be really dictionnary specific and that would be roughly to provide a fast-path equivalent to the tsearch_readline* routines working on files. The addition of new infrastructure may perhaps not be worth the performance gains. Definitely for this fix there is no need to do anything more complicated than tweaking the script to generate the rules. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Extra Vietnamese unaccent rules
On May 30, 2017, at 00:22, Dang Minh Huong wrote:
> On May 29, 2017, at 10:47, Thomas Munro wrote:
>> On Sun, May 28, 2017 at 7:55 PM, Dang Minh Huong wrote:
>>> Thanks for reporting and lecture about unicode.
>>> I attached a patch as the instruction from Thomas. Could you confirm it.
>>
>> -    is_plain_letter(table[codepoint.combining_ids[0]]) and \
>> +    (is_plain_letter(table[codepoint.combining_ids[0]]) or\
>> +     len(table[codepoint.combining_ids[0]].combining_ids) > 1) and \
>>
>> Shouldn't you use "or is_letter_with_marks()", instead of "or
>> len(...) > 1"? Your test might catch something that isn't based on
>> a 'letter' (according to is_plain_letter). Otherwise this looks
>> pretty good to me. Please add it to the next commitfest.
>
> Thanks for confirm, sir.
> I will add it to the next CF soon.

Sorry for the late response. I attach the updated patch.

unaccent.patch
Description: Binary data

---
Thanks and best regards,
Dang Minh Huong
Re: [HACKERS] Extra Vietnamese unaccent rules
> On May 29, 29 Heisei, at 10:47, Thomas Munro > wrote: > > On Sun, May 28, 2017 at 7:55 PM, Dang Minh Huong wrote: >> Thanks for reporting and lecture about unicode. >> I attached a patch as the instruction from Thomas. Could you confirm it. > > - is_plain_letter(table[codepoint.combining_ids[0]]) and \ > + (is_plain_letter(table[codepoint.combining_ids[0]]) or\ > +len(table[codepoint.combining_ids[0]].combining_ids) > 1) and \ > > Shouldn't you use "or is_letter_with_marks()", instead of "or len(...) >> 1"? Your test might catch something that isn't based on a 'letter' > (according to is_plain_letter). Otherwise this looks pretty good to > me. Please add it to the next commitfest. Thanks for confirm, sir. I will add it to the next CF soon. > I expect that some users in Vietnam will consider this to be a bugfix, > which raises the question of whether to backpatch it. Perhaps we > could consider fixing it for 10. Then users of older versions could > grab the rules file from 10 to use with 9.whatever if they want to do > that and reindex their data as appropriate. I am also inclined to the fixing it for 10, because it will not affect to current users. But do you want to back-patch to all supported versions Kha Nguyen? # I would also want to note that, not only Vietnamese characters were missed to add from the rule list. --- Thanks and best regards, Dang Minh Huong
Re: [HACKERS] Extra Vietnamese unaccent rules
On Sun, May 28, 2017 at 7:55 PM, Dang Minh Huong wrote:
> [Quoting Thomas]
>> You don't have to worry about decoding that line, it's all done in
>> that Python script. The problem is just in the function
>> is_letter_with_marks(). Instead of just checking if combining_ids[0]
>> is a plain letter, it looks like it should also check if
>> combining_ids[0] itself is a letter with marks. Also get_plain_letter
>> would need to be able to recurse to extract the "a".
>
> Thanks for reporting and lecture about unicode.
> I attached a patch as the instruction from Thomas. Could you confirm it.

-    is_plain_letter(table[codepoint.combining_ids[0]]) and \
+    (is_plain_letter(table[codepoint.combining_ids[0]]) or\
+     len(table[codepoint.combining_ids[0]].combining_ids) > 1) and \

Shouldn't you use "or is_letter_with_marks()", instead of "or len(...)
> 1"? Your test might catch something that isn't based on a 'letter'
(according to is_plain_letter). Otherwise this looks pretty good to
me. Please add it to the next commitfest.

I expect that some users in Vietnam will consider this to be a bugfix,
which raises the question of whether to backpatch it. Perhaps we
could consider fixing it for 10. Then users of older versions could
grab the rules file from 10 to use with 9.whatever if they want to do
that and reindex their data as appropriate.

> [Quoting Michael]
>> Actually, with the recent work that has been done with
>> unicode_norm_table.h which has been to transpose UnicodeData.txt into
>> user-friendly tables, shouldn't the python script of unaccent/ be
>> replaced by something that works on this table? This does a canonical
>> decomposition but just keeps the first characters with a class
>> ordering of 0. So we have basic APIs able to look at UnicodeData.txt
>> and let caller do decision making with the result returned.
>
> Thanks, i will learning about it.

It seems like that could be useful for runtime use (I'm sure there is
a whole world of Unicode support we could add), but here we're only
trying to generate a mapping file to add to the source tree, so I'm
not sure how it's relevant.

--
Thomas Munro
http://www.enterprisedb.com
Re: [HACKERS] Extra Vietnamese unaccent rules
Hi,

I am interested in this thread.

On May 27, 2017, at 10:41, Michael Paquier wrote:
> On Fri, May 26, 2017 at 5:48 PM, Thomas Munro wrote:
>> Unicode has two ways to represent characters with accents: either with
>> composed codepoints like "é" or decomposed codepoints where you say
>> "e" and then "´". The field "00E2 0301" is the decomposed form of
>> that character above. Our job here is to identify the basic letter
>> that each composed character contains, by analysing the decomposed
>> field that you see in that line. I failed to realise that characters
>> with TWO accents are described as a composed character with ONE accent
>> plus another accent.
>
> Doesn't that depend on the NF operation you are working on? With a
> canonical decomposition it seems to me that a character with two
> accents can as well be decomposed with one character and two composing
> character accents (NFKC does a canonical decomposition in one of its
> steps).
>
>> You don't have to worry about decoding that line, it's all done in
>> that Python script. The problem is just in the function
>> is_letter_with_marks(). Instead of just checking if combining_ids[0]
>> is a plain letter, it looks like it should also check if
>> combining_ids[0] itself is a letter with marks. Also get_plain_letter
>> would need to be able to recurse to extract the "a".

Thanks for the report and the lecture about Unicode.
I have attached a patch following Thomas's instructions. Could you
confirm it?

> Actually, with the recent work that has been done with
> unicode_norm_table.h which has been to transpose UnicodeData.txt into
> user-friendly tables, shouldn't the python script of unaccent/ be
> replaced by something that works on this table? This does a canonical
> decomposition but just keeps the first characters with a class
> ordering of 0. So we have basic APIs able to look at UnicodeData.txt
> and let caller do decision making with the result returned.
> --
> Michael

Thanks, I will learn about it.

unaccent.patch
Description: Binary data

---
Dang Minh Huong
Re: [HACKERS] Extra Vietnamese unaccent rules
Does this mean that the python script has to be updated to be recursive too? > On 27 May 2017, at 0.48, Thomas Munro wrote: > > On Sat, May 27, 2017 at 9:09 AM, Kha Nguyen wrote: >> Could you explain to me what this line means: >> “ >> 1EA5;LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE;Ll;0;L;00E2 >> 0301N;;;1EA4;;1EA4 >> “ >> >> If you could give me an example of adding a rule for “recursive” case, I can >> do the rest. I am not familiar with this unaccent format generation yet. > > So contrib/unaccent/generate_unaccent_rules.py is a Python script that > takes UnicodeData.txt, a list of information about all Unicode > codepoints available at a URL that is shown in a comment, and > generates unaccent.rules. The idea was to avoid having to change it > manually every time someone finds characters that should be in there > (as you have just done!) by doing it systematically. > > Unicode has two ways to represent characters with accents: either with > composed codepoints like "é" or decomposed codepoints where you say > "e" and then "´". The field "00E2 0301" is the decomposed form of > that character above. Our job here is to identify the basic letter > that each composed character contains, by analysing the decomposed > field that you see in that line. I failed to realise that characters > with TWO accents are described as a composed character with ONE accent > plus another accent. > > You don't have to worry about decoding that line, it's all done in > that Python script. The problem is just in the function > is_letter_with_marks(). Instead of just checking if combining_ids[0] > is a plain letter, it looks like it should also check if > combining_ids[0] itself is a letter with marks. Also get_plain_letter > would need to be able to recurse to extract the "a". > > I hope that helps! 
> > -- > Thomas Munro > http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Extra Vietnamese unaccent rules
Could you explain to me what this line means:

“
1EA5;LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE;Ll;0;L;00E2 0301;;;;N;;;1EA4;;1EA4
”

If you could give me an example of adding a rule for the “recursive”
case, I can do the rest. I am not familiar with this unaccent format
generation yet.

Thanks
Kha

> On 26 May 2017, at 21.19, Thomas Munro wrote:
>
> On Sat, May 27, 2017 at 5:13 AM, Tom Lane wrote:
>> I wrote:
>>> Nguyen Le Hoang Kha writes:
>>>> Most of the time in Vietnamese language, there are up to 2 accents
>>>> in a character. These unaccent rules are added to handle such cases
>>>> (which are very common).
>>
>>> I can't see any reason not to add these --- any objections out there?
>>
>> Oh, wait a minute. Patching unaccent.rules directly isn't the way
>> to do this; that file is supposed to be generated by
>> generate_unaccent_rules.py. Can you see how to modify that script
>> to produce these rules?
>
> Looking at one example from this patch:
>
> UTF8:
> Codepoint: 1EA5
> Name: LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE
>
> In UnicodeData.txt it's this line:
>
> 1EA5;LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE;Ll;0;L;00E2 0301;;;;N;;;1EA4;;1EA4
>
> The problem is that generate_unaccent_rules.py assumes that the
> composing data is a plain letter followed by some number of
> diacritical modifiers. That's true for the characters with a single
> accent, but in this multi-accent case it's *composed* character 00E2
> (LATIN SMALL LETTER A WITH CIRCUMFLEX) and a diacritical marker 0301
> (COMBINING ACUTE ACCENT). So we need to teach it to be recursive.
>
> --
> Thomas Munro
> http://www.enterprisedb.com
Re: [HACKERS] Extra Vietnamese unaccent rules
On Fri, May 26, 2017 at 5:48 PM, Thomas Munro wrote:
> Unicode has two ways to represent characters with accents: either with
> composed codepoints like "é" or decomposed codepoints where you say
> "e" and then "´". The field "00E2 0301" is the decomposed form of
> that character above. Our job here is to identify the basic letter
> that each composed character contains, by analysing the decomposed
> field that you see in that line. I failed to realise that characters
> with TWO accents are described as a composed character with ONE accent
> plus another accent.

Doesn't that depend on the NF operation you are working on? With a
canonical decomposition it seems to me that a character with two
accents can just as well be decomposed into one base character and two
combining accent characters (NFKC does a canonical decomposition in
one of its steps).

> You don't have to worry about decoding that line, it's all done in
> that Python script. The problem is just in the function
> is_letter_with_marks(). Instead of just checking if combining_ids[0]
> is a plain letter, it looks like it should also check if
> combining_ids[0] itself is a letter with marks. Also get_plain_letter
> would need to be able to recurse to extract the "a".

Actually, with the recent work that has been done with
unicode_norm_table.h which has been to transpose UnicodeData.txt into
user-friendly tables, shouldn't the python script of unaccent/ be
replaced by something that works on this table? This does a canonical
decomposition but just keeps the first characters with a class
ordering of 0. So we have basic APIs able to look at UnicodeData.txt
and let caller do decision making with the result returned.

--
Michael
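The point about full canonical decomposition can be seen with Python's standard `unicodedata` module. The sketch below is only an illustration of the idea ("decompose, then keep the characters of combining class 0"); it is not what generate_unaccent_rules.py or unicode_norm_table.h does, and the helper name `strip_marks` is made up for the example.

```python
import unicodedata


def strip_marks(text):
    """Fully decompose (NFD), then keep only characters of combining class 0."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.combining(ch) == 0)


# U+1EA5 decomposes all the way down to a base letter plus two marks.
decomposed = unicodedata.normalize("NFD", "\u1ea5")
print(["U+%04X" % ord(ch) for ch in decomposed])  # ['U+0061', 'U+0302', 'U+0301']

print(strip_marks("Vi\u1ec7t Nam"))  # Viet Nam
```

The single-step mapping stored in UnicodeData.txt (00E2 0301) and the fully decomposed form (0061 0302 0301) describe the same character; the normalization form applies the mapping recursively, which is the step the script was missing.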
Re: [HACKERS] Extra Vietnamese unaccent rules
On Sat, May 27, 2017 at 9:09 AM, Kha Nguyen wrote:
> Could you explain to me what this line means:
> “
> 1EA5;LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE;Ll;0;L;00E2 0301;;;;N;;;1EA4;;1EA4
> ”
>
> If you could give me an example of adding a rule for “recursive” case, I can
> do the rest. I am not familiar with this unaccent format generation yet.

So contrib/unaccent/generate_unaccent_rules.py is a Python script that takes UnicodeData.txt (a list of information about all Unicode codepoints, available at a URL that is shown in a comment) and generates unaccent.rules. The idea was to avoid having to change it manually every time someone finds characters that should be in there (as you have just done!) by doing it systematically.

Unicode has two ways to represent characters with accents: either with composed codepoints like "é" or decomposed codepoints where you say "e" and then "´". The field "00E2 0301" is the decomposed form of that character above. Our job here is to identify the basic letter that each composed character contains, by analysing the decomposed field that you see in that line. I failed to realise that characters with TWO accents are described as a composed character with ONE accent plus another accent.

You don't have to worry about decoding that line, it's all done in that Python script. The problem is just in the function is_letter_with_marks(). Instead of just checking if combining_ids[0] is a plain letter, it looks like it should also check if combining_ids[0] itself is a letter with marks. Also get_plain_letter would need to be able to recurse to extract the "a".

I hope that helps!

--
Thomas Munro
http://www.enterprisedb.com
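For readers following along, the recursive check described above could look roughly like the sketch below. This is not the code in generate_unaccent_rules.py: the Codepoint tuple, the hand-built TABLE and the exact function signatures are assumptions made for illustration, and only the names is_letter_with_marks, get_plain_letter and combining_ids come from the message itself.

```python
from collections import namedtuple

# Hypothetical stand-in for the table the script builds from UnicodeData.txt.
Codepoint = namedtuple("Codepoint", ["id", "general_category", "combining_ids"])

TABLE = {
    0x0061: Codepoint(0x0061, "Ll", []),                # a
    0x0301: Codepoint(0x0301, "Mn", []),                # combining acute accent
    0x0302: Codepoint(0x0302, "Mn", []),                # combining circumflex accent
    0x00E2: Codepoint(0x00E2, "Ll", [0x0061, 0x0302]),  # a with circumflex
    0x1EA5: Codepoint(0x1EA5, "Ll", [0x00E2, 0x0301]),  # a with circumflex and acute
}

def is_plain_letter(cp):
    """True for an unaccented ASCII letter."""
    return ord("a") <= cp.id <= ord("z") or ord("A") <= cp.id <= ord("Z")

def is_mark(cp):
    """True for a combining mark (general categories Mn, Me, Mc)."""
    return cp.general_category in ("Mn", "Me", "Mc")

def is_letter_with_marks(cp, table):
    """True if cp is a letter whose decomposition is a base followed by marks,
    where the base is a plain letter or, recursively, a letter with marks."""
    if not cp.general_category.startswith("L"):
        return False
    if len(cp.combining_ids) < 2:
        return False
    if not all(is_mark(table[i]) for i in cp.combining_ids[1:]):
        return False
    base = table[cp.combining_ids[0]]
    return is_plain_letter(base) or is_letter_with_marks(base, table)

def get_plain_letter(cp, table):
    """Follow combining_ids[0] until a plain letter is reached."""
    assert is_letter_with_marks(cp, table)
    base = table[cp.combining_ids[0]]
    if is_plain_letter(base):
        return base
    return get_plain_letter(base, table)

plain = get_plain_letter(TABLE[0x1EA5], TABLE)
print("%04X -> %s" % (0x1EA5, chr(plain.id)))
```

The recursion ends because each step moves to a codepoint with a shorter decomposition chain, finishing at a letter with no decomposition at all.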
Re: [HACKERS] Extra Vietnamese unaccent rules
On Sat, May 27, 2017 at 5:13 AM, Tom Lane wrote:
> I wrote:
>> Nguyen Le Hoang Kha writes:
>>> Most of the time in Vietnamese language, there are up to 2 accents in a
>>> character. These unaccent rules are added to handle such cases (which are
>>> very common).
>
>> I can't see any reason not to add these --- any objections out there?
>
> Oh, wait a minute. Patching unaccent.rules directly isn't the way
> to do this; that file is supposed to be generated by
> generate_unaccent_rules.py. Can you see how to modify that script
> to produce these rules?

Looking at one example from this patch:

UTF8:
Codepoint: 1EA5
Name: LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE

In UnicodeData.txt it's this line:

1EA5;LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE;Ll;0;L;00E2 0301;;;;N;;;1EA4;;1EA4

The problem is that generate_unaccent_rules.py assumes that the composing data is a plain letter followed by some number of diacritical modifiers. That's true for the characters with a single accent, but in this multi-accent case it's *composed* character 00E2 (LATIN SMALL LETTER A WITH CIRCUMFLEX) and a diacritical marker 0301 (COMBINING ACUTE ACCENT). So we need to teach it to be recursive.

--
Thomas Munro
http://www.enterprisedb.com
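As an aside, a UnicodeData.txt record like the one above has 15 semicolon-separated fields, and the sixth is the decomposition field being discussed. The following is a minimal sketch of pulling that record apart. The field labels in FIELDS and the helper parse_unicode_data_line are names chosen here for illustration, not identifiers from generate_unaccent_rules.py.

```python
# Illustrative labels for the 15 fields of a UnicodeData.txt record.
FIELDS = [
    "codepoint", "name", "general_category", "combining_class", "bidi_class",
    "decomposition", "decimal", "digit", "numeric", "mirrored",
    "unicode_1_name", "iso_comment", "uppercase", "lowercase", "titlecase",
]

def parse_unicode_data_line(line):
    """Split one record into a dict and decode the decomposition field."""
    values = line.rstrip("\n").split(";")
    if len(values) != len(FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(FIELDS), len(values)))
    record = dict(zip(FIELDS, values))
    # Skip compatibility tags such as "<compat>"; keep the hex codepoints.
    record["combining_ids"] = [
        int(token, 16)
        for token in record["decomposition"].split()
        if not token.startswith("<")
    ]
    return record

LINE = ("1EA5;LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE;"
        "Ll;0;L;00E2 0301;;;;N;;;1EA4;;1EA4")

record = parse_unicode_data_line(LINE)
print(record["general_category"])                     # Ll
print(["%04X" % i for i in record["combining_ids"]])  # ['00E2', '0301']
```

The first decoded ID, 00E2, is itself a composed letter rather than a plain "a", which is why a single non-recursive check misses this character.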
Re: [HACKERS] Extra Vietnamese unaccent rules
I wrote:
> Nguyen Le Hoang Kha writes:
>> Most of the time in Vietnamese language, there are up to 2 accents in a
>> character. These unaccent rules are added to handle such cases (which are
>> very common).

> I can't see any reason not to add these --- any objections out there?

Oh, wait a minute. Patching unaccent.rules directly isn't the way to do this; that file is supposed to be generated by generate_unaccent_rules.py. Can you see how to modify that script to produce these rules?

regards, tom lane
Re: [HACKERS] Extra Vietnamese unaccent rules
Nguyen Le Hoang Kha writes:
> Most of the time in Vietnamese language, there are up to 2 accents in a
> character. These unaccent rules are added to handle such cases (which are
> very common).

I can't see any reason not to add these --- any objections out there?

regards, tom lane
[HACKERS] Extra Vietnamese unaccent rules
Most of the time in Vietnamese language, there are up to 2 accents in a character. These unaccent rules are added to handle such cases (which are very common).

Kha Nguyen | nlhkh@github

vietnamese-unaccent-rules.patch
Description: Binary data