Re: [HACKERS] Filtering dictionaries support and unaccent dictionary
Thanks, Bruce!

Oleg

On Thu, 13 Aug 2009, Bruce Momjian wrote:

> Peter Eisentraut wrote:
>> On Thursday 13 August 2009 18:07:51 Alvaro Herrera wrote:
>>> Oleg Bartunov wrote:
>>>> Peter, how do you write accented characters in SGML? Isn't it
>>>> allowed to write them as is?
>>>
>>> &aacute; for á, etc. You can't use characters that aren't in
>>> Latin-1, I think. Writing them literally is not allowed.
>>
>> It's somehow possible, but it's not as straightforward as, say, with
>> XML. And you might get into a Latin-1 vs UTF-8 mixup. At least that's
>> what I noticed in my limited testing the other day.
>
> The top of release.sgml has instructions on that because that is often
> something we need to do for names in release notes:
>
> non-ASCII characters    convert to HTML4 entity (&) escapes
>
>     official:     http://www.w3.org/TR/html4/sgml/entities.html
>     one page:     http://www.zipcon.net/~swhite/docs/computers/browsers/entities_page.html
>     other lists:  http://www.zipcon.net/~swhite/docs/computers/browsers/entities.html
>                   http://www.zipcon.net/~swhite/docs/computers/browsers/entities_page.html
>                   http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
>
>     we cannot use UTF8 because SGML Docbook does not support it
>         http://www.pemberley.com/janeinfo/latin1.html#latexta

Regards,
Oleg

Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Filtering dictionaries support and unaccent dictionary
On Tue, Aug 11, 2009 at 4:31 AM, Peter Eisentraut pete...@gmx.net wrote:

> On Tuesday 11 August 2009 08:28:24 Jaime Casanova wrote:
>> I tried to build the docs to see how to properly test this, and it
>> seems you have to teach contrib.sgml and bookindex.sgml about
>> dict-unaccent... and when I did that I got this:
>>
>> openjade -wall -wno-unused-param -wno-empty -wfully-tagged -D . -c /usr/share/sgml/docbook/stylesheet/dsssl/modular/catalog -d stylesheet.dsl -t sgml -i output-html -V html-index postgres.sgml
>> openjade:dict-unaccent.sgml:48:1:E: non SGML character number 128
>> openjade:dict-unaccent.sgml:49:1:E: non SGML character number 129
>> openjade:dict-unaccent.sgml:50:1:E: non SGML character number 130
>> openjade:dict-unaccent.sgml:51:1:E: non SGML character number 131
>> openjade:dict-unaccent.sgml:52:1:E: non SGML character number 132
>> openjade:dict-unaccent.sgml:53:1:E: non SGML character number 133
>> openjade:dict-unaccent.sgml:54:1:E: non SGML character number 134
>> openjade:dict-unaccent.sgml:116:4:E: element B undefined
>> make: *** [HTML.index] Error 1
>> make: *** Se borra el archivo `HTML.index'
>
> You should escape the special characters, as well as the <b> that
> appears as part of the example output, using character entities
> (&amp; etc.).

Sounds like this patch needs a little bit of doc adjustment per the above and is then ready for committer?

...Robert
Re: [HACKERS] Filtering dictionaries support and unaccent dictionary
Peter, how do you write accented characters in SGML? Isn't it allowed to write them as is?

Oleg

On Tue, 11 Aug 2009, Peter Eisentraut wrote:

> On Tuesday 11 August 2009 08:28:24 Jaime Casanova wrote:
>> I tried to build the docs to see how to properly test this, and it
>> seems you have to teach contrib.sgml and bookindex.sgml about
>> dict-unaccent... and when I did that I got this:
>>
>> openjade -wall -wno-unused-param -wno-empty -wfully-tagged -D . -c /usr/share/sgml/docbook/stylesheet/dsssl/modular/catalog -d stylesheet.dsl -t sgml -i output-html -V html-index postgres.sgml
>> openjade:dict-unaccent.sgml:48:1:E: non SGML character number 128
>> openjade:dict-unaccent.sgml:49:1:E: non SGML character number 129
>> openjade:dict-unaccent.sgml:50:1:E: non SGML character number 130
>> openjade:dict-unaccent.sgml:51:1:E: non SGML character number 131
>> openjade:dict-unaccent.sgml:52:1:E: non SGML character number 132
>> openjade:dict-unaccent.sgml:53:1:E: non SGML character number 133
>> openjade:dict-unaccent.sgml:54:1:E: non SGML character number 134
>> openjade:dict-unaccent.sgml:116:4:E: element B undefined
>> make: *** [HTML.index] Error 1
>> make: *** Se borra el archivo `HTML.index'
>
> You should escape the special characters, as well as the <b> that
> appears as part of the example output, using character entities
> (&amp; etc.).

Regards,
Oleg
Re: [HACKERS] Filtering dictionaries support and unaccent dictionary
Oleg Bartunov wrote:

> Peter, how do you write accented characters in SGML? Isn't it allowed
> to write them as is?

&aacute; for á, etc. You can't use characters that aren't in Latin-1, I think. Writing them literally is not allowed.

--
Alvaro Herrera                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
Re: [HACKERS] Filtering dictionaries support and unaccent dictionary
On Thursday 13 August 2009 18:07:51 Alvaro Herrera wrote:

> Oleg Bartunov wrote:
>> Peter, how do you write accented characters in SGML? Isn't it allowed
>> to write them as is?
>
> &aacute; for á, etc. You can't use characters that aren't in Latin-1,
> I think. Writing them literally is not allowed.

It's somehow possible, but it's not as straightforward as, say, with XML. And you might get into a Latin-1 vs UTF-8 mixup. At least that's what I noticed in my limited testing the other day.
Re: [HACKERS] Filtering dictionaries support and unaccent dictionary
Peter Eisentraut wrote:

> On Thursday 13 August 2009 18:07:51 Alvaro Herrera wrote:
>> Oleg Bartunov wrote:
>>> Peter, how do you write accented characters in SGML? Isn't it
>>> allowed to write them as is?
>>
>> &aacute; for á, etc. You can't use characters that aren't in
>> Latin-1, I think. Writing them literally is not allowed.
>
> It's somehow possible, but it's not as straightforward as, say, with
> XML. And you might get into a Latin-1 vs UTF-8 mixup. At least that's
> what I noticed in my limited testing the other day.

The top of release.sgml has instructions on that because that is often something we need to do for names in release notes:

non-ASCII characters    convert to HTML4 entity (&) escapes

    official:     http://www.w3.org/TR/html4/sgml/entities.html
    one page:     http://www.zipcon.net/~swhite/docs/computers/browsers/entities_page.html
    other lists:  http://www.zipcon.net/~swhite/docs/computers/browsers/entities.html
                  http://www.zipcon.net/~swhite/docs/computers/browsers/entities_page.html
                  http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references

    we cannot use UTF8 because SGML Docbook does not support it
        http://www.pemberley.com/janeinfo/latin1.html#latexta

--
Bruce Momjian  br...@momjian.us        http://momjian.us
EnterpriseDB                           http://enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
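For anyone who wants to automate the conversion described in the release.sgml instructions, here is a minimal sketch. It assumes Python 3 and its standard html.entities table of HTML 4 entity names; the helper name to_entities is hypothetical and is not part of the PostgreSQL doc tooling. Falling back to a numeric reference for characters without an HTML 4 name is also an assumption, and whether the SGML toolchain accepts those has not been checked here.

```python
from html.entities import codepoint2name  # HTML 4 code point -> entity name


def to_entities(text):
    """Replace each non-ASCII character with an entity escape.

    Hypothetical helper: named HTML 4 entity when one exists, otherwise a
    decimal numeric reference (assumption, not verified against openjade).
    ASCII characters, including & and <, are left untouched.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)
        elif cp in codepoint2name:
            out.append("&%s;" % codepoint2name[cp])
        else:
            out.append("&#%d;" % cp)
    return "".join(out)


print(to_entities("Álvaro"))
```

Running the output of such a helper through the doc build is still the real test, since it only addresses the "non SGML character" class of errors and not stray markup such as an undeclared element.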
Re: [HACKERS] Filtering dictionaries support and unaccent dictionary
On Tuesday 11 August 2009 08:28:24 Jaime Casanova wrote:

> I tried to build the docs to see how to properly test this, and it
> seems you have to teach contrib.sgml and bookindex.sgml about
> dict-unaccent... and when I did that I got this:
>
> openjade -wall -wno-unused-param -wno-empty -wfully-tagged -D . -c /usr/share/sgml/docbook/stylesheet/dsssl/modular/catalog -d stylesheet.dsl -t sgml -i output-html -V html-index postgres.sgml
> openjade:dict-unaccent.sgml:48:1:E: non SGML character number 128
> openjade:dict-unaccent.sgml:49:1:E: non SGML character number 129
> openjade:dict-unaccent.sgml:50:1:E: non SGML character number 130
> openjade:dict-unaccent.sgml:51:1:E: non SGML character number 131
> openjade:dict-unaccent.sgml:52:1:E: non SGML character number 132
> openjade:dict-unaccent.sgml:53:1:E: non SGML character number 133
> openjade:dict-unaccent.sgml:54:1:E: non SGML character number 134
> openjade:dict-unaccent.sgml:116:4:E: element B undefined
> make: *** [HTML.index] Error 1
> make: *** Se borra el archivo `HTML.index'

You should escape the special characters, as well as the <b> that appears as part of the example output, using character entities (&amp; etc.).
Re: [HACKERS] Filtering dictionaries support and unaccent dictionary
On Fri, Aug 7, 2009 at 10:44 AM, Jaime Casanova jcasa...@systemguards.com.ec wrote:

> On Thu, Aug 6, 2009 at 10:46 PM, Robert Haas robertmh...@gmail.com wrote:
>> I am not sure whether this has been formally reviewed by anyone yet;
>> do we think it's Ready for Committer?
>
> I was trying to review this, but beyond confirming that it compiles
> fine and passes the regression tests, I don't know how to test it.

I tried to build the docs to see how to properly test this, and it seems you have to teach contrib.sgml and bookindex.sgml about dict-unaccent... and when I did that I got this:

openjade -wall -wno-unused-param -wno-empty -wfully-tagged -D . -c /usr/share/sgml/docbook/stylesheet/dsssl/modular/catalog -d stylesheet.dsl -t sgml -i output-html -V html-index postgres.sgml
openjade:dict-unaccent.sgml:48:1:E: non SGML character number 128
openjade:dict-unaccent.sgml:49:1:E: non SGML character number 129
openjade:dict-unaccent.sgml:50:1:E: non SGML character number 130
openjade:dict-unaccent.sgml:51:1:E: non SGML character number 131
openjade:dict-unaccent.sgml:52:1:E: non SGML character number 132
openjade:dict-unaccent.sgml:53:1:E: non SGML character number 133
openjade:dict-unaccent.sgml:54:1:E: non SGML character number 134
openjade:dict-unaccent.sgml:116:4:E: element B undefined
make: *** [HTML.index] Error 1
make: *** Se borra el archivo `HTML.index'

--
Sincerely,
Jaime Casanova
PostgreSQL support and training
Systems consulting and development
Guayaquil - Ecuador
Cel. +59387171157
Re: [HACKERS] Filtering dictionaries support and unaccent dictionary
On Thu, Aug 6, 2009 at 10:46 PM, Robert Haas robertmh...@gmail.com wrote:

> I am not sure whether this has been formally reviewed by anyone yet;
> do we think it's Ready for Committer?

I was trying to review this, but beyond confirming that it compiles fine and passes the regression tests, I don't know how to test it.

--
Sincerely,
Jaime Casanova
PostgreSQL support and training
Systems consulting and development
Guayaquil - Ecuador
Cel. +59387171157
Re: [HACKERS] Filtering dictionaries support and unaccent dictionary
> Isn't that function leaking res pointer?

Fixed.

> Also, I'm curious why you're allocating 2*sizeof(TSLexeme) in
> unaccent_lexize ...

That is part of the dictionary interface: lexize returns an array of TSLexeme, and the last structure in the array should have its lexeme field set to NULL. So a single result needs two entries, the result itself plus the terminator.

The filter_dictionary file is not changed; it's attached only for consistency.

--
Teodor Sigaev                    E-mail: teo...@sigaev.ru
                                 WWW: http://www.sigaev.ru/

unaccent-0.6.gz
Description: Unix tar archive

filter_dictionary-0.1.gz
Description: Unix tar archive
Re: [HACKERS] Filtering dictionaries support and unaccent dictionary
2009/8/6 Teodor Sigaev teo...@sigaev.ru:

>> Isn't that function leaking res pointer?
>
> Fixed.
>
>> Also, I'm curious why you're allocating 2*sizeof(TSLexeme) in
>> unaccent_lexize ...
>
> That is part of the dictionary interface: lexize returns an array of
> TSLexeme, and the last structure should have its lexeme field set to
> NULL.
>
> The filter_dictionary file is not changed; it's attached only for
> consistency.

I am not sure whether this has been formally reviewed by anyone yet; do we think it's Ready for Committer?

Thanks,

...Robert
Re: [HACKERS] Filtering dictionaries support and unaccent dictionary
On Sat, Aug 1, 2009 at 12:35 AM, Alvaro Herrera alvhe...@commandprompt.com wrote:

> Teodor Sigaev wrote:
>>> As for the contrib module, I think it could use a lot more function
>>> header comments! Also, it would be great if it could be used
>>> separately from tsearch, i.e. that it provided a function
>>> unaccent(text) returns text that unaccented arbitrary strings (I
>>> guess it would use the default tsconfig).
>>
>> Umm? Module provides unaccent(text) and unaccent(regdictionary, text)
>> functions.
>
> Sorry, I failed to notice. Looks good.
>
> Isn't that function leaking res pointer? Also, I'm curious why you're
> allocating 2*sizeof(TSLexeme) in unaccent_lexize ...

So are we waiting for an updated version of this patch?

...Robert
Re: [HACKERS] Filtering dictionaries support and unaccent dictionary
Teodor Sigaev wrote:

>> As for the contrib module, I think it could use a lot more function
>> header comments! Also, it would be great if it could be used
>> separately from tsearch, i.e. that it provided a function
>> unaccent(text) returns text that unaccented arbitrary strings (I
>> guess it would use the default tsconfig).
>
> Umm? Module provides unaccent(text) and unaccent(regdictionary, text)
> functions.

Sorry, I failed to notice. Looks good.

Isn't that function leaking res pointer? Also, I'm curious why you're allocating 2*sizeof(TSLexeme) in unaccent_lexize ...

--
Alvaro Herrera                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
Re: [HACKERS] Filtering dictionaries support and unaccent dictionary
On Tuesday 14 July 2009 22:12:28 Oleg Bartunov wrote:

> we'd like to introduce filtering dictionaries support for text search
> and a new contrib module, unaccent, which provides a useful example of
> a filtering dictionary. It finally solves the known problem of
> incorrect generation of headlines of text with accents.

What is the source of the unaccent rules, and how complete is the rule set?
Re: [HACKERS] Filtering dictionaries support and unaccent dictionary
On Wed, 29 Jul 2009, Peter Eisentraut wrote:

> On Tuesday 14 July 2009 22:12:28 Oleg Bartunov wrote:
>> we'd like to introduce filtering dictionaries support for text search
>> and a new contrib module, unaccent, which provides a useful example
>> of a filtering dictionary. It finally solves the known problem of
>> incorrect generation of headlines of text with accents.
>
> What is the source of the unaccent rules, and how complete is the rule
> set?

Unicode tables from unicode.org. It'd be nice if someone checked the completeness.

Regards,
Oleg
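One way to cross-check a rule set against the Unicode data is to regenerate candidate rules from the canonical decompositions and compare. The following is only a sketch of that idea, assuming Python 3's unicodedata module; it is not how the rules shipped with the module were produced, and the names unaccent_char and build_rules, as well as the default code point range, are hypothetical.

```python
import unicodedata


def unaccent_char(ch):
    """Return ch with the combining marks of its canonical decomposition removed."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def build_rules(first=0x00C0, last=0x017F):
    """Collect accented -> plain pairs for an (assumed) code point range.

    Only characters whose stripped form differs from the original are kept.
    """
    rules = {}
    for cp in range(first, last + 1):
        ch = chr(cp)
        plain = unaccent_char(ch)
        if plain and plain != ch:
            rules[ch] = plain
    return rules


rules = build_rules()
print(len(rules), rules["é"], rules["Ž"])
```

Characters that carry no canonical decomposition, such as ø, do not come out of this approach at all, so a comparison like this can show which rules would have to be maintained by hand.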
Re: [HACKERS] Filtering dictionaries support and unaccent dictionary
> I'm curious about the pg_regress change ... is it really necessary?

To test the unaccent dictionary you need to input accented characters, and not all encodings allow that. UTF8 allows it, but it isn't compatible with a lot of locales. So --no-locale should be propagated to the CREATE DATABASE command, as is done for encoding.

> AFAICS the changes to the core code are very small; I wonder if you
> should commit it separately i.e. without the contrib module, and add
> that one in another commit.

I split the patch into two parts:

filter_dictionary-0.1.gz - core changes, including pg_regress changes
unaccent-0.5.gz - contrib module

Also, I added some comments to the code and made cosmetic changes in the docs.

> As for the contrib module, I think it could use a lot more function
> header comments! Also, it would be great if it could be used
> separately from tsearch, i.e. that it provided a function
> unaccent(text) returns text that unaccented arbitrary strings (I guess
> it would use the default tsconfig).

Umm? Module provides unaccent(text) and unaccent(regdictionary, text) functions.

--
Teodor Sigaev                    E-mail: teo...@sigaev.ru
                                 WWW: http://www.sigaev.ru/

unaccent-0.5.gz
Description: Unix tar archive

filter_dictionary-0.1.gz
Description: Unix tar archive
Re: [HACKERS] Filtering dictionaries support and unaccent dictionary
Oleg Bartunov wrote:

> Hi, we'd like to introduce filtering dictionaries support for text
> search and a new contrib module, unaccent, which provides a useful
> example of a filtering dictionary. It finally solves the known problem
> of incorrect generation of headlines of text with accents.

I'm curious about the pg_regress change ... is it really necessary?

AFAICS the changes to the core code are very small; I wonder if you should commit it separately i.e. without the contrib module, and add that one in another commit.

As for the contrib module, I think it could use a lot more function header comments! Also, it would be great if it could be used separately from tsearch, i.e. that it provided a function unaccent(text) returns text that unaccented arbitrary strings (I guess it would use the default tsconfig).

--
Alvaro Herrera                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
[HACKERS] Filtering dictionaries support and unaccent dictionary
Hi there,

we'd like to introduce filtering dictionaries support for text search and a new contrib module, unaccent, which provides a useful example of a filtering dictionary. It finally solves the known problem of incorrect generation of headlines of text with accents. Also, this module provides unaccent() functions, which are simple wrappers around the unaccent dictionary.

Regards,
Oleg

PS. I hope it's not too late for the July commitfest!

unaccent.gz
Description: Binary data