Re: [HACKERS] [PROPOSAL] Text search configuration extension
Hello,

On Fri, Aug 18, 2017 at 03:30:38PM +0300, Aleksandr Parfenov wrote:
> Hello hackers!
>
> I'm working on a new approach in text search configuration and want to
> share my thought with community in order to get some feedback and maybe
> some new ideas.

There are several cases where the new syntax could be useful:

https://www.postgresql.org/message-id/4733b65a.9030...@students.mimuw.edu.pl
First check whether a lexeme is a stop word, and only then normalize it.

https://www.postgresql.org/message-id/c6851b7e-da25-3d8e-a5df-022c395a11b4%40postgrespro.ru
Support a union of the outputs of several dictionaries.

https://www.postgresql.org/message-id/46D57E6F.8020009%40enterprisedb.com
Support a chain of dictionaries using the MAP BY operator.

The basic idea of the approach is to give the user more control over text search configurations without writing additional dictionaries or modifying existing ones.

> ALTER TEXT SEARCH CONFIGURATION en_de_search ADD MAPPING FOR asciiword,
> word WITH
> CASE
>    WHEN english_hunspell IS NOT NULL THEN english_hunspell
>    WHEN german_hunspell IS NOT NULL THEN german_hunspell
>    ELSE
>      -- stem dictionaries can't be used for language detection
>      english_stem UNION german_stem
> END;

For example, the configuration mentioned above will produce the following results:

=# select d @@ q, d, q from to_tsvector('german_hunspell', 'Dieser Hund wollte ihn jedoch nicht nach Hause begleiten') d, to_tsquery('en_de_search', 'hause') q;
 ?column? |                      d                       |    q
----------+----------------------------------------------+----------
 t        | 'begleiten':9 'hausen':8 'hund':2 'jedoch':5 | 'hausen'
(1 row)

This configuration is useful when the language of a query is unknown.

Best regards,

--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] [PROPOSAL] Text search configuration extension
Hello hackers!

I'm working on a new approach to text search configuration and want to share my thoughts with the community in order to get some feedback and maybe some new ideas.

Currently, the text search engine in Postgres cannot be configured for some useful scenarios, such as multi-language search, or exact and morphological search in one configuration. Additionally, a dictionary cannot be used as a filter dictionary unless that use was taken into consideration when the dictionary was developed. I would also like to separate the configuration of result-set building from the configuration of command selection. Last but not least, the goal is to keep backward compatibility, in terms of both syntax and behavior, in the scenarios that are available today.

In order to meet these goals I propose the following syntax for text search configurations (the current syntax can still be used as well):

    ALTER TEXT SEARCH CONFIGURATION <config_name> ADD/ALTER MAPPING FOR <token_types> WITH
    CASE
        WHEN <condition> THEN <command>
        <...>
        [ELSE <command>]
    END;

A <condition> is an expression with dictionary names as operands and the boolean operators AND, OR and NOT. Additionally, a dictionary name may be followed by a check-option for its result: IS [NOT] NULL or IS [NOT] STOP. If a dictionary has no check-option, it is evaluated as:

    dict IS NOT NULL and dict IS NOT STOP

A <command> is an expression on sets of lexemes that supports the operators UNION, EXCEPT, INTERSECT and MAP BY. The MAP BY operator is a way to configure filter dictionaries: the output of the right-hand subexpression is used as the input of the left-hand subexpression. In other words, the MAP BY operator is used instead of output flagged with TSL_FILTER.
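To make the intended semantics easier to discuss, here is a minimal Python sketch of how a CASE mapping with check-options, set operators and MAP BY might be evaluated for a single token. It is only a model, not the proposed implementation. All names in it (make_dict, accepts, map_by, evaluate_case and the toy dictionaries) are hypothetical. It assumes that a dictionary returns None for an unknown token (NULL), an empty list for a stop word, and a list of lexemes otherwise, as ts_lexize does. It also assumes that MAP BY passes the token through unchanged when the right-hand dictionary returns NULL, and that a stop-word result on the right-hand side yields an empty set; the proposal does not specify the latter case.

```python
from typing import Callable, Dict, List, Optional, Set

# Toy model of a dictionary: None = unknown token (NULL),
# [] = stop word, non-empty list = normalized lexemes.
Dictionary = Callable[[str], Optional[List[str]]]


def make_dict(entries: Dict[str, List[str]]) -> Dictionary:
    """Build a hypothetical lookup dictionary from a plain mapping."""
    return entries.get


def is_null(d: Dictionary, token: str) -> bool:
    return d(token) is None


def is_stop(d: Dictionary, token: str) -> bool:
    return d(token) == []


def accepts(d: Dictionary, token: str) -> bool:
    """Default check: dict IS NOT NULL AND dict IS NOT STOP."""
    return not is_null(d, token) and not is_stop(d, token)


def lexemes(d: Dictionary, token: str) -> Set[str]:
    """Output of a dictionary as a set (empty for NULL or stop word)."""
    return set(d(token) or [])


def map_by(left: Dictionary, right: Dictionary, token: str) -> Set[str]:
    """left MAP BY right: the output of right becomes the input of left.

    Assumption: a NULL result from right leaves the token unchanged.
    Each lexeme produced by right is processed independently.
    """
    inputs = right(token)
    if inputs is None:
        inputs = [token]
    result: Set[str] = set()
    for lexeme in inputs:
        result |= lexemes(left, lexeme)
    return result


def evaluate_case(branches, default, token: str) -> Set[str]:
    """Return the command result of the first branch whose condition holds."""
    for condition, command in branches:
        if condition(token):
            return command(token)
    return default(token) if default is not None else set()


# Hypothetical toy dictionaries standing in for real ones.
english = make_dict({"dogs": ["dog"], "the": []})
german = make_dict({"hunde": ["hund"], "hauser": ["haus"]})
unaccent = make_dict({"häuser": ["hauser"]})
stem_en: Dictionary = lambda t: [t.rstrip("s")]
stem_de: Dictionary = lambda t: [t.rstrip("e")]


def en_de(token: str) -> Set[str]:
    # CASE WHEN english IS NOT NULL THEN english
    #      WHEN german IS NOT NULL THEN german
    #      ELSE stem_en UNION stem_de END
    return evaluate_case(
        branches=[
            (lambda t: not is_null(english, t), lambda t: lexemes(english, t)),
            (lambda t: not is_null(german, t), lambda t: lexemes(german, t)),
        ],
        default=lambda t: lexemes(stem_en, t) | lexemes(stem_de, t),
        token=token,
    )
```

In this model EXCEPT and INTERSECT would map onto Python's set difference and intersection in the same way that UNION maps onto `|`. A condition without a check-option would call accepts() instead of testing is_null() alone.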
An example of a configuration for both English and German search:

    ALTER TEXT SEARCH CONFIGURATION en_de_search ADD MAPPING FOR asciiword, word WITH
    CASE
        WHEN english_hunspell IS NOT NULL THEN english_hunspell
        WHEN german_hunspell IS NOT NULL THEN german_hunspell
        ELSE
            -- stem dictionaries can't be used for language detection
            english_stem UNION german_stem
    END;

And an example with unaccent:

    ALTER TEXT SEARCH CONFIGURATION german_unaccent ADD MAPPING FOR asciiword, word WITH
    CASE
        WHEN german_hunspell IS NOT NULL THEN german_hunspell MAP BY unaccent
        ELSE german_stem MAP BY unaccent
    END;

In the last example the input for german_hunspell is replaced by the output of unaccent if that output is not NULL. If a dictionary returns more than one lexeme, each lexeme is processed independently.

I'm not sure whether we should provide the ability to use the MAP BY operator in a condition, since MAP BY operates on sets and a condition is a boolean expression. I am inclined to allow it, with the restriction that such a subexpression must be placed inside parentheses and followed by a check-option. Something like:

    (german_hunspell MAP BY unaccent) IS NOT NULL

This type of check can be useful in some situations, but the set-related subexpression should be isolated.

--
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company