Re: [HACKERS] IF NOT EXISTS option for CREATE SERVER and CREATE USER MAPPING statements
On 13.03.2017 13:01, Anastasia Lubennikova wrote:
> Thanks for catching that. It was caused by a conflict while applying the patch. Updated versions of both patches are attached.

I think the code is good and the patches are small. The documentation is updated by the patches, and all regression tests pass. I have marked the patch as "Ready for Committer".

-- 
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
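For context, a minimal sketch of the syntax these patches add, following the existing IF NOT EXISTS pattern of other CREATE commands (the exact NOTICE behavior on a duplicate is my assumption, mirroring commands like CREATE TABLE IF NOT EXISTS, and postgres_fdw/myserver are just placeholder names):

```sql
-- First invocation creates the objects as usual:
CREATE SERVER IF NOT EXISTS myserver FOREIGN DATA WRAPPER postgres_fdw;
CREATE USER MAPPING IF NOT EXISTS FOR current_user SERVER myserver;

-- A repeated invocation should no longer raise an error;
-- presumably a NOTICE is emitted and the statement is skipped:
CREATE SERVER IF NOT EXISTS myserver FOREIGN DATA WRAPPER postgres_fdw;
```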
Re: [HACKERS] IF NOT EXISTS option for CREATE SERVER and CREATE USER MAPPING statements
On 15.02.2017 20:54, Anastasia Lubennikova wrote:
> Done.

I have gotten the error that AlterUserMappingStmt doesn't have if_not_exists (the compiler output was in Russian; translated to English):

gram.y: In function 'base_yyparse':
gram.y:4918:7: error: 'AlterUserMappingStmt {aka struct AlterUserMappingStmt}' has no member named 'if_not_exists'
   n->if_not_exists = false;
     ^~

After applying the CREATE USER patch, in gram.y I have:

AlterUserMappingStmt: ALTER USER MAPPING FOR auth_ident SERVER name alter_generic_options
				{
					AlterUserMappingStmt *n = makeNode(AlterUserMappingStmt);
					n->user = $5;
					n->servername = $7;
					n->options = $8;
					n->if_not_exists = false;
					$$ = (Node *) n;
				}
			| CREATE USER MAPPING IF_P NOT EXISTS FOR auth_ident SERVER name create_generic_options
				{
					CreateUserMappingStmt *n = makeNode(CreateUserMappingStmt);
					n->user = $8;
					n->servername = $10;
					n->options = $11;
					n->if_not_exists = true;
					$$ = (Node *) n;
				}
		;

Here the ALTER USER MAPPING and CREATE USER MAPPING commands were mixed together.

-- 
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Re: [HACKERS] Should buffer of initialization fork have a BM_PERMANENT flag
On 10.03.2017 04:00, Michael Paquier wrote:
> On Thu, Mar 9, 2017 at 10:25 PM, Artur Zakirov wrote:
>> I think these are good fixes. I've checked them, and in my opinion they are correct. The code is also good.
>
> Having something with conflicts is not nice, so attached is a rebased version.

Thank you! I've rerun the regression and TAP tests. They all passed.

It would also be good to fix a couple of comments, because BM_PERMANENT is now used for the init forks of unlogged indexes as well. In buf_internals.h:

#define BM_PERMANENT			(1U << 31)	/* permanent relation (not
											 * unlogged) */

And in FlushBuffer():

	/*
	 * Force XLOG flush up to buffer's LSN.  This implements the basic WAL
	 * rule that log updates must hit disk before any of the data-file changes
	 * they describe do.
	 *
	 * However, this rule does not apply to unlogged relations, which will be
	 * lost after a crash anyway.  Most unlogged relation pages do not bear

-- 
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Re: [HACKERS] Should buffer of initialization fork have a BM_PERMANENT flag
Hello,

I wanted to review the patch, but it applied with errors. I've rebased a local copy and reviewed that. I'm not sure whether it is proper for a reviewer to send a rebased patch, so I haven't sent it, to avoid confusion.

On 29.01.2017 17:00, Michael Paquier wrote:
> Attached is what I have in mind for HEAD. btree, gist, spgist and bloom indexes are changed so that the init forks created go through the shared buffers instead of having their empty() routines handle the flush of the page created. This removes any kind of race condition between the checkpointer and the init fork creations, which is I think a good thing.

I think these are good fixes. I've checked them, and in my opinion they are correct. The code is also good.

> Here are the tests I have done. First running those commands to create all types of indexes:
>
> create extension bloom;
> create extension btree_gist;
> create extension btree_gin;
> create unlogged table foo (a int);
> create index foo_bt on foo(a);
> create index foo_bloom on foo using bloom(a);
> create index foo_gin on foo using gin (a);
> create index foo_gist on foo using gist (a);
> create index foo_brin on foo using brin (a);
> create unlogged table foo_geo (a box);
> create index foo_spgist ON foo_geo using spgist(a);
> checkpoint;
>
> Then crash the server, restart it, and the following vacuums are able to complete:
>
> vacuum foo;
> vacuum foo_geo;

I've done these tests. Before the patch the server crashes on the vacuum command; after applying the patch it doesn't. I have also run the regression and TAP tests. They all passed without error.

I think the patch can be marked as "Ready for Committer" after a rebase.

-- 
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Re: [HACKERS] Bug in to_timestamp().
On 15.02.2017 15:26, Amul Sul wrote:
> The following review has been posted through the commitfest application:
> make installcheck-world: not tested
> Implements feature: tested, passed
> Spec compliant: not tested
> Documentation: tested, passed
>
> Review of v7 patch:
> - Patch applies to the top of master HEAD cleanly & make check-world is also fine.
> - Tom Lane's review comments are fixed.
>
> The new status of this patch is: Ready for Committer

Thank you for your review!

-- 
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Re: [HACKERS] [PATCH] Generic type subscription
> Yes, but it was related to the idea of having `ArrayRef` and `JsonbRef` nodes
> for specific types. Since now there is a generic `SubscriptingRef` node, I think
> it should be ok.

Sorry, I misunderstood it.

> Just to be clear - as far as I understood, these compilation problems were
> caused not because the extension knew something about the ArrayRef node in
> particular, but because the extension tried to extract all nodes to generate
> code from them. It means any change will require "refetching", so I think it's
> natural for this extension.

Agreed. It would be hard to maintain both nodes, and it is not very smart to have two nodes, ArrayRef (deprecated) and SubscriptingRef. It is not hard to replace the ArrayRef node with SubscriptingRef in other extensions.

One small note:

>  #include "utils/lsyscache.h"
> +#include "utils/syscache.h"
>  #include "utils/memutils.h"

I think "utils/syscache.h" isn't necessary here. PostgreSQL compiles without this include.

I suppose this patch can be marked as "Ready for Committer". Any opinions?

-- 
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Re: [HACKERS] [PATCH] Generic type subscription
2016-12-27 14:42 GMT+05:00 Dmitry Dolgov <9erthali...@gmail.com>:
>> On 27 December 2016 at 16:09, Aleksander Alekseev wrote:
>> until it breaks existing extensions.
>
> Hm... I already answered that I managed to avoid compilation problems for this particular extension using the `genparser` command again:

I suppose that a separate node type could solve it. But I'm not sure how to distinguish the ArrayRef node from the new SubscriptingRef node. Maybe it could be done in the transformIndirection() function, if I understand everything correctly.

Also, Tom pointed out that he had a bad experience with reusing the ArrayRef node:
https://www.postgresql.org/message-id/518.1439846343%40sss.pgh.pa.us

> No. Make a new expression node type.
>
> (Salesforce did something similar for an internal feature, and it was a
> disaster both for code modularity and performance. We had to change it to
> a separate node type, which I just got finished doing. Don't go down that
> path. While you're at it, I'd advise that fetch and assignment be two
> different node types rather than copying ArrayRef's bad precedent of using
> only one.)

-- 
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Re: [HACKERS] [PATCH] Generic type subscription
2016-12-26 18:49 GMT+03:00 Dmitry Dolgov <9erthali...@gmail.com>:
>> On 5 December 2016 at 12:03, Haribabu Kommi wrote:
>> Moved to next CF with "needs review" status.
>
> Looks like we are stuck here a little bit. Does anyone else have any suggestions/improvements, or is this patch in good enough shape?

Would you rebase the patch, please? It seems to be necessary; the patch can't be applied now.

-- 
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Re: [HACKERS] [BUG?] pg_event_trigger_ddl_commands() error with ALTER TEXT SEARCH CONFIGURATION
Thank you for your comments, Stephen.

2016-12-21 20:34 GMT+03:00 Stephen Frost:
> Did you happen to look at adding a regression test for this to
> test_ddl_deparse?

Of course. I updated the patch.

>> This patch only fixes the bug. But I think I also can do a patch which
>> will give pg_ts_config_map entries with
>> pg_event_trigger_ddl_commands() if someone wants. It can be done using
>> new entry in the CollectedCommandType structure maybe.
>
> While that sounds like a good idea, it seems like it's more a feature
> addition rather than a bugfix, no?

Yes, I agree with you. It should be added as a separate patch. I think I will deal with it.

-- 
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

diff --git a/src/backend/commands/tsearchcmds.c b/src/backend/commands/tsearchcmds.c
index b24011371c..2b84848cf5 100644
--- a/src/backend/commands/tsearchcmds.c
+++ b/src/backend/commands/tsearchcmds.c
@@ -1215,10 +1215,10 @@ AlterTSConfiguration(AlterTSConfigurationStmt *stmt)
 	/* Update dependencies */
 	makeConfigurationDependencies(tup, true, relMap);
 
-	InvokeObjectPostAlterHook(TSConfigMapRelationId,
+	InvokeObjectPostAlterHook(TSConfigRelationId,
 							  HeapTupleGetOid(tup), 0);
 
-	ObjectAddressSet(address, TSConfigMapRelationId, cfgId);
+	ObjectAddressSet(address, TSConfigRelationId, cfgId);
 
 	heap_close(relMap, RowExclusiveLock);
 
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index fd4eff4907..b7a4c8e531 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -1479,7 +1479,8 @@ ProcessUtilitySlow(ParseState *pstate,
 				break;
 
 			case T_AlterTSConfigurationStmt:
-				address = AlterTSConfiguration((AlterTSConfigurationStmt *) parsetree);
+				AlterTSConfiguration((AlterTSConfigurationStmt *) parsetree);
+				commandCollected = true;
 				break;
 
 			case T_AlterTableMoveAllStmt:
diff --git a/src/test/modules/test_ddl_deparse/Makefile b/src/test/modules/test_ddl_deparse/Makefile
index 8ea6f39afd..3a57a95c84 100644
--- a/src/test/modules/test_ddl_deparse/Makefile
+++ b/src/test/modules/test_ddl_deparse/Makefile
@@ -23,6 +23,7 @@ REGRESS = test_ddl_deparse \
 	comment_on \
 	alter_function \
 	alter_sequence \
+	alter_ts_config \
 	alter_type_enum \
 	opfamily \
 	defprivs \
diff --git a/src/test/modules/test_ddl_deparse/expected/alter_ts_config.out b/src/test/modules/test_ddl_deparse/expected/alter_ts_config.out
new file mode 100644
index 00..afc352fc5f
--- /dev/null
+++ b/src/test/modules/test_ddl_deparse/expected/alter_ts_config.out
@@ -0,0 +1,8 @@
+--
+-- ALTER TEXT SEARCH CONFIGURATION
+--
+CREATE TEXT SEARCH CONFIGURATION en (copy=english);
+NOTICE:  DDL test: type simple, tag CREATE TEXT SEARCH CONFIGURATION
+ALTER TEXT SEARCH CONFIGURATION en
+  ALTER MAPPING FOR host, email, url, sfloat WITH simple;
+NOTICE:  DDL test: type alter text search configuration, tag ALTER TEXT SEARCH CONFIGURATION
diff --git a/src/test/modules/test_ddl_deparse/sql/alter_ts_config.sql b/src/test/modules/test_ddl_deparse/sql/alter_ts_config.sql
new file mode 100644
index 00..ac13e21dda
--- /dev/null
+++ b/src/test/modules/test_ddl_deparse/sql/alter_ts_config.sql
@@ -0,0 +1,8 @@
+--
+-- ALTER TEXT SEARCH CONFIGURATION
+--
+
+CREATE TEXT SEARCH CONFIGURATION en (copy=english);
+
+ALTER TEXT SEARCH CONFIGURATION en
+  ALTER MAPPING FOR host, email, url, sfloat WITH simple;
Re: [HACKERS] Rethinking our fulltext phrase-search implementation
Hello Tom,

On 17.12.2016 21:36, Tom Lane wrote:
> 4. The transformations are wrong anyway. The OR case I showed above is
> all right, but as I argued in <24331.1480199...@sss.pgh.pa.us>, the AND
> case is not:
>
> regression=# select 'a <-> (b & c)'::tsquery;
>           tsquery
> ---------------------------
>  'a' <-> 'b' & 'a' <-> 'c'
> (1 row)
>
> This matches 'a b a c', because 'a <-> b' and 'a <-> c' can each be
> matched at different places in that text; but it seems highly unlikely to
> me that that's what the writer of such a query wanted. (If she did want
> that, she would write it that way to start with.) NOT is not very nice
> either:

If I'm not mistaken, PostgreSQL 9.6 and master with the patch "fix-phrase-search.patch" return false for the query:

select 'a b a c' @@ 'a <-> (b & c)'::tsquery;
 ?column?
----------
 f
(1 row)

I agree that such a query is confusing. But maybe it is better to return true for such queries? Otherwise it seems that queries like 'a <-> (b & c)' will always return false. In that case we may need some warning message.

-- 
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
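To make the concern concrete, a small demonstration of the two behaviors discussed above (a sketch; the 'simple' configuration is used here so that the one-letter words aren't dropped as stop words, and the actual results depend on which patch is applied):

```sql
-- The distributive rewrite of the AND case:
SELECT 'a <-> (b & c)'::tsquery;
-- pre-patch this yields: 'a' <-> 'b' & 'a' <-> 'c'

-- Under that rewrite the two phrase tests may match at *different*
-- positions, so 'a b a c' matches even though no single position has
-- 'a' followed by both 'b' and 'c':
SELECT to_tsvector('simple', 'a b a c') @@ 'a <-> (b & c)'::tsquery;
```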
Re: [HACKERS] [sqlsmith] Crash in tsquery_rewrite/QTNBinary
Hi,

2016-12-07 9:06 GMT+03:00 Andreas Seltenreich:
> Hi,
>
> the following query crashes master as of 4212cb7.
>
> select ts_rewrite(
>   tsquery_phrase(
>     tsquery $$'sanct' & 'peter'$$,
>     tsquery $$'5' <-> '6'$$,
>     42),
>   tsquery $$'5' <-> '6'$$,
>   plainto_tsquery('I') );

This also happens for the queries:

select ts_rewrite( to_tsquery('5 & (6 | 5)'), to_tsquery('5'), to_tsquery('I'));
select ts_rewrite( to_tsquery('5 & 6 <-> 5'), to_tsquery('5'), to_tsquery('I'));

It happens because 'I' is a stop word, so the substitute query becomes empty. For the queries above we need a recursive dropvoidsubtree() function; without this patch that function cleans only the first level of the tree, and the query above becomes '6 | void'. At first I made dropvoidsubtree() recursive, but the attached patch instead cleans the query tree in dofindsubquery(), to avoid an extra tree scan.

-- 
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

diff --git a/src/backend/utils/adt/tsquery_rewrite.c b/src/backend/utils/adt/tsquery_rewrite.c
index e5bed6e..00174bd 100644
--- a/src/backend/utils/adt/tsquery_rewrite.c
+++ b/src/backend/utils/adt/tsquery_rewrite.c
@@ -186,6 +186,15 @@ findeq(QTNode *node, QTNode *ex, QTNode *subs, bool *isfind)
  * Recursive guts of findsubquery(): attempt to replace "ex" with "subs"
  * at the root node, and if we failed to do so, recursively match against
  * child nodes.
+ *
+ * Delete any void subtrees. In the following example '5' is replaced by empty
+ * operand:
+ *
+ *    AND      ->   6
+ *   /   \
+ *  5   PHRASE
+ *       /  \
+ *      6    5
  */
 static QTNode *
 dofindsubquery(QTNode *root, QTNode *ex, QTNode *subs, bool *isfind)
@@ -200,39 +209,20 @@ dofindsubquery(QTNode *root, QTNode *ex, QTNode *subs, bool *isfind)
 	if (root && (root->flags & QTN_NOCHANGE) == 0 &&
 		root->valnode->type == QI_OPR)
 	{
-		int			i;
-
-		for (i = 0; i < root->nchild; i++)
-			root->child[i] = dofindsubquery(root->child[i], ex, subs, isfind);
-	}
-
-	return root;
-}
-
-/*
- * Delete any void subtrees that may have been inserted when the replacement
- * subtree is void.
- */
-static QTNode *
-dropvoidsubtree(QTNode *root)
-{
-	if (!root)
-		return NULL;
-
-	if (root->valnode->type == QI_OPR)
-	{
 		int			i,
 					j = 0;
 
 		for (i = 0; i < root->nchild; i++)
 		{
-			if (root->child[i])
-			{
-				root->child[j] = root->child[i];
+			root->child[j] = dofindsubquery(root->child[i], ex, subs, isfind);
+			if (root->child[j])
 				j++;
-			}
 		}
+
+		/*
+		 * Delete any void subtrees that may have been inserted when the
+		 * replacement subtree is void.
+		 */
 		root->nchild = j;
 
 		if (root->nchild == 0)
@@ -267,9 +257,6 @@ findsubquery(QTNode *root, QTNode *ex, QTNode *subs, bool *isfind)
 	root = dofindsubquery(root, ex, subs, &DidFind);
 
-	if (!subs && DidFind)
-		root = dropvoidsubtree(root);
-
 	if (isfind)
 		*isfind = DidFind;
 
diff --git a/src/test/regress/expected/tsearch.out b/src/test/regress/expected/tsearch.out
index 55d6a85..9c20aaf 100644
--- a/src/test/regress/expected/tsearch.out
+++ b/src/test/regress/expected/tsearch.out
@@ -1251,6 +1251,14 @@ SELECT ts_rewrite('5 <-> (6 | 8)', 'SELECT keyword, sample FROM test_tsquery'::t
  '5' <-> '7' | '5' <-> '8'
 (1 row)
 
+-- Check empty substitute
+SELECT ts_rewrite(to_tsquery('5 & 6 <-> 5'), to_tsquery('5'), to_tsquery('I'));
+NOTICE:  text-search query contains only stop words or doesn't contain lexemes, ignored
+ ts_rewrite
+------------
+ '6'
+(1 row)
+
 SELECT keyword FROM test_tsquery WHERE keyword @> 'new';
  keyword
 
diff --git a/src/test/regress/sql/tsearch.sql b/src/test/regress/sql/tsearch.sql
index afd990e..8752344 100644
--- a/src/test/regress/sql/tsearch.sql
+++ b/src/test/regress/sql/tsearch.sql
@@ -418,6 +418,8 @@ SELECT ts_rewrite('1 & (2 <2> 3)', 'SELECT ke
Re: [HACKERS] [BUG?] pg_event_trigger_ddl_commands() error with ALTER TEXT SEARCH CONFIGURATION
2016-11-28 21:32 GMT+03:00 Robert Haas:
> You might need to add this patch to https://commitfest.postgresql.org/
> so that it doesn't get forgotten.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

I added the patch: https://commitfest.postgresql.org/12/895/

I added it to the next commitfest, because we are at the end of the current one.

-- 
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Re: [HACKERS] proposal: session server side variables
On 28.11.2016 10:42, Pavel Stehule wrote:
> next update - setattr, getattr functions are working now
>
> notes, comments?
>
> Regards
> Pavel

It is interesting! Do you have plans to support table variables as well? For example, like this:

create type composite_type_2 as (a int, b text);
create variable var7 composite_type_2;
select insertvar('var7', '(10,Hello world\, Hello world\, Hello world)');
select insertvar('var7', '(1000,Hola, hola!)');
select * from getvar('var7');
  a   |                   b
------+---------------------------------------
   10 | Hello world, Hello world, Hello world
 1000 | Hola, hola!

Or is it a bad idea? Or is it not related to this patch?

We have an extension (https://github.com/postgrespro/pg_variables) which supports table-like variables. It shows better performance than temporary tables.

-- 
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
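For comparison, the temporary-table approach that table-like session variables are usually measured against - a sketch under my own naming, not tied to either the patch or the extension:

```sql
-- Session-local state via a temporary table.  Every session must
-- (re)create the table, and each creation touches the system catalogs
-- and generates catalog bloat, which is the overhead session variables
-- are meant to avoid.
CREATE TEMP TABLE session_state (a int, b text) ON COMMIT PRESERVE ROWS;

INSERT INTO session_state VALUES
    (10,   'Hello world, Hello world, Hello world'),
    (1000, 'Hola, hola!');

SELECT * FROM session_state;
```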
Re: [HACKERS] [BUG?] pg_event_trigger_ddl_commands() error with ALTER TEXT SEARCH CONFIGURATION
Thank you for the answers.

2016-11-19 21:28 GMT+03:00 Michael Paquier:
> On Thu, Nov 17, 2016 at 1:17 PM, Alvaro Herrera wrote:
>> It's a bug. You're right that we need to handle the object class
>> somewhere. Perhaps I failed to realize that tsconfigs could get
>> altered.
>
> It seems to me that the thing to be careful of here is how a new
> OBJECT_TSCONFIGURATIONMAP should use ObjectAddress. It does not seem
> that complicated, but it needs some work.
> --
> Michael

After some investigation it seems to me that OBJECT_TSCONFIGURATIONMAP can't be added for pg_ts_config_map, because this catalog doesn't have OIDs. But the bug can be fixed easily (patch attached).

First, I think AlterTSConfiguration() should use TSConfigRelationId instead of TSConfigMapRelationId. Second, in ProcessUtilitySlow() we can set commandCollected = true, because the configuration entry is added to currentEventTriggerState->commandList in EventTriggerCollectAlterTSConfig().

This patch only fixes the bug. But I could also do a patch which makes pg_event_trigger_ddl_commands() report pg_ts_config_map entries, if someone wants it. That could perhaps be done with a new entry in the CollectedCommandType structure.
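A minimal way to exercise the affected code path (a sketch: dump_ddl/dump_ddl_trg are hypothetical names, and before the fix it was the ALTER MAPPING command below that made pg_event_trigger_ddl_commands() fail):

```sql
-- An event trigger that reads pg_event_trigger_ddl_commands() at
-- ddl_command_end for every DDL statement:
CREATE FUNCTION dump_ddl() RETURNS event_trigger
LANGUAGE plpgsql AS $$
DECLARE
    r record;
BEGIN
    FOR r IN SELECT * FROM pg_event_trigger_ddl_commands() LOOP
        RAISE NOTICE 'tag: %, identity: %', r.command_tag, r.object_identity;
    END LOOP;
END $$;

CREATE EVENT TRIGGER dump_ddl_trg ON ddl_command_end
    EXECUTE PROCEDURE dump_ddl();

-- The statement that hit the bug:
CREATE TEXT SEARCH CONFIGURATION en (COPY = english);
ALTER TEXT SEARCH CONFIGURATION en
    ALTER MAPPING FOR asciiword WITH simple;
```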
-- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company diff --git a/src/backend/commands/tsearchcmds.c b/src/backend/commands/tsearchcmds.c index b240113..2b84848 100644 --- a/src/backend/commands/tsearchcmds.c +++ b/src/backend/commands/tsearchcmds.c @@ -1215,10 +1215,10 @@ AlterTSConfiguration(AlterTSConfigurationStmt *stmt) /* Update dependencies */ makeConfigurationDependencies(tup, true, relMap); - InvokeObjectPostAlterHook(TSConfigMapRelationId, + InvokeObjectPostAlterHook(TSConfigRelationId, HeapTupleGetOid(tup), 0); - ObjectAddressSet(address, TSConfigMapRelationId, cfgId); + ObjectAddressSet(address, TSConfigRelationId, cfgId); heap_close(relMap, RowExclusiveLock); diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c index f50ce40..1217d1a 100644 --- a/src/backend/tcop/utility.c +++ b/src/backend/tcop/utility.c @@ -1477,7 +1477,8 @@ ProcessUtilitySlow(ParseState *pstate, break; case T_AlterTSConfigurationStmt: -address = AlterTSConfiguration((AlterTSConfigurationStmt *) parsetree); +AlterTSConfiguration((AlterTSConfigurationStmt *) parsetree); +commandCollected = true; break; case T_AlterTableMoveAllStmt: -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Re: [HACKERS] Re: [PATCH] Tab completion for ALTER TYPE … RENAME VALUE …
Hello,

2016-09-12 16:16 GMT+03:00 Dagfinn Ilmari Mannsåker:
> I've added it to the 2016-11 commit fest:
> https://commitfest.postgresql.org/11/795/
>
> - ilmari

I've tested your patch. It applies to master cleanly, so there seems to be no need to rebase it, and PostgreSQL compiles without errors with the patch. I've tested it with the type:

CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple');

And the following completions work as expected:

=> ALTER TYPE rainbow RENAME
ATTRIBUTE  TO  VALUE
=> ALTER TYPE rainbow RENAME VALUE
'blue'    'green'   'orange'  'purple'  'red'     'yellow'

It seems the patch can be committed without any fixes, so I marked it as "Ready for Committer".

-- 
Sincerely,
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Re: [HACKERS] Bug in to_timestamp().
Thank you for your comments.

2016-11-04 20:36 GMT+02:00 Tom Lane:
> Artur Zakirov writes:
>> I attached new version of the patch, which fixes the is_char_separator()
>> declaration too.
>
> I did some experimenting using
> http://rextester.com/l/oracle_online_compiler
>
> which makes it look a lot like Oracle treats separator characters in the
> pattern the same as spaces (but I haven't checked their documentation to
> confirm that).
>
> The proposed patch doesn't seem to me to be trying to follow
> these Oracle behaviors, but I think there is very little reason for
> changing any of this stuff unless we move it closer to Oracle.

Previous versions of the patch didn't try to follow all Oracle behaviors; they tried to fix the behaviors Amul Sul reported, because I was confused by how Oracle's to_timestamp() deals with spaces and separators, and there is no information about it in the Oracle documentation:
https://docs.oracle.com/cd/B19306_01/server.102/b14200/sql_elements004.htm#g195443

I've now thought more about it and fixed the patch. Now the parser removes spaces before and after fields and insists that the count of separators in the input string match the count of spaces or separators in the formatting string (though the formatting string may contain more separators than the input string).

> Some other nitpicking:
>
> * I think the is-separator function would be better coded like
>
> static bool
> is_separator_char(const char *str)
> {
>     /* ASCII printable character, but not letter or digit */
>     return (*str > 0x20 && *str < 0x7F &&
>             !(*str >= 'A' && *str <= 'Z') &&
>             !(*str >= 'a' && *str <= 'z') &&
>             !(*str >= '0' && *str <= '9'));
> }
>
> The previous way is neither readable nor remarkably efficient, and it
> knows much more about the ASCII character set than it needs to.

Fixed.

> * Don't forget the cast to unsigned char when using isspace() or other
> functions.

Fixed.
> > * I do not see the reason for throwing an error here: > > + /* Previous character was a backslash */ > + if (in_backslash) > + { > + /* After backslash should go non-space character */ > + if (isspace(*str)) > + ereport(ERROR, > + > (errcode(ERRCODE_SYNTAX_ERROR), > +errmsg("invalid escape > sequence"))); > + in_backslash = false; > > Why shouldn't backslash-space be a valid quoting sequence? > Hm, truly. Fixed it. I've attached the fixed patch. -- Sincerely, Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 2e64cc4..5a4e248 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -6159,7 +6159,8 @@ SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}'); to_timestamp and to_date skip multiple blank spaces in the input string unless the FX option is used. For example, - to_timestamp('2000JUN', ' MON') works, but + to_timestamp('2000JUN', ' MON') and + to_timestamp('2000 JUN', 'MON') work, but to_timestamp('2000JUN', 'FX MON') returns an error because to_timestamp expects one space only. FX must be specified as the first item in @@ -6169,6 +6170,19 @@ SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}'); + to_timestamp and to_date don't + skip multiple printable non letter and non digit characters in the input + string, but skip them in the formatting string. For example, + to_timestamp('2000-JUN', '/MON') and + to_timestamp('2000/JUN', '//MON') work, but + to_timestamp('2000//JUN', '/MON') + returns an error because count of the "/" character in the input string + doesn't match count of it in the formatting string. + + + + + Ordinary text is allowed in to_char templates and will be output literally. 
You can put a substring in double quotes to force it to be interpreted as literal text diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c index d4eaa50..d28ceec 100644 --- a/src/backend/utils/adt/formatting.c +++ b/src/back
Re: [HACKERS] [PATCH] Generic type subscription
Hello,

Do you have an updated version of the patch?

2016-10-18 20:41 GMT+03:00 Dmitry Dolgov <9erthali...@gmail.com>:
>> The term "subscription" is confusing me
>
> Yes, you're right. "container" is too general I think, so I renamed
> everything to "subscripting".

There are other occurrences of "subscription" left, including in file names. Please fix them.

Also, I've sent a personal email noting that we have a performance degradation of SELECT queries:

create table test ( a int2[], b int4[][][]);

With the patch:

=> insert into test (a[1:5], b[1:1][1:2][1:2]) select '{1,2,3,4,5}', '{{{0,0},{1,2}}}' from generate_series(1,10);
Time: 936,285 ms
=> UPDATE test SET a[0] = '2';
Time: 1605,406 ms (00:01,605)
=> UPDATE test SET b[1:1][1:1][1:2] = '{113, 117}';
Time: 1603,076 ms (00:01,603)
=> explain analyze select a[1], b[1][1][1] from test;
                                             QUERY PLAN
-----------------------------------------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..3858.00 rows=10 width=6) (actual time=0.035..241.280 rows=10 loops=1)
 Planning time: 0.087 ms
 Execution time: 246.916 ms
(3 rows)

And without the patch:

=> insert into test (a[1:5], b[1:1][1:2][1:2]) select '{1,2,3,4,5}', '{{{0,0},{1,2}}}' from generate_series(1,10);
Time: 971,677 ms
=> UPDATE test SET a[0] = '2';
Time: 1508,262 ms (00:01,508)
=> UPDATE test SET b[1:1][1:1][1:2] = '{113, 117}';
Time: 1473,459 ms (00:01,473)
=> explain analyze select a[1], b[1][1][1] from test;
                                             QUERY PLAN
----------------------------------------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..5286.00 rows=10 width=6) (actual time=0.024..98.475 rows=10 loops=1)
 Planning time: 0.081 ms
 Execution time: 105.055 ms
(3 rows)

-- 
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Re: [HACKERS] FTS Configuration option
On 13.10.2016 11:54, Emre Hasegeli wrote:
>> Maybe it is also better to use -> instead of AND? AND would have another
>> behaviour. I could create the following configuration:
>>
>> => ALTER TEXT SEARCH CONFIGURATION multi_conf
>>    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart, word, hword, hword_part
>>    WITH (german_ispell AND english_ispell) OR simple;
>>
>> which would return both german_ispell and english_ispell results. But I'm
>> not sure that this is a good solution.
>
> I see your use case for AND. It might indeed be useful, and AND suits it
> well. Maybe THEN can be the keyword instead of -> to pass the results to
> subsequent dictionaries. They are all reserved keywords, so I guess it
> wouldn't be a problem to use them.

I agree with THEN. It is better than using ->, I think, and I suppose it wouldn't be a problem either. I think it is necessary to fix gram.y and implement the logic for OR, AND and THEN.

>> Of course if this syntax is implemented, the old syntax with commas should
>> also be maintained.
>
> Yes, we definitely should. The comma can be interpreted as either one of
> the keywords depending on the left-hand-side dictionary. I would be glad
> to review, if you develop this feature.

Then I will develop it :). But I suppose I can do it a few days or weeks later, because I have other tasks with higher priority.

BTW, I already implemented the USING option a few weeks ago: https://github.com/select-artur/postgres/tree/join_tsconfig . But of course it is not useful now.

-- 
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
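To illustrate, a purely hypothetical sketch of how a THEN chain might read once implemented (none of this syntax exists; the dictionary names are taken from the Turkish example earlier in the thread):

```sql
ALTER TEXT SEARCH CONFIGURATION turkish
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                      word, hword, hword_part
    WITH fix_mistyped_characters
         THEN (turkish_hunspell OR turkish_stem)
         THEN unaccent;
-- Each THEN stage would receive the lexemes produced by the previous
-- stage; if a stage returns NULL, processing of that token stops.
```

This would replace the hard-coded TSL_FILTER mechanism with something the user can express directly in the mapping.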
Re: [HACKERS] FTS Configuration option
Thank you for sharing your thoughts!

2016-10-12 15:08 GMT+03:00 Emre Hasegeli:
> However then the stemmer doesn't do a good job on those words, because
> the changed characters are important for the language. What I really
> needed was something like this:
>
>> ALTER TEXT SEARCH CONFIGURATION turkish
>>    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart, word, hword, hword_part
>>    WITH (fix_mistyped_characters AND (turkish_hunspell OR turkish_stem) AND unaccent);

Your syntax looks more flexible and prettier than the JOIN option. As I understand it, there are three steps here: at each step a dictionary returns a lexeme and passes it to the next dictionary, and if a dictionary returns NULL the processing is interrupted.

With such syntax we also don't need the TSL_FILTER flag on lexemes. At the current time the unaccent extension sets this flag to pass a lexeme on to the next dictionary, and the flag is used by the text-search parser. That looks like a hard-coded solution which the user can't change.

Maybe it is also better to use -> instead of AND? AND would have another behaviour. I could create the following configuration:

=> ALTER TEXT SEARCH CONFIGURATION multi_conf
   ALTER MAPPING FOR asciiword, asciihword, hword_asciipart, word, hword, hword_part
   WITH (german_ispell AND english_ispell) OR simple;

which would return both german_ispell and english_ispell results. But I'm not sure that this is a good solution.

Of course if this syntax is implemented, the old syntax with commas should also be maintained.

-- 
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
[HACKERS] FTS Configuration option
Hello hackers,

Sometimes there is a need to collect lexemes from various dictionaries, for example if we have a column with text in various languages. Let's say there is a new option, JOIN. This option would allow the parser to append lexemes from the current dictionary and then go on to the next dictionary to get more lexemes:

=> CREATE TEXT SEARCH CONFIGURATION multi_conf (COPY=simple);
=> ALTER TEXT SEARCH CONFIGURATION multi_conf ALTER MAPPING FOR asciiword, asciihword, hword_asciipart, word, hword, hword_part WITH german_ispell (JOIN), english_ispell, simple;

This gives the following cases:
- lexeme found in german_ispell but not in english_ispell: return the lexeme from german_ispell;
- lexeme not found in german_ispell but found in english_ispell: return the lexeme from english_ispell;
- lexeme found in neither dictionary: return the lexeme from simple;
- lexemes found in both dictionaries: return the lexemes from both.

Would such an option be useful to the community? The name of the option is debatable. Thank you!

-- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Re: [HACKERS] [PATCH] Generic type subscription
e(Oid *containerType, int32 *containerTypmod)

I think the name of the function is confusing. Maybe use transformContainerType()?

+JsonbValue * +JsonbToJsonbValue(Jsonb *jsonb, JsonbValue *val) +{

In this function we could palloc a JsonbValue every time and drop the val argument. And maybe it would be better to copy the whole JsonbContainer from jsonb->root using memcpy(), instead of assigning a reference to jsonb->root, as is done in JsonbValueToJsonb(). The note about JsonbToJsonbValue() in the JsonbValueToJsonb() comments needs to be removed.

-static JsonbValue * +JsonbValue * setPath(JsonbIterator **it, Datum *path_elems,

Why did you remove the static keyword? setPath() is still static.

- JsonbValue v; + JsonbValue v, *new = (JsonbValue *) newval;

Is the declaration of the "new" variable necessary here? I think it is an extra declaration. Also, its name may cause problems: for example, there is a thread where people are trying to port PostgreSQL to C++, and new is a C++ keyword.

-- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Re: [HACKERS] Bug in to_timestamp().
2016-09-29 13:54 GMT+05:00 amul sul : > > On Thu, Sep 29, 2016 at 2:48 AM, Tom Lane wrote: > > > > I started looking at your 0001-to-timestamp-format-checking-v4.patch > > and this point immediately jumped out at me. Currently the code relies > > ... without any documentation ... on no elog being thrown out of > > parse_format(). That's at the very least trouble waiting to happen. > > There's a hack to deal with errors from within the NUMDesc_prepare > > subroutine, but it's a pretty ugly and underdocumented hack. And what > > you had here was randomly different from that solution, too. > > > > After a bit of thought it seemed to me that a much cleaner fix is to add > > a "valid" flag to the cache entries, which we can leave clear until we > > have finished parsing the new format string. That avoids adding extra > > data copying as you suggested, removes the need for PG_TRY, and just > > generally seems cleaner and more bulletproof. > > > > I've pushed a patch that does it that way. The 0001 patch will need > > to be rebased over that (might just require removal of some hunks, > > not sure). > > > > I also pushed 0002-to-timestamp-validation-v2.patch with some revisions > > (it'd broken acceptance of BC dates, among other things, but I think > > I fixed everything).

Thank you for committing the 0002 part of the patch! I wanted to fix the cache functions too, but wasn't sure about that.

> > > > Since you told us earlier that you'd be on vacation through the end of > > September, I'm assuming that nothing more will happen on this patch during > > this commitfest, so I will mark the CF entry Returned With Feedback. > > On behalf of Artur, I've rebased the patch and removed the hunk dealing with > broken cache entries by copying them, which is no longer required after > commit 83bed06. > > The commitfest status was left untouched; I am not sure how to deal with > a "Returned With Feedback" patch. Is there any chance that this can be > considered again for committer review?
Thank you for fixing the patch! Today I have internet access and am able to fix and test the patch.

I've looked at your 0001-to-timestamp-format-checking-v5.patch. It seems fine to me. I suppose it is necessary to fix the is_char_separator() declaration from:

static bool is_char_separator(char *str);

to:

static bool is_char_separator(const char *str);

because the *str argument in parse_format() is now const. I attached a new version of the patch, which fixes the is_char_separator() declaration too. Sorry for the confusion!

-- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company

0001-to-timestamp-format-checking-v6.patch Description: Binary data
Re: [HACKERS] Bug in to_timestamp().
On 25.08.2016 13:26, amul sul wrote:

Thanks. I've created the entry in https://commitfest.postgresql.org/10/713/ . You can add yourself as a reviewer.

Done, added myself as reviewer & changed status to "Ready for Committer". Thanks! Regards, Amul Sul

It seems there is no need to rebase the patches. But I will not be able to fix the patches for two weeks because of travel, in case someone has a note or remark. So it would be nice if someone else could fix possible issues for that reason.

-- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Re: [HACKERS] Bug in to_timestamp().
You are right. I assigned NODE_TYPE_SPACE to prev_type to be able to execute a query such as: SELECT to_timestamp('---2000JUN', ' MON'); Is that proper behaviour?

Looks good to me, no one will complain if something works on PG but not on Oracle.

Thanks. I've created the entry in https://commitfest.postgresql.org/10/713/ . You can add yourself as a reviewer.

-- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Re: [HACKERS] Bug in to_timestamp().
Hi,

#1. Whitespace @ line # 317.

Sorry, fixed.

#2. Warning at compilation: formatting.c: In function ‘do_to_timestamp’: formatting.c:3049:37: warning: ‘prev_type’ may be used uninitialized in this function [-Wmaybe-uninitialized] if (prev_type == NODE_TYPE_SPACE || prev_type == NODE_TYPE_SEPARATOR) ^ formatting.c:2988:5: note: ‘prev_type’ was declared here prev_type; ^ You can avoid this by assigning zero (or introducing NODE_TYPE_INVAL) to prev_type at the following line: 256 + prev_type;

You are right. I assigned NODE_TYPE_SPACE to prev_type to be able to execute a query such as: SELECT to_timestamp('---2000JUN', ' MON'); Is that proper behaviour?

-- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 5c1c4f6..36d8b3e 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -6146,9 +6146,12 @@ SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}'); to_timestamp and to_date - skip multiple blank spaces in the input string unless the + skip multiple blank spaces and printable non letter and non digit + characters in the input string and in the formatting string unless the FX option is used. For example, - to_timestamp('2000JUN', ' MON') works, but + to_timestamp('2000JUN', ' MON'), + to_timestamp('2000JUN', ' MON') + and to_timestamp('2000 JUN', 'MON') work, but to_timestamp('2000JUN', 'FX MON') returns an error because to_timestamp expects one space only. 
FX must be specified as the first item in diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c index bbd97dc..7430013 100644 --- a/src/backend/utils/adt/formatting.c +++ b/src/backend/utils/adt/formatting.c @@ -169,6 +169,8 @@ struct FormatNode #define NODE_TYPE_END 1 #define NODE_TYPE_ACTION 2 #define NODE_TYPE_CHAR 3 +#define NODE_TYPE_SEPARATOR 4 +#define NODE_TYPE_SPACE 5 #define SUFFTYPE_PREFIX 1 #define SUFFTYPE_POSTFIX 2 @@ -947,6 +949,7 @@ typedef struct NUMProc static const KeyWord *index_seq_search(char *str, const KeyWord *kw, const int *index); static const KeySuffix *suff_search(char *str, const KeySuffix *suf, int type); +static bool is_char_separator(char *str); static void NUMDesc_prepare(NUMDesc *num, FormatNode *n); static void parse_format(FormatNode *node, char *str, const KeyWord *kw, const KeySuffix *suf, const int *index, int ver, NUMDesc *Num); @@ -963,7 +966,6 @@ static void dump_node(FormatNode *node, int max); static const char *get_th(char *num, int type); static char *str_numth(char *dest, char *num, int type); static int adjust_partial_year_to_2020(int year); -static int strspace_len(char *str); static void from_char_set_mode(TmFromChar *tmfc, const FromCharDateMode mode); static void from_char_set_int(int *dest, const int value, const FormatNode *node); static int from_char_parse_int_len(int *dest, char **src, const int len, FormatNode *node); @@ -1036,6 +1038,17 @@ suff_search(char *str, const KeySuffix *suf, int type) return NULL; } +static bool +is_char_separator(char *str) +{ + return ((pg_mblen(str) == 1) && + /* printable character, but not letter and digit */ + ((*str >= '!' 
&& *str <= '/') || + (*str >= ':' && *str <= '@') || + (*str >= '[' && *str <= '`') || + (*str >= '{' && *str <= '~'))); +} + /* -- * Prepare NUMDesc (number description struct) via FormatNode struct * -- @@ -1237,9 +1250,10 @@ parse_format(FormatNode *node, char *str, const KeyWord *kw, { const KeySuffix *s; FormatNode *n; + bool in_text = false, +in_backslash = false; int node_set = 0, -suffix, -last = 0; +suffix; #ifdef DEBUG_TO_FROM_CHAR elog(DEBUG_elog_output, "to_char/number(): run parser"); @@ -1251,6 +1265,55 @@ parse_format(FormatNode *node, char *str, const KeyWord *kw, { suffix = 0; + /* Previous character was a backslash */ + if (in_backslash) + { + /* After backslash should go non-space character */ + if (isspace(*str)) +ereport(ERROR, + (errcode(ERRCODE_SYNTAX_ERROR), + errmsg("invalid escape sequence"))); + in_backslash = false; + + n->type = NODE_TYPE_CHAR; + n->character = *str; + n->key = NULL; + n->suffix = 0; + n++; + str++; + continue; + } + /* Previous character was a quote */ + else if (in_text) + { + if (*str == '"') + { +str++; +in_text = false; + } + else if (*str == '\\') + { +str++; +in
Re: [HACKERS] Bug in to_timestamp().
Sorry, I did not get the last two mails from Amul. I don't know why. So I am replying to another mail.

Documented as a working case, but unfortunately it does not work: postgres=# SELECT to_timestamp('2000JUN', ' MON'); ERROR: invalid value "---" for "MON" DETAIL: The given value did not match any of the allowed values for this field.

Indeed! Fixed it. Now this query executes without error. Added this case to the tests.

The NODE_TYPE_CHAR assumption in the else block seems incorrect. What if we have a space after the double quote? It will again produce incorrect output without any error, see below: postgres=# SELECT to_timestamp('Year: 1976, Month: May, Day: 16', postgres(# '"Year:" , "Month:" FMMonth, "Day:" DD'); to_timestamp -- 0006-05-16 00:00:00-07:52:58 (1 row) I guess we might need NODE_TYPE_SEPARATOR and NODE_TYPE_SPACE checks as well?

Agree. Fixed and added to the tests.

Unnecessary hunk? We should not touch any code unless it is necessary to implement the proposed feature, otherwise it will add an unnecessary diff to the patch and eventually extra burden to review the code. Similarly the hunk in the patch @ lines # 313 - 410 has nothing to do with the to_timestamp behaviour improvement, IIUC. If you think these changes need to be in, please submit a separate cleanup patch.

Fixed it. It was a typo. About lines # 313 - 410: they are necessary to avoid bugs. I wrote about it in https://www.postgresql.org/message-id/b2a39359-3282-b402-f4a3-057aae500ee7%40postgrespro.ru : - now DCH_cache_getnew() is called after parse_format(). Because now parse_format() can raise an error, and on the next attempt DCH_cache_search() could return a broken cache entry. For example, you can test incorrect inputs for to_timestamp(). Try executing such a query several times. 
-- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 6355300..0fe50e1 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -6146,9 +6146,12 @@ SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}'); to_timestamp and to_date - skip multiple blank spaces in the input string unless the + skip multiple blank spaces and printable non letter and non digit + characters in the input string and in the formatting string unless the FX option is used. For example, - to_timestamp('2000JUN', ' MON') works, but + to_timestamp('2000JUN', ' MON'), + to_timestamp('2000JUN', ' MON') + and to_timestamp('2000 JUN', 'MON') work, but to_timestamp('2000JUN', 'FX MON') returns an error because to_timestamp expects one space only. FX must be specified as the first item in diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c index bbd97dc..a3dbcaf 100644 --- a/src/backend/utils/adt/formatting.c +++ b/src/backend/utils/adt/formatting.c @@ -169,6 +169,8 @@ struct FormatNode #define NODE_TYPE_END 1 #define NODE_TYPE_ACTION 2 #define NODE_TYPE_CHAR 3 +#define NODE_TYPE_SEPARATOR 4 +#define NODE_TYPE_SPACE 5 #define SUFFTYPE_PREFIX 1 #define SUFFTYPE_POSTFIX 2 @@ -947,6 +949,7 @@ typedef struct NUMProc static const KeyWord *index_seq_search(char *str, const KeyWord *kw, const int *index); static const KeySuffix *suff_search(char *str, const KeySuffix *suf, int type); +static bool is_char_separator(char *str); static void NUMDesc_prepare(NUMDesc *num, FormatNode *n); static void parse_format(FormatNode *node, char *str, const KeyWord *kw, const KeySuffix *suf, const int *index, int ver, NUMDesc *Num); @@ -963,7 +966,6 @@ static void dump_node(FormatNode *node, int max); static const char *get_th(char *num, int type); static char *str_numth(char *dest, char *num, int type); static int adjust_partial_year_to_2020(int year); -static int 
strspace_len(char *str); static void from_char_set_mode(TmFromChar *tmfc, const FromCharDateMode mode); static void from_char_set_int(int *dest, const int value, const FormatNode *node); static int from_char_parse_int_len(int *dest, char **src, const int len, FormatNode *node); @@ -1036,6 +1038,17 @@ suff_search(char *str, const KeySuffix *suf, int type) return NULL; } +static bool +is_char_separator(char *str) +{ + return ((pg_mblen(str) == 1) && + /* printable character, but not letter and digit */ + ((*str >= '!' && *str <= '/') || + (*str >= ':' && *str <= '@') || + (*str >= '[' && *str <= '`') || + (*str >= '{' && *str <= '~'))); +} + /* --
Re: [HACKERS] Bug in to_timestamp().
I attached new patch "0001-to-timestamp-format-checking-v2.patch". It fixes behaviour for Amul's scenarious: Following are few scenarios where we break existing behaviour: SELECT TO_TIMESTAMP('2015-12-31 13:43:36', ' MM DD HH24 MI SS'); SELECT TO_TIMESTAMP('2011$03!18 23_38_15', '-MM-DD HH24:MI:SS'); SELECT TO_TIMESTAMP('2011*03*18 23^38&15', '-MM-DD HH24:MI:SS'); SELECT TO_TIMESTAMP('2011*03!18 #%23^38$15', '-MM-DD$$$HH24:MI:SS'); But current patch behaviour is not that much bad either at least we have errors, but I am not sure about community acceptance. I would like to divert communities' attention on following case: SELECT TO_TIMESTAMP('2013--10-01', '-MM-DD'); For queries above the patch gives an output without any error. Another is, shouldn’t we have error in following cases? SELECT TO_TIMESTAMP('2016-06-13 99:99:99', '-MM-DD HH24:MI:SS'); SELECT TO_TIMESTAMP('2016-02-30 15:43:36', '-MM-DD HH24:MI:SS'); I attached second patch "0002-to-timestamp-validation-v2.patch". With it PostgreSQL perform additional checks for date and time. But as I wrote there is another patch in the thread "to_date_valid()" wich differs from this patch. Sincerely, -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 7830334..559c55f 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -6083,9 +6083,12 @@ SELECT regexp_matches('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}'); to_timestamp and to_date - skip multiple blank spaces in the input string unless the + skip multiple blank spaces and printable non letter and non digit + characters in the input string and in the formatting string unless the FX option is used. 
For example, - to_timestamp('2000JUN', ' MON') works, but + to_timestamp('2000JUN', ' MON'), + to_timestamp('2000JUN', ' MON') + and to_timestamp('2000 JUN', 'MON') work, but to_timestamp('2000JUN', 'FX MON') returns an error because to_timestamp expects one space only. FX must be specified as the first item in diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c index bbd97dc..b14678d 100644 --- a/src/backend/utils/adt/formatting.c +++ b/src/backend/utils/adt/formatting.c @@ -169,6 +169,8 @@ struct FormatNode #define NODE_TYPE_END 1 #define NODE_TYPE_ACTION 2 #define NODE_TYPE_CHAR 3 +#define NODE_TYPE_SEPARATOR 4 +#define NODE_TYPE_SPACE 5 #define SUFFTYPE_PREFIX 1 #define SUFFTYPE_POSTFIX 2 @@ -947,6 +949,7 @@ typedef struct NUMProc static const KeyWord *index_seq_search(char *str, const KeyWord *kw, const int *index); static const KeySuffix *suff_search(char *str, const KeySuffix *suf, int type); +static bool is_char_separator(char *str); static void NUMDesc_prepare(NUMDesc *num, FormatNode *n); static void parse_format(FormatNode *node, char *str, const KeyWord *kw, const KeySuffix *suf, const int *index, int ver, NUMDesc *Num); @@ -963,7 +966,6 @@ static void dump_node(FormatNode *node, int max); static const char *get_th(char *num, int type); static char *str_numth(char *dest, char *num, int type); static int adjust_partial_year_to_2020(int year); -static int strspace_len(char *str); static void from_char_set_mode(TmFromChar *tmfc, const FromCharDateMode mode); static void from_char_set_int(int *dest, const int value, const FormatNode *node); static int from_char_parse_int_len(int *dest, char **src, const int len, FormatNode *node); @@ -1036,6 +1038,17 @@ suff_search(char *str, const KeySuffix *suf, int type) return NULL; } +static bool +is_char_separator(char *str) +{ + return ((pg_mblen(str) == 1) && + /* printable character, but not letter and digit */ + ((*str >= '!' 
&& *str <= '/') || + (*str >= ':' && *str <= '@') || + (*str >= '[' && *str <= '`') || + (*str >= '{' && *str <= '~'))); +} + /* -- * Prepare NUMDesc (number description struct) via FormatNode struct * -- @@ -1237,9 +1250,10 @@ parse_format(FormatNode *node, char *str, const KeyWord *kw, { const KeySuffix *s; FormatNode *n; + bool in_text = false, +in_backslash = false; int node_set = 0, -suffix, -last = 0; +suffix; #ifdef DEBUG_TO_FROM_CHAR elog(DEBUG_elog_output, "to_char/number(): run parser"); @@ -1251,6 +1265,49 @@ parse_format(FormatNode *node,
Re: [HACKERS] Bug in to_timestamp().
On 15.08.2016 19:28, Robert Haas wrote:

Well, what's the Oracle behavior in any of these cases? I don't think we can decide to change any of this behavior without knowing that. If a proposed behavior change is incompatible with our previous releases, I think it'd better at least be more compatible with Oracle. Otherwise, we're just changing from an established behavior that we invented ourselves to a new behavior we invented ourselves, which is only worthwhile if it's absolutely clear that the new behavior is way better.

1 - Oracle's output for the first queries is: -> SELECT TO_TIMESTAMP('2015-12-31 13:43:36', ' MM DD HH24 MI SS') FROM dual; TO_TIMESTAMP('2015-12-3113:43:36','MMDDHH24MISS') --- 31-DEC-15 01.43.36.0 PM -> SELECT TO_TIMESTAMP('2011$03!18 23_38_15', '-MM-DD HH24:MI:SS') FROM dual; TO_TIMESTAMP('2011$03!1823_38_15','-MM-DDHH24:MI:SS') --- 18-MAR-11 11.38.15.0 PM -> SELECT TO_TIMESTAMP('2011*03!18 #%23^38$15', '-MM-DD$$$HH24:MI:SS') FROM dual; TO_TIMESTAMP('2011*03!18#%23^38$15','-MM-DD$$$HH24:MI:SS') --- 18-MAR-11 11.38.15.0 PM

PostgreSQL with the patch gives "ERROR: expected space character in given string". I will fix this.

2 - Oracle's output for the query with a hyphen is: -> SELECT TO_TIMESTAMP('2013--10-01', '-MM-DD') FROM dual; SELECT TO_TIMESTAMP('2013--10-01', '-MM-DD') FROM dual * ERROR at line 1: ORA-01843: not a valid month

Here PostgreSQL with the patch does not give an error, so I will fix this too.

3 - The last two queries give an error. This patch does not handle such queries intentionally, because there is the thread https://www.postgresql.org/message-id/57786490.9010201%40wars-nicht.de . That thread concerns the to_date() function, but it should concern to_timestamp() also. So I suppose there should be a separate patch for this last case. 
-- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Re: [HACKERS] to_date_valid()
On 15.08.2016 14:33, Andreas 'ads' Scherbaum wrote:

Is extra transformation and comparison really the right way to validate a date? Maybe validate the date by using ValidateDate(). Sample patch attached.

This does not solve the problem at hand, and lets wrong dates/formats slip through: ./buildclient.py -v -c demo-config-pg.yaml --run-configure --run-make --run-install --no-clean-at-all --patch 'https://www.postgresql.org/message-id/95738e12-6ed6-daf5-9dcf-6336072e6b15%40postgrespro.ru' postgres=# SELECT to_date('2011 12 18', ' MM DD'); to_date 2011-12-08 (1 row) That is from the regression tests, and obviously handles the date transformation wrong. My attempt catches this, because I compare the date with the input date, and do not rely on a valid date only.

I suppose that your sample query is another issue, not a date-validation task. I sent the patch to the thread https://www.postgresql.org/message-id/b2a39359-3282-b402-f4a3-057aae500...@postgrespro.ru . It fixes formatting issues. I thought it would be better to split these issues into: - validation of the input date/timestamp string and input format string - validation of the resulting date/timestamp

-- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Re: [HACKERS] to_date_valid()
On 14.08.2016 01:52, Andreas 'ads' Scherbaum wrote: Attached is a patch to "do the right thing". The verification is in "to_date()" now, the extra function is removed. Regression tests are updated - two or three of them returned a wrong date before, and still passed. They fail now. Documentation is also updated. Regards, Is it right and "true" way to validate date by extra transforming and comparison? Maybe validate date by using ValidateDate(). Attached sample patch. -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c index bbd97dc..c0048c9 100644 --- a/src/backend/utils/adt/formatting.c +++ b/src/backend/utils/adt/formatting.c @@ -3563,7 +3563,9 @@ do_to_timestamp(text *date_txt, text *fmt, { FormatNode *format; TmFromChar tmfc; - int fmt_len; + int fmt_len, +fmask = 0; /* Bit mask for ValidateDate() */ + char *date_str = NULL; ZERO_tmfc(&tmfc); ZERO_tm(tm); @@ -3574,7 +3576,6 @@ do_to_timestamp(text *date_txt, text *fmt, if (fmt_len) { char *fmt_str; - char *date_str; bool incache; fmt_str = text_to_cstring(fmt); @@ -3630,7 +3631,6 @@ do_to_timestamp(text *date_txt, text *fmt, DCH_from_char(format, date_str, &tmfc); - pfree(date_str); pfree(fmt_str); if (!incache) pfree(format); @@ -3706,6 +3706,7 @@ do_to_timestamp(text *date_txt, text *fmt, if (tmfc.bc && tm->tm_year > 0) tm->tm_year = -(tm->tm_year - 1); } + fmask |= DTK_M(YEAR); } else if (tmfc.cc) /* use first year of century */ { @@ -3717,10 +3718,14 @@ do_to_timestamp(text *date_txt, text *fmt, else /* +1 because year == 599 is 600 BC */ tm->tm_year = tmfc.cc * 100 + 1; + fmask |= DTK_M(YEAR); } if (tmfc.j) + { j2date(tmfc.j, &tm->tm_year, &tm->tm_mon, &tm->tm_mday); + fmask |= DTK_DATE_M; + } if (tmfc.ww) { @@ -3734,6 +3739,7 @@ do_to_timestamp(text *date_txt, text *fmt, isoweekdate2date(tmfc.ww, tmfc.d, &tm->tm_year, &tm->tm_mon, &tm->tm_mday); else isoweek2date(tmfc.ww, 
&tm->tm_year, &tm->tm_mon, &tm->tm_mday); + fmask |= DTK_DATE_M; } else tmfc.ddd = (tmfc.ww - 1) * 7 + 1; @@ -3744,11 +3750,17 @@ do_to_timestamp(text *date_txt, text *fmt, if (tmfc.d) tm->tm_wday = tmfc.d - 1; /* convert to native numbering */ if (tmfc.dd) + { tm->tm_mday = tmfc.dd; + fmask |= DTK_M(DAY); + } if (tmfc.ddd) tm->tm_yday = tmfc.ddd; if (tmfc.mm) + { tm->tm_mon = tmfc.mm; + fmask |= DTK_M(MONTH); + } if (tmfc.ddd && (tm->tm_mon <= 1 || tm->tm_mday <= 1)) { @@ -3771,6 +3783,7 @@ do_to_timestamp(text *date_txt, text *fmt, j0 = isoweek2j(tm->tm_year, 1) - 1; j2date(j0 + tmfc.ddd, &tm->tm_year, &tm->tm_mon, &tm->tm_mday); + fmask |= DTK_DATE_M; } else { @@ -3793,9 +3806,24 @@ do_to_timestamp(text *date_txt, text *fmt, if (tm->tm_mday <= 1) tm->tm_mday = tmfc.ddd - y[i - 1]; + + fmask |= DTK_M(MONTH) | DTK_M(DAY); } } + /* Validate date with bit mask received above */ + if (fmask != 0 && date_str) + { + if (ValidateDate(fmask, false, false, false, tm) != 0) + ereport(ERROR, + (errcode(ERRCODE_DATETIME_FIELD_OVERFLOW), + errmsg("date/time field value out of range: \"%s\"", + date_str))); + } + + if (date_str) + pfree(date_str); + #ifdef HAVE_INT64_TIMESTAMP if (tmfc.ms) *fsec += tmfc.ms * 1000; -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Bug in to_timestamp().
Hello,

On 14.07.2016 12:16, Pavel Stehule wrote:

The last point was discussed in the thread related to the to_date_valid function. Regards Pavel

Thank you. Here is my patch. It is a proof of concept.

Date/Time Formatting

There are changes in the date/time formatting rules:

- to_timestamp() and to_date() now skip spaces in the input string and in the formatting string unless the FX option is used, as Amul Sul wrote in the first message of this thread. But Ex.2 gives an error now with this patch (should we fix this too?).

- in the code, space characters and separator characters now have different FormatNode types. Separator characters are ',', '-', '.', '/' and ':'. This is done to allow different formatting rules for space and separator characters. If the FX option isn't used, PostgreSQL does not insist that a separator in the formatting string match the separator in the input string. But the number of separators should be equal with or without the FX option.

- PostgreSQL now checks that there is a closing quote; otherwise an error is raised. PostgreSQL still does not insist that a text character in the formatting string match the text character in the input string. It is not obvious whether this should be fixed, because we may have a different character case, or a character with or without an accent mark. But I suppose it is not right just to check the text character count. For example, there is a Unicode version of the space character, U+00A0.

Code changes

- new defines: #define NODE_TYPE_SEPARATOR 4 #define NODE_TYPE_SPACE 5

- DCH_cache_getnew() is now called after parse_format(), because parse_format() can now raise an error, and on the next attempt DCH_cache_search() could return a broken cache entry.

This patch does not handle all the issues noticed in this thread, since there is still no consensus about them. So this patch is a proof of concept and can be changed. Of course this patch may be completely wrong, but it tries to introduce more formal rules for formatting. 
I will be grateful for notes and remarks. -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 7830334..5656245 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -6083,9 +6083,10 @@ SELECT regexp_matches('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}'); to_timestamp and to_date - skip multiple blank spaces in the input string unless the - FX option is used. For example, - to_timestamp('2000JUN', ' MON') works, but + skip multiple blank spaces in the input string and in the formatting + string unless the FX option is used. For example, + to_timestamp('2000JUN', ' MON') + and to_timestamp('2000 JUN', 'MON') work, but to_timestamp('2000JUN', 'FX MON') returns an error because to_timestamp expects one space only. FX must be specified as the first item in @@ -6121,6 +6122,19 @@ SELECT regexp_matches('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}'); skips two input characters. + + + + In to_date and to_timestamp separator + characters ',', '-', + '.', '/' and ':' + outside of double-quoted strings skip the number of input characters + contained in the string unless the FX option is used. + If FX option is specified then consumed separator + character in the formatting string must match the separator character + in the input string. 
+ + diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c index bbd97dc..ef39df9 100644 --- a/src/backend/utils/adt/formatting.c +++ b/src/backend/utils/adt/formatting.c @@ -169,6 +169,8 @@ struct FormatNode #define NODE_TYPE_END 1 #define NODE_TYPE_ACTION 2 #define NODE_TYPE_CHAR 3 +#define NODE_TYPE_SEPARATOR 4 +#define NODE_TYPE_SPACE 5 #define SUFFTYPE_PREFIX 1 #define SUFFTYPE_POSTFIX 2 @@ -947,6 +949,7 @@ typedef struct NUMProc static const KeyWord *index_seq_search(char *str, const KeyWord *kw, const int *index); static const KeySuffix *suff_search(char *str, const KeySuffix *suf, int type); +static bool is_char_separator(char *str); static void NUMDesc_prepare(NUMDesc *num, FormatNode *n); static void parse_format(FormatNode *node, char *str, const KeyWord *kw, const KeySuffix *suf, const int *index, int ver, NUMDesc *Num); @@ -963,7 +966,6 @@ static void dump_node(FormatNode *node, int max); static const char *get_th(char *num, int type); static char *str_numth(char *dest, char *num, int type); static int adjust_parti
[HACKERS] pg_upgrade: exit_hook_registered variable
Hello hackers, I noticed that the exit_hook_registered variable in start_postmaster() is a local variable. Shouldn't it be a static variable? I attached a patch. Thank you! -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company diff --git a/src/bin/pg_upgrade/server.c b/src/bin/pg_upgrade/server.c index 969e5d6..a13800f 100644 --- a/src/bin/pg_upgrade/server.c +++ b/src/bin/pg_upgrade/server.c @@ -14,6 +14,8 @@ static PGconn *get_db_conn(ClusterInfo *cluster, const char *db_name); +static bool exit_hook_registered = false; + /* * connectToServer() @@ -174,7 +176,6 @@ start_postmaster(ClusterInfo *cluster, bool throw_error) { char cmd[MAXPGPATH * 4 + 1000]; PGconn *conn; - bool exit_hook_registered = false; bool pg_ctl_return = false; char socket_string[MAXPGPATH + 200];
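The problem with the local flag can be illustrated in isolation (toy code, not the real pg_upgrade source): a plain local guard is re-initialized on every call, so the "register once" logic would run every time, whereas a static guard keeps its state. The counters below stand in for atexit() registration.

```c
#include <assert.h>
#include <stdbool.h>

/* Counts registrations over 'ncalls' invocations with a local guard
 * that is reset on each call -- the bug described above. */
int
registrations_with_local_flag(int ncalls)
{
    int total = 0;

    for (int i = 0; i < ncalls; i++)
    {
        bool exit_hook_registered = false;  /* reset every call: the bug */

        if (!exit_hook_registered)
        {
            total++;                        /* would call atexit(...) here */
            exit_hook_registered = true;
        }
    }
    return total;
}

/* Same loop with a static guard: registration happens exactly once. */
int
registrations_with_static_flag(int ncalls)
{
    static bool exit_hook_registered = false;
    int total = 0;

    for (int i = 0; i < ncalls; i++)
    {
        if (!exit_hook_registered)
        {
            total++;
            exit_hook_registered = true;
        }
    }
    return total;
}
```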
Re: [HACKERS] ispell/hunspell imprecision in error message
On 25.07.2016 19:53, Peter Eisentraut wrote: I was trying to look up the background of this error message: "Ispell dictionary supports only \"default\", \"long\", and \"num\" flag value" (src/backend/tsearch/spell.c) But I found that this actually talks about Hunspell format dictionaries. (So the man page is hunspell(5) as opposed to ispell(5).) While as far as the tsearch interfaces are concerned, those two are lumped together, should we not change the error message to refer to Hunspell? Hello, As I understand it, this error message refers to the Ispell dictionary template. This template can use various dictionary formats, which is confusing. Maybe it would be better to change the dictionary template name, but that could break backward compatibility... If we want to change the error message, maybe change it to the following? "Hunspell dictionary format supports only \"default\", \"long\", and \"num\" flag value" -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
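The three flag-mode values named in the error message can be recognized with a trivial lookup; the sketch below is only an illustration of the accepted values (the real parsing lives in src/backend/tsearch/spell.c and reads the FLAG directive of a Hunspell .aff file).

```c
#include <assert.h>
#include <string.h>

/*
 * Toy recognizer for the flag-mode values listed in the error message.
 * Returns 0 for "default", 1 for "long", 2 for "num", -1 otherwise.
 */
int
parse_flag_value(const char *value)
{
    if (strcmp(value, "default") == 0)
        return 0;               /* one-character flags */
    if (strcmp(value, "long") == 0)
        return 1;               /* two-character flags */
    if (strcmp(value, "num") == 0)
        return 2;               /* numeric, comma-separated flags */
    return -1;                  /* anything else is rejected */
}
```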
Re: [HACKERS] Bug in to_timestamp().
On 23.06.2016 21:02, Tom Lane wrote: Robert Haas writes: On Thu, Jun 23, 2016 at 1:40 PM, Tom Lane wrote: At the very least I'd want to see a thought-through proposal that addresses all three of these interrelated points: * what should a space in the format match * what should a non-space, non-format-code character in the format match * how should we handle fields that are not exactly the width suggested by the format I'm not averse to some further study of those issues, and I think the first two are closely related. The third one strikes me as a somewhat separate consideration that doesn't need to be addressed by the same patch. If you think those issues are not interrelated, you have not thought about it carefully enough. As an example, what we can do to handle not-expected-width fields is very different if the format is "DDMMYY" versus if it is "DD-MM-YY". In the first case we have little choice but to believe that each field is exactly two digits wide. In the second case, depending on how we decide to define matching of "-", we might be able to allow the field widths to vary so that they're effectively "whatever is between the dashes". But that would require insisting that "-" match a "-", or at least a non-alphanumeric, which is not how it behaves today. I don't want to twiddle these behaviors in 9.6 and then again next year. regards, tom lane Hi, I want to start work on this patch. As a conclusion: - we need a decision on three questions: * what should a space in the format match * what should a non-space, non-format-code character in the format match * how should we handle fields that are not exactly the width suggested by the format - nobody wants to solve this issue in 9.6. And I have a question: what about invalid input in the date argument? For example, from Alex's message: postgres=# SELECT TO_TIMESTAMP('2016-02-30 15:43:36', 'YYYY-MM-DD HH24:MI:SS'); to_timestamp 2016-03-01 15:43:36+03 (1 row) Here '2016-02-30' is an invalid date. 
I didn't see any conclusion about this case in the thread. -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
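The rollover shown in that example can be reproduced with the C library alone, which illustrates the behavior under discussion: mktime() normalizes out-of-range struct tm fields, so February 30 silently becomes March 1, much like the to_timestamp() result quoted above. (This is only an analogy; PostgreSQL does not use mktime() for this.)

```c
#include <assert.h>
#include <string.h>
#include <time.h>

/* Returns month*100 + day after mktime() normalization. */
int
normalized_month_day(int year, int month, int day)
{
    struct tm tm;

    memset(&tm, 0, sizeof(tm));
    tm.tm_year = year - 1900;
    tm.tm_mon = month - 1;
    tm.tm_mday = day;
    tm.tm_hour = 12;            /* midday avoids DST edge cases */
    tm.tm_isdst = -1;
    (void) mktime(&tm);         /* normalizes the fields in place */
    return (tm.tm_mon + 1) * 100 + tm.tm_mday;
}
```

Feb 30, 2016 normalizes to Mar 1; Feb 29, 2016 is valid because 2016 is a leap year.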
[HACKERS] Patch: fix typo, duplicated word in indexam.sgml
Hello, There is a duplicated word in indexam.sgml. The patch is attached. -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml index b36889b..69edeea 100644 --- a/doc/src/sgml/indexam.sgml +++ b/doc/src/sgml/indexam.sgml @@ -58,7 +58,7 @@ - Index access access methods can be defined and dropped using + Index access methods can be defined and dropped using and SQL commands respectively.
Re: [HACKERS] unexpected result from to_tsvector
On 29.03.2016 19:17, Shulgin, Oleksandr wrote: Hm, indeed. Unfortunately, it is not quite easy to find "the" new RFC, there was quite a number of correcting and extending RFCs issued over the last (almost) 30 years, which is not that surprising... Are we going to do something about it? Is it likely that relaxing/changing the rules on our side will break any possible workarounds that people might have employed to make the search work like they want it to work? Do you mean workarounds to recognize values such as 't...@123-reg.ro' as an email address? Actually, I do not see any workaround other than a patch to PostgreSQL. By the way, Teodor committed the patch yesterday. -- Alex -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Re: [HACKERS] IF (NOT) EXISTS in psql-completion
On 29.03.2016 10:59, Pavel Stehule wrote: Hi 2016-03-29 8:43 GMT+02:00 Kyotaro HORIGUCHI: Thank you Pavel, David. Thank you for pointing out the syntaxes to be addressed. Most of them are addressed in the attached patch. At Tue, 22 Mar 2016 12:57:27 -0400, David Steele wrote in <56f17977.8040...@pgmasters.net> > Hi Kyotaro, > > On 3/18/16 3:22 AM, Pavel Stehule wrote: > > > I am looking at this patch. It looks good, but this feature doesn't > > respect upper or lower case characters. It enforces upper case. This is not > > consistent with any other autocompletion. As mentioned before, the upper/lower case problem is an existing issue. The case of the words in a query result list cannot be edited, since the list may contain words that should not be changed, such as relation names. So we can address it only before issuing a query, but I haven't found a simple way to do it. This is unpleasant. I am sorry, I had a very uncomfortable feeling about this behavior. I think it should be solvable - you only have to convert the keywords IF EXISTS or IF NOT EXISTS. Maybe there is no trivial solution, but this should be fixed. Hello, Can we do something like the attached patch? This patch should be applied after the patch "0001-Suggest-IF-NOT-EXISTS-for-tab-completion-of-psql_v3.patch". 
-- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c index 73c5601..ed4ff09 100644 --- a/src/bin/psql/tab-complete.c +++ b/src/bin/psql/tab-complete.c @@ -153,6 +153,7 @@ do { \ do { \ completion_squery = &(query); \ completion_charp = addon; \ + completion_case_sensitive = false; \ matches = completion_matches(text, complete_from_schema_query); \ } while (0) @@ -3754,7 +3755,17 @@ _complete_from_query(int is_schema_query, const char *text, int state) while (list_index < PQntuples(result) && (item = PQgetvalue(result, list_index++, 0))) if (pg_strncasecmp(text, item, byte_length) == 0) -return pg_strdup(item); + { +if (completion_case_sensitive) + return pg_strdup(item); +else + + /* + * If case insensitive matching was requested initially, + * adjust the case according to setting. + */ + return pg_strdup_keyword_case(item, text); + } } /* If nothing matches, free the db structure and return null */ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
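The idea behind the pg_strdup_keyword_case() call in the patch above can be sketched in a simplified, standalone form: return a copy of the completion keyword whose case follows what the user has already typed. (The real psql function also honors the COMP_KEYWORD_CASE setting; this toy only inspects the first typed character.)

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Simplified sketch: lowercase the keyword if the user typed a
 * lowercase prefix, uppercase it otherwise. */
char *
keyword_case(const char *keyword, const char *typed)
{
    bool lower = (typed[0] != '\0' && islower((unsigned char) typed[0]));
    char *result = malloc(strlen(keyword) + 1);   /* toy: no OOM check */
    char *p;

    strcpy(result, keyword);
    for (p = result; *p; p++)
        *p = (char) (lower ? tolower((unsigned char) *p)
                           : toupper((unsigned char) *p));
    return result;
}
```

So a user who has typed "if" completes to "if not exists", while "IF" completes to "IF NOT EXISTS" — which is the behavior Pavel asked for.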
Re: [HACKERS] [PATCH] Phrase search ported to 9.6
On 25.03.2016 21:42, Dmitry Ivanov wrote: Sorry for the delay, I desperately needed some time to finish a bunch of dangling tasks. I've added some new comments and clarified the ones that were obscure. Moreover, I felt an urge to recheck most parts of the code since apparently nobody (besides myself) has gone so far yet. On 25.03.16 18:42 MSK, David Steele wrote: Time is short and it's not encouraging that you say there is "still much work to be done". Indeed, I was inaccurate. I am more than interested in the positive outcome. Hi, Thank you for your work! I tested the patch and took a look at it. All regression tests passed. The code looks good and the patch introduces great functionality. I think the patch can be marked as "Ready for Committer", but I do not feel confident enough to do that myself. Also, I agree with you about tsvector_setweight(): there is no problem with it, because these weights are immutable, so there is no benefit from a new function. -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Re: [HACKERS] unexpected result from to_tsvector
On 25.03.2016 19:15, David Steele wrote: On 3/25/16 12:14 PM, Artur Zakirov wrote: On 25.03.2016 18:19, David Steele wrote: Hi Artur, On 3/20/16 10:42 AM, Tom Lane wrote: "Shulgin, Oleksandr" writes: On Mar 20, 2016 01:09, "Dmitrii Golub" wrote: Alex, actually a subdomain can start with a digit, Not according to the RFC you have linked to. The powers-that-be relaxed that some time ago; I assume there's a newer RFC. For instance, "163.com" is a real domain: You marked this patch "needs review" and then a few minutes later changed it to "waiting on author". If this was a mistake please change it back to "needs review". If you really are working on a new patch when can we expect that? Thanks, Hi, The previous patch is current and can be committed. I marked this patch as "needs review" because I noticed that it had been marked as "waiting on author", and I thought that I had forgotten to mark it as "needs review". But then I noticed that Robert Haas had marked the patch as "waiting on author" after my answer, so I returned it to "waiting on author". However, I can't find any questions or comments addressed to me after my last answer. Actually, I think this patch should be marked as "needs review". Done. Thank you! -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Re: [HACKERS] unexpected result from to_tsvector
On 25.03.2016 18:19, David Steele wrote: Hi Artur, On 3/20/16 10:42 AM, Tom Lane wrote: "Shulgin, Oleksandr" writes: On Mar 20, 2016 01:09, "Dmitrii Golub" wrote: Alex, actually a subdomain can start with a digit, Not according to the RFC you have linked to. The powers-that-be relaxed that some time ago; I assume there's a newer RFC. For instance, "163.com" is a real domain: You marked this patch "needs review" and then a few minutes later changed it to "waiting on author". If this was a mistake please change it back to "needs review". If you really are working on a new patch when can we expect that? Thanks, Hi, The previous patch is current and can be committed. I marked this patch as "needs review" because I noticed that it had been marked as "waiting on author", and I thought that I had forgotten to mark it as "needs review". But then I noticed that Robert Haas had marked the patch as "waiting on author" after my answer, so I returned it to "waiting on author". However, I can't find any questions or comments addressed to me after my last answer. Actually, I think this patch should be marked as "needs review". -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Re: [HACKERS] Fuzzy substring searching with the pg_trgm extension
I attached the patch, which fixes the pg_trgm documentation. On 19.03.2016 01:18, Artur Zakirov wrote: 2016-03-18 23:46 GMT+03:00 Jeff Janes mailto:jeff.ja...@gmail.com>>: <% and <<-> are not documented at all. Is that a deliberate choice? Since they were added as convenience functions for the user, I think they really need to be in the user documentation. I can send a patch a little bit later. I documented %> and <->> because examples of other operators have the following order: SELECT t, t <-> 'word' AS dist FROM test_trgm ORDER BY dist LIMIT 10; and SELECT * FROM test_trgm WHERE t LIKE '%foo%bar'; I did not include <% and <<-> because I did not know how to document commutators. But I can fix it. And honestly the following order: SELECT 'word' <% t FROM test_trgm; is more convenient to me too. Do you know how do not break the line in the operators table in the first column? Now I see: Operator| Returns |-- text % |boolean text | But the following is better: Operator| Returns |-- text % text |boolean Also, the documentation should probably include <% and <<-> as the "parent" operators and use them in the examples, and only mention %> and <->> in passing, as the commutators. That is because <% and <<-> take their arguments in the same order as word_similarity does. It would be less confusing if the documentation and the examples didn't need to keep changing the argument orders. Cheers, Jeff -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company *** a/doc/src/sgml/pgtrgm.sgml --- b/doc/src/sgml/pgtrgm.sgml *** *** 153,167 !text %> text !boolean ! ! Returns true if its first argument has the similar word in ! the second argument and they have a similarity that is greater than the ! current word similarity threshold set by ! pg_trgm.word_similarity_threshold parameter. ! ! text <-> text real --- 153,174 ! text <% text ! boolean ! 
!Returns true if its first argument has the similar word in !the second argument and they have a similarity that is greater than the !current word similarity threshold set by !pg_trgm.word_similarity_threshold parameter. ! ! ! ! text %> text ! boolean ! !Commutator of the <% operator. ! ! text <-> text real *** *** 171,184 ! ! text <->> text ! !real ! ! Returns the distance between the arguments, that is ! one minus the word_similarity() value. ! --- 178,200 ! !text <<-> text ! ! real ! !Returns the distance between the arguments, that is !one minus the word_similarity() value. ! ! ! ! !text <->> text ! ! real ! !Commutator of the <<-> operator. ! *** *** 215,221 Sets the current word similarity threshold that is used by !the %> operator. The threshold must be between 0 and 1 (default is 0.6). --- 231,237 Sets the current word similarity threshold that is used by !the <% operator. The threshold must be between 0 and 1 (default is 0.6). *** *** 283,289 SELECT t, t <-> 'word' AS dist SELECT t, word_similarity('word', t) AS sml FROM test_trgm ! WHERE t %> 'word' ORDER BY sml DESC, t; This will return all values in the text column that have a word --- 299,305 SELECT t, word_similarity('word', t) AS sml FROM test_trgm ! WHERE 'word' <% t ORDER BY sml DESC, t; This will return all values in the text column that have a word *** *** 295,301 SELECT t, word_similarity('word', t) AS sml A variant of the above query is ! SELECT t, t <->> 'word' AS dist FROM test_trgm ORDER BY dist LIMIT 10; --- 311,317 A variant of the above query is ! SELECT t, 'word' <<-> t AS dist FROM test_trgm ORDER BY dist LIMIT 10; -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Phrase search ported to 9.6
I tried to find some bugs in the code, but could not find any. That does not mean there are no bugs, though. There is still a lack of comments, and there are trailing whitespaces. On 16.03.2016 19:38, Dmitry Ivanov wrote: Hi, Artur I've made an attempt to fix some of the issues you've listed, although there's still much work to be done. I'll add some comments later. This function duplicates a piece of tsvector_setweight() from tsvector_op.c. You can define a new function for it. I'm not sure it's worth the trouble. IMO these functions are relatively small and we won't benefit from extracting the duplicate code. I think that a separate function would be better. These weights occur in other functions as well. But I might be wrong, and maybe I am too picky about the code. These functions were removed in 9acb9007de30b3daaa9efc16763c3bc6e3e0a92d. It seems that tsvector_op.c was not synchronized with master. Got it, thanks! -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Re: [HACKERS] unexpected result from to_tsvector
I found the discussion about allowing an underscore in emails http://www.postgresql.org/message-id/200908281359.n7sdxfaf044...@wwwmaster.postgresql.org That bug report is about recognizing an underscore in the local part of an email address, not in a domain name. But that patch also allows an underscore in recognized host names. I am not well versed in the RFCs, so here is an excerpt from Wikipedia https://en.wikipedia.org/wiki/Email_address: The local-part of the email address may use any of these ASCII characters: Uppercase and lowercase Latin letters (A–Z, a–z) (ASCII: 65–90, 97–122) Digits 0 to 9 (ASCII: 48–57) These special characters: !#$%&'*+-/=?^_`{|}~ (ASCII: 33, 35–39, 42, 43, 45, 47, 61, 63, 94–96, 123–126) Character . (dot, period, full stop), ASCII 46, provided that it is not the first or last character, and provided also that it does not appear consecutively (e.g. john.@example.com is not allowed). Other special characters are allowed with restrictions (they are only allowed inside a quoted string, as described in the paragraph below, and in addition, a backslash or double-quote must be preceded by a backslash). These characters are: Space and "(),:;<>@[\] (ASCII: 32, 34, 40, 41, 44, 58, 59, 60, 62, 64, 91–93) Comments are allowed with parentheses at either end of the local part; e.g. john.smith(comment)@example.com and (comment)john.sm...@example.com are both equivalent to john.sm...@example.com. and https://en.wikipedia.org/wiki/Hostname#Restrictions_on_valid_host_names The Internet standards (Requests for Comments) for protocols mandate that component hostname labels may contain only the ASCII letters 'a' through 'z' (in a case-insensitive manner), the digits '0' through '9', and the hyphen ('-'). The original specification of hostnames in RFC 952 mandated that labels could not start with a digit or with a hyphen, and must not end with a hyphen. 
However, a subsequent specification (RFC 1123) permitted hostname labels to start with digits. No other symbols, punctuation characters, or white space are permitted. Hence the following emails should be valid (I might be wrong): 12...@sample.com 12...@sample.com 1...@123-sample.com 1...@123sample.com The attached patch allows them to be recognized as email addresses. But this patch does not prohibit an underscore in recognized host names. As a result, the patch gives the following results with underscores: =# select * from ts_debug('simple', 'aaa@123_yyy.zzz'); alias | description | token | dictionaries | dictionary | lexemes ---+---+-+--++--- email | Email address | aaa@123_yyy.zzz | {simple} | simple | {aaa@123_yyy.zzz} (1 row) =# select * from ts_debug('simple', '123_yyy.zzz'); alias | description |token| dictionaries | dictionary | lexemes ---+-+-+--++--- host | Host | 123_yyy.zzz | {simple} | simple | {123_yyy.zzz} (1 row) On 14.03.2016 17:45, Artur Zakirov wrote: On 14.03.2016 16:22, Shulgin, Oleksandr wrote: Hm... 
now that doesn't look all that consistent to me (after applying the patch): =# select ts_debug('simple', 'a...@123-yyy.zzz'); ts_debug --- (email,"Email address",a...@123-yyy.zzz,{simple},simple,{a...@123-yyy.zzz}) (1 row) But: =# select ts_debug('simple', 'aaa@123_yyy.zzz'); ts_debug - (asciiword,"Word, all ASCII",aaa,{simple},simple,{aaa}) (blank,"Space symbols",@,{},,) (uint,"Unsigned integer",123,{simple},simple,{123}) (blank,"Space symbols",_,{},,) (host,Host,yyy.zzz,{simple},simple,{yyy.zzz}) (5 rows) One can also see that if we only keep the domain name, the result is similar: =# select ts_debug('simple', '123-yyy.zzz'); ts_debug --- (host,Host,123-yyy.zzz,{simple},simple,{123-yyy.zzz}) (1 row) =# select ts_debug('simple', '123_yyy.zzz'); ts_debug - (uint,"Unsigned integer",123,{simple},simple,{123}) (blank,"Space symbols",_,{},,) (host,Host,yyy.zzz,{simple},simple,{yyy.zzz}) (3 rows) But, this only has to do with 123 being recognized as a number, not with the underscore: =# select ts_debug('simple', 'abc_yyy.zzz'); ts_debug --- (host,Host,abc_yyy.zzz,{simple},simple,{abc_yyy.zzz}
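The hostname-label rules quoted above (RFC 952 plus the RFC 1123 relaxation) are simple enough to check directly; the sketch below is an illustration of those rules, not PostgreSQL's tokenizer. Only ASCII letters, digits and hyphens are allowed, a label must not begin or end with a hyphen, a leading digit is permitted by RFC 1123, and an underscore is valid under neither RFC.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Checks one hostname label against the RFC 952/1123 rules. */
bool
is_valid_label(const char *label)
{
    size_t len = strlen(label);
    size_t i;

    if (len == 0 || len > 63)   /* labels are limited to 63 octets */
        return false;
    if (label[0] == '-' || label[len - 1] == '-')
        return false;
    for (i = 0; i < len; i++)
    {
        unsigned char c = (unsigned char) label[i];

        if (!isalnum(c) && c != '-')
            return false;
    }
    return true;
}
```

Under these rules "163" and "123-reg" are valid labels, while "123_yyy" is not — which is the crux of the disagreement about 123_yyy.zzz above.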
Re: [HACKERS] Proposal: Generic WAL logical messages
On 16.03.2016 18:56, David Steele wrote: This patch applies cleanly and is ready for review with no outstanding issues that I can see. Simon and Artur, you are both signed up as reviewers. Care to take a crack at it? Thanks, I have tested the patch once again and have looked at the code. It looks good to me; I have no observations. The patch works correctly. I have tested it with a plugin that writes DDL queries to the WAL using a hook and replicates these queries on subscribers. If Simon does not object, the patch can be marked as "Ready for Committer". -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Re: [HACKERS] Fuzzy substring searching with the pg_trgm extension
2016-03-18 23:46 GMT+03:00 Jeff Janes : > > > <% and <<-> are not documented at all. Is that a deliberate choice? > Since they were added as convenience functions for the user, I think > they really need to be in the user documentation. > I can send a patch a little bit later. I documented %> and <->> because the examples of the other operators have the following order: SELECT t, t <-> 'word' AS dist FROM test_trgm ORDER BY dist LIMIT 10; and SELECT * FROM test_trgm WHERE t LIKE '%foo%bar'; I did not include <% and <<-> because I did not know how to document commutators. But I can fix it. And honestly, the following order: SELECT 'word' <% t FROM test_trgm; is more convenient to me too. Do you know how to prevent the line break in the first column of the operators table? Now I see: Operator| Returns |-- text % |boolean text | But the following is better: Operator| Returns |-- text % text |boolean > > Also, the documentation should probably include <% and <<-> as the "parent" operators and use them in the examples, and only mention %> and <->> in passing, as the commutators. That is because <% and <<-> take their arguments in the same order as word_similarity does. It would be less confusing if the documentation and the examples didn't need to keep changing the argument orders. > > Cheers, > > Jeff > -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Re: [HACKERS] Fuzzy substring searching with the pg_trgm extension
On 15.03.2016 17:28, David Steele wrote: On 3/14/16 12:27 PM, Artur Zakirov wrote: On 14.03.2016 18:48, David Steele wrote: Hi Jeff, On 2/25/16 5:00 PM, Jeff Janes wrote: But, It doesn't sound like I am going to win that debate. Given that, I don't think we need a different name for the function. I'm fine with explaining the word-boundary subtlety in the documentation, and keeping the function name itself simple. It's not clear to me if you are requesting more documentation here or stating that you are happy with it as-is. Care to elaborate? Other than that I think this patch looks to be ready for committer. Any objections? There were some comments about the word-boundary subtlety, but I think the explanation was not sufficient. I rephrased the explanation of word_similarity() and %>. It is better now, but if it is not correct I can change the explanation. Since the only change in the latest patch is to the documentation, I have marked this "ready for committer". Thank you! -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Re: [HACKERS] Fuzzy substring searching with the pg_trgm extension
On 14.03.2016 18:48, David Steele wrote: Hi Jeff, On 2/25/16 5:00 PM, Jeff Janes wrote: But, It doesn't sound like I am going to win that debate. Given that, I don't think we need a different name for the function. I'm fine with explaining the word-boundary subtlety in the documentation, and keeping the function name itself simple. It's not clear to me if you are requesting more documentation here or stating that you are happy with it as-is. Care to elaborate? Other than that I think this patch looks to be ready for committer. Any objections? There was some comments about the word-boundary subtlety. But I think it was not enough. I rephrased the explanation of word_similarity() and %>. It is better now. But if it is not correct I can change the explanation. -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company *** a/contrib/pg_trgm/pg_trgm--1.2.sql --- b/contrib/pg_trgm/pg_trgm--1.2.sql *** *** 3,13 --- 3,15 -- complain if script is sourced in psql, rather than via CREATE EXTENSION \echo Use "CREATE EXTENSION pg_trgm" to load this file. \quit + -- Deprecated function CREATE FUNCTION set_limit(float4) RETURNS float4 AS 'MODULE_PATHNAME' LANGUAGE C STRICT VOLATILE; + -- Deprecated function CREATE FUNCTION show_limit() RETURNS float4 AS 'MODULE_PATHNAME' *** *** 26,32 LANGUAGE C STRICT IMMUTABLE; CREATE FUNCTION similarity_op(text,text) RETURNS bool AS 'MODULE_PATHNAME' ! LANGUAGE C STRICT STABLE; -- stable because depends on trgm_limit CREATE OPERATOR % ( LEFTARG = text, --- 28,34 CREATE FUNCTION similarity_op(text,text) RETURNS bool AS 'MODULE_PATHNAME' ! LANGUAGE C STRICT STABLE; -- stable because depends on pg_trgm.similarity_threshold CREATE OPERATOR % ( LEFTARG = text, *** a/contrib/pg_trgm/trgm.h --- b/contrib/pg_trgm/trgm.h *** *** 105,111 typedef char *BITVECP; typedef struct TrgmPackedGraph TrgmPackedGraph; ! 
extern float4 trgm_limit; extern uint32 trgm2int(trgm *ptr); extern void compact_trigram(trgm *tptr, char *str, int bytelen); --- 105,111 typedef struct TrgmPackedGraph TrgmPackedGraph; ! extern double similarity_threshold; extern uint32 trgm2int(trgm *ptr); extern void compact_trigram(trgm *tptr, char *str, int bytelen); *** a/contrib/pg_trgm/trgm_gin.c --- b/contrib/pg_trgm/trgm_gin.c *** *** 206,212 gin_trgm_consistent(PG_FUNCTION_ARGS) * similarity is just c / len1. * So, independly on DIVUNION the upper bound formula is the same. */ ! res = (nkeys == 0) ? false : ((float4) ntrue) / ((float4) nkeys))) >= trgm_limit) ? true : false); break; case ILikeStrategyNumber: #ifndef IGNORECASE --- 206,214 * similarity is just c / len1. * So, independly on DIVUNION the upper bound formula is the same. */ ! res = (nkeys == 0) ? false : ! ((float4) ntrue) / ((float4) nkeys))) >= similarity_threshold) ! ? true : false); break; case ILikeStrategyNumber: #ifndef IGNORECASE *** *** 283,289 gin_trgm_triconsistent(PG_FUNCTION_ARGS) /* * See comment in gin_trgm_consistent() about * upper bound formula */ ! res = (nkeys == 0) ? GIN_FALSE : (float4) ntrue) / ((float4) nkeys)) >= trgm_limit) ? GIN_MAYBE : GIN_FALSE); break; case ILikeStrategyNumber: #ifndef IGNORECASE --- 285,293 /* * See comment in gin_trgm_consistent() about * upper bound formula */ ! res = (nkeys == 0) ? GIN_FALSE : ! (float4) ntrue) / ((float4) nkeys)) >= similarity_threshold) ! ? GIN_MAYBE : GIN_FALSE); break; case ILikeStrategyNumber: #ifndef IGNORECASE *** a/contrib/pg_trgm/trgm_gist.c --- b/contrib/pg_trgm/trgm_gist.c *** *** 294,300 gtrgm_consistent(PG_FUNCTION_ARGS) float4 tmpsml = cnt_sml(key, qtrg); /* strange bug at freebsd 5.2.1 and gcc 3.3.3 */ ! res = (*(int *) &tmpsml == *(int *) &trgm_limit || tmpsml > trgm_limit) ? true : false; } else if (ISALLTRUE(key)) { /* non-leaf contains signature */ --- 294,301 float4 tmpsml = cnt_sml(key, qtrg); /* strange bug at freebsd 5.2.1 and gcc 3.3.3 */ ! 
res = (*(int *) &tmpsml == *(int *) &similarity_threshold ! || tmpsml > similarity_threshold) ? true : false; } else if (ISALLTRUE(key)) { /* non-leaf contains signature */ *** *** 308,314 gtrgm_consistent(PG_FUNCTION_ARGS) if (len == 0) res = false; else ! res = (float8) count) / ((float8) len))) >= trgm_limit) ? true : false; } break; case ILikeStrategyNumber: --- 309,316 if (len == 0) res = false; else ! res = (f
Re: [HACKERS] plpgsql - DECLARE - cannot to use %TYPE or %ROWTYPE for composite types
On 14.03.2016 17:54, Tom Lane wrote: Joe Conway writes: This new version of the patch was posted after the commitfest item was marked ready for committer. Does anyone have further comments or objections to the concept or syntax before I try to take this forward? The quoted excerpt fails to say what solution was adopted to the array syntax issues, so it's impossible to have an opinion without actually reading the patch. However ... one thing I was intending to mention on this thread is that "get the array type over this type" isn't the only extension one might wish for. Another likely desire is "get the type of field 'foo' of this composite type". I don't suggest that this patch needs to implement that right now; but it would be a good thing if we could see how the chosen syntax could be extended in such a direction. Otherwise we might be painting ourselves into a corner. regards, tom lane I looked at this patch and the previous one. The patch applies correctly to HEAD. Regression tests pass successfully, without errors. In comparison with the previous patch it adds the following functionality: - %TYPE - can now be used for composite types (same syntax). - %TYPE[] - new functionality, provides the array type from a variable or table column (syntax was changed). - var ELEMENT OF othervar%TYPE - new functionality, provides the element type of a given array (syntax was changed). plpgsql_derive_type() was changed. 
Now it has the following definition: PLpgSQL_type * plpgsql_derive_type(PLpgSQL_type *base_type, bool to_element_type, bool to_array_type) Previously it had the following definition: static PLpgSQL_type * derive_type(PLpgSQL_type *base_type, PLpgSQL_reftype reftype) where PLpgSQL_reftype was the enum: + typedef enum + { + PLPGSQL_REFTYPE_TYPE, /* use type of some variable */ + PLPGSQL_REFTYPE_ELEMENT,/* use a element type of referenced variable */ + PLPGSQL_REFTYPE_ARRAY /* use a array type of referenced variable */ + } PLpgSQL_reftype; I think the previous version was better, because an enum is better than additional boolean parameters. But that is just my opinion. Also, there is a little typo here: + * This routine is used for generating element or array type from base type. + * The options to_element_type and to_array_type can be used together, when + * we would to ensure valid result. The array array type is original type, so + * this direction is safe. The element of scalar type is not allowed, but if + * we do "to array" transformation first, then this direction should be safe + * too. This design is tolerant, because we should to support a design of + * polymorphic parameters, where a array value can be passed as anyelement + * or anyarray parameter. + */ + PLpgSQL_type * + plpgsql_derive_type(PLpgSQL_type *base_type, Here the word "array" occurs twice in the third line.
Re: [HACKERS] unexpected result from to_tsvector
On 14.03.2016 16:22, Shulgin, Oleksandr wrote: Hm... now that doesn't look all that consistent to me (after applying the patch): =# select ts_debug('simple', 'a...@123-yyy.zzz'); ts_debug --- (email,"Email address",a...@123-yyy.zzz,{simple},simple,{a...@123-yyy.zzz}) (1 row) But: =# select ts_debug('simple', 'aaa@123_yyy.zzz'); ts_debug - (asciiword,"Word, all ASCII",aaa,{simple},simple,{aaa}) (blank,"Space symbols",@,{},,) (uint,"Unsigned integer",123,{simple},simple,{123}) (blank,"Space symbols",_,{},,) (host,Host,yyy.zzz,{simple},simple,{yyy.zzz}) (5 rows) One can also see that if we only keep the domain name, the result is similar: =# select ts_debug('simple', '123-yyy.zzz'); ts_debug --- (host,Host,123-yyy.zzz,{simple},simple,{123-yyy.zzz}) (1 row) =# select ts_debug('simple', '123_yyy.zzz'); ts_debug - (uint,"Unsigned integer",123,{simple},simple,{123}) (blank,"Space symbols",_,{},,) (host,Host,yyy.zzz,{simple},simple,{yyy.zzz}) (3 rows) But, this only has to do with 123 being recognized as a number, not with the underscore: =# select ts_debug('simple', 'abc_yyy.zzz'); ts_debug --- (host,Host,abc_yyy.zzz,{simple},simple,{abc_yyy.zzz}) (1 row) =# select ts_debug('simple', '1abc_yyy.zzz'); ts_debug --- (host,Host,1abc_yyy.zzz,{simple},simple,{1abc_yyy.zzz}) (1 row) In fact, the 123-yyy.zzz domain is not valid either according to the RFC (subdomain can't start with a digit), but since we already allow it, should we not allow 123_yyy.zzz to be recognized as a Host? Then why not recognize aaa@123_yyy.zzz as an email address? Another option is to prohibit underscore in recognized host names, but this has more breakage potential IMO. -- Alex It seems reasonable to me. I like more first option. But I am not confident that we should allow 123_yyy.zzz to be recognized as a Host. By the way, in this question http://webmasters.stackexchange.com/a/775 you can see examples of domain names with numbers (but not subdomains). 
If there are no objections from others, I will send a new patch later today or tomorrow that recognizes 123_yyy.zzz. -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: Generic WAL logical messages
I think here

+const char *
+logicalmsg_identify(uint8 info)
+{
+	if (info & ~XLR_INFO_MASK == XLOG_LOGICAL_MESSAGE)
+		return "MESSAGE";
+
+	return NULL;
+}

we should use parentheses:

const char *
logicalmsg_identify(uint8 info)
{
	if ((info & ~XLR_INFO_MASK) == XLOG_LOGICAL_MESSAGE)
		return "MESSAGE";

	return NULL;
}

Because of C operator precedence (http://en.cppreference.com/w/c/language/operator_precedence), == binds more tightly than &, so the unparenthesized condition is evaluated as info & ~(XLR_INFO_MASK == XLOG_LOGICAL_MESSAGE), which is not what is intended.

On 01.03.2016 00:10, Petr Jelinek wrote: Hi, attached is the newest version of the patch. I removed the registry, renamed the 'send' to 'emit', documented the callback parameters properly. I also added the test to ddl.sql for the serialization and deserialization (and of course found a bug there) and in general fixed all the stuff Andres reported. (see more inline) -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] unexpected result from to_tsvector
Hello, On 07.03.2016 23:55, Dmitrii Golub wrote: Hello, Should we added tests for this case? I think we should. I have added tests for teo...@123-stack.net and 1...@stack.net emails. 123_reg.ro <http://123_reg.ro> is not valid domain name, bacause of symbol "_" https://tools.ietf.org/html/rfc1035 page 8. Dmitrii Golub Thank you for the information. Fixed. -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company *** a/src/backend/tsearch/wparser_def.c --- b/src/backend/tsearch/wparser_def.c *** *** 1121,1126 static const TParserStateActionItem actionTPS_InUnsignedInt[] = { --- 1121,1128 {p_iseqC, '.', A_PUSH, TPS_InUDecimalFirst, 0, NULL}, {p_iseqC, 'e', A_PUSH, TPS_InMantissaFirst, 0, NULL}, {p_iseqC, 'E', A_PUSH, TPS_InMantissaFirst, 0, NULL}, + {p_iseqC, '-', A_PUSH, TPS_InHostFirstAN, 0, NULL}, + {p_iseqC, '@', A_PUSH, TPS_InEmail, 0, NULL}, {p_isasclet, 0, A_PUSH, TPS_InHost, 0, NULL}, {p_isalpha, 0, A_NEXT, TPS_InNumWord, 0, NULL}, {p_isspecial, 0, A_NEXT, TPS_InNumWord, 0, NULL}, *** a/src/test/regress/expected/tsearch.out --- b/src/test/regress/expected/tsearch.out *** *** 264,270 SELECT * FROM ts_token_type('default'); 23 | entity | XML entity (23 rows) ! SELECT * FROM ts_parse('default', '345 qwe@efd.r '' http://www.com/ http://aew.werc.ewr/?ad=qwe&dw 1aew.werc.ewr/?ad=qwe&dw 2aew.werc.ewr http://3aew.werc.ewr/?ad=qwe&dw http://4aew.werc.ewr http://5aew.werc.ewr:8100/? ad=qwe&dw 6aew.werc.ewr:8100/?ad=qwe&dw 7aew.werc.ewr:8100/?ad=qwe&dw=%20%32 +4.0e-10 qwe qwe qwqwe 234.435 455 5.005 teo...@stack.net qwe-wer asdf qwer jf sdjk ewr1> ewri2 /usr/local/fff /awdf/dwqe/4325 rewt/ewr wefjn /wqe-324/ewr gist.h gist.h.c gist.c. readline 4.2 4.2. 4.2, readline-4.2 readline-4.2. 234 wow < jqw <> qwerty'); tokid |token --- 264,270 23 | entity | XML entity (23 rows) ! 
SELECT * FROM ts_parse('default', '345 qwe@efd.r '' http://www.com/ http://aew.werc.ewr/?ad=qwe&dw 1aew.werc.ewr/?ad=qwe&dw 2aew.werc.ewr http://3aew.werc.ewr/?ad=qwe&dw http://4aew.werc.ewr http://5aew.werc.ewr:8100/? ad=qwe&dw 6aew.werc.ewr:8100/?ad=qwe&dw 7aew.werc.ewr:8100/?ad=qwe&dw=%20%32 +4.0e-10 qwe qwe qwqwe 234.435 455 5.005 teo...@stack.net teo...@123-stack.net 1...@stack.net qwe-wer asdf qwer jf sdjk ewr1> ewri2 /usr/local/fff /awdf/dwqe/4325 rewt/ewr wefjn /wqe-324/ewr gist.h gist.h.c gist.c. readline 4.2 4.2. 4.2, readline-4.2 readline-4.2. 234 wow < jqw <> qwerty'); tokid |token *** *** 332,337 SELECT * FROM ts_parse('default', '345 qwe@efd.r '' http://www.com/ http://aew.w --- 332,341 12 | 4 | teo...@stack.net 12 | + 4 | teo...@123-stack.net + 12 | + 4 | 1...@stack.net + 12 | 16 | qwe-wer 11 | qwe 12 | - *** *** 404,425 SELECT * FROM ts_parse('default', '345 qwe@efd.r '' http://www.com/ http://aew.w 12 | 12 | <> 1 | qwerty ! (133 rows) ! SELECT to_tsvector('english', '345 qwe@efd.r '' http://www.com/ http://aew.werc.ewr/?ad=qwe&dw 1aew.werc.ewr/?ad=qwe&dw 2aew.werc.ewr http://3aew.werc.ewr/?ad=qwe&dw http://4aew.werc.ewr http://5aew.werc.ewr:8100/? ad=qwe&dw 6aew.werc.ewr:8100/?ad=qwe&dw 7aew.werc.ewr:8100/?ad=qwe&dw=%20%32 +4.0e-10 qwe qwe qwqwe 234.435 455 5.005 teo...@stack.net qwe-wer asdf qwer jf sdjk ewr1> ewri2 /usr/local/fff /awdf/dwqe/4325 rewt/ewr wefjn /wqe-324/ewr gist.h gist.h.c gist.c. readline 4.2 4.2. 4.2, readline-4.2 readline-4.2. 234 wow < jqw <> qwerty'); !
Re: [HACKERS] Confusing with commit time usage in logical decoding
Hello, Andres, You introduced the large replication progress tracking infrastructure last year, and there is a problem with it, described at the link in the quote below. The attached patch fixes this issue. Is the patch correct? I would be grateful if it is and if it can be committed. Thanks.

On 29.02.2016 14:18, Artur Zakirov wrote: Hello, I read this message http://www.postgresql.org/message-id/56d4197e.9050...@informatik.uni-kl.de Is this a bug or a typo? In DecodeCommit() in decode.c instead of:

if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
{
	origin_lsn = parsed->origin_lsn;
	commit_time = parsed->origin_timestamp;
}

should be:

if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
{
	origin_lsn = parsed->origin_lsn;
	commit_time = parsed->origin_timestamp;
}
else
	commit_time = parsed->xact_time;

-- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company

*** a/src/backend/replication/logical/decode.c
--- b/src/backend/replication/logical/decode.c
*** *** 458,463 DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf, --- 458,465
  		origin_lsn = parsed->origin_lsn;
  		commit_time = parsed->origin_timestamp;
  	}
+ 	else
+ 		commit_time = parsed->xact_time;

  	/*
  	 * Process invalidation messages, even if we're not interested in the

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Phrase search ported to 9.6
on has the duplicated piece from the tsvector_setweight() from tsvector_op.c. You can define a new function for it.

+/*
+ * Check if datatype is the specified type or equivalent to it.
+ *
+ * Note: we could just do getBaseType() unconditionally, but since that's
+ * a relatively expensive catalog lookup that most users won't need, we
+ * try the straight comparison first.
+ */
+static bool
+is_expected_type(Oid typid, Oid expected_type)
+{
+	if (typid == expected_type)
+		return true;
+	typid = getBaseType(typid);
+	if (typid == expected_type)
+		return true;
+	return false;
+}
+
+/* Check if datatype is TEXT or binary-equivalent to it */
+static bool
+is_text_type(Oid typid)
+{
+	/* varchar(n) and char(n) are binary-compatible with text */
+	if (typid == TEXTOID || typid == VARCHAROID || typid == BPCHAROID)
+		return true;
+	/* Allow domains over these types, too */
+	typid = getBaseType(typid);
+	if (typid == TEXTOID || typid == VARCHAROID || typid == BPCHAROID)
+		return true;
+	return false;
+}

These functions were removed in 9acb9007de30b3daaa9efc16763c3bc6e3e0a92d. It seems that tsvector_op.c was not synchronized with the master.

Conclusion -- This patch is large and needs more research. I will continue reviewing it and will post further notes later. -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Confusing with commit time usage in logical decoding
Hello, I read this message http://www.postgresql.org/message-id/56d4197e.9050...@informatik.uni-kl.de Is this a bug or a typo? In DecodeCommit() in decode.c instead of: if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN) { origin_lsn = parsed->origin_lsn; commit_time = parsed->origin_timestamp; } should be: if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN) { origin_lsn = parsed->origin_lsn; commit_time = parsed->origin_timestamp; } else commit_time = parsed->xact_time; -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: Generic WAL logical messages
Hello, On 27.02.2016 03:05, Andres Freund wrote: Hi, I'm not really convinced by RegisterStandbyMsgPrefix() et al. There's not much documentation about what it actually is supposed to accomplish. Afaics you're basically forced to use shared_preload_libraries with it right now? Also, iterating through a linked list every time something is logged doesn't seem very satisfying?

I have done some tests with a simple plugin, using event triggers to send messages. It works, but I agree with Andres: we have problems if the plugin is not loaded. For example, if you execute the query:

SELECT 'msg2' FROM pg_logical_send_message(false, 'test', 'msg2');

you will get the following error (if the plugin that should register the prefix is not loaded yet):

ERROR: standby message prefix "test" is not registered

A stylistic note (logical.c):

+static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+			   XLogRecPtr message_lsn,
+			   bool transactional, const char *prefix,
+			   Size sz, const char *message)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;

Following the usual PostgreSQL style, it should be written with the return type on its own line:

static void
message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
				   XLogRecPtr message_lsn,
				   bool transactional, const char *prefix,
				   Size sz, const char *message)
{
	LogicalDecodingContext *ctx = cache->private_data;
	LogicalErrorCallbackState state;
	ErrorContextCallback errcallback;

-- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] plpgsql - DECLARE - cannot to use %TYPE or %ROWTYPE for composite types
On 21.02.2016 11:31, Pavel Stehule wrote: Hi I am sending updated version - the changes are related to fix comments.

Great. I am new to reviewing, but I think Pavel took all the comments into account. The patch compiles and the regression tests pass. So I changed its status to "Ready for Committer".

By the way, these functions are misnamed after this patch. They are called "wordtype" and "cwordtype" originally because they accept "word%TYPE" and "compositeword%TYPE", but after the patch they not only accept TYPE at the right of the percent sign but also ELEMENTTYPE and ARRAYTYPE. Not sure that this is something we want to be too strict about. Understand - used name ***reftype instead type

I am not sure, but it seems to me that the new names are a little worse. I think the original names were good too: they accept a word and return the PLpgSQL_type structure.

The "TYPE" word in this name was related to the %TYPE syntax. And because the new syntax allows more constructs, the name change is correct, I think. But choosing names is hard work. The new name shows a little more strongly its relation to working with referenced types.

Agree. -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] unexpected result from to_tsvector
Hello, Here is a little patch. It fixes this issue http://www.postgresql.org/message-id/20160217080048.26357.49...@wrigleys.postgresql.org

Without the patch we get a wrong result for the second email 't...@123-reg.ro':

=> SELECT * FROM ts_debug('simple', 't...@vauban-reg.ro');
 alias |  description  |       token        | dictionaries | dictionary |       lexemes
-------+---------------+--------------------+--------------+------------+----------------------
 email | Email address | t...@vauban-reg.ro | {simple}     | simple     | {t...@vauban-reg.ro}
(1 row)

=> SELECT * FROM ts_debug('simple', 't...@123-reg.ro');
   alias   |   description    | token  | dictionaries | dictionary | lexemes
-----------+------------------+--------+--------------+------------+----------
 asciiword | Word, all ASCII  | test   | {simple}     | simple     | {test}
 blank     | Space symbols    | @      | {}           |            |
 uint      | Unsigned integer | 123    | {simple}     | simple     | {123}
 blank     | Space symbols    | -      | {}           |            |
 host      | Host             | reg.ro | {simple}     | simple     | {reg.ro}
(5 rows)

After the patch we get the correct result for the second email:

=> SELECT * FROM ts_debug('simple', 't...@123-reg.ro');
 alias |  description  |      token      | dictionaries | dictionary |      lexemes
-------+---------------+-----------------+--------------+------------+-------------------
 email | Email address | t...@123-reg.ro | {simple}     | simple     | {t...@123-reg.ro}
(1 row)

This patch allows the parser to handle the emails 't...@123-reg.ro', '1...@123-reg.ro' and 'test@123_reg.ro' correctly.
-- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company *** a/src/backend/tsearch/wparser_def.c --- b/src/backend/tsearch/wparser_def.c *** *** 1121,1126 static const TParserStateActionItem actionTPS_InUnsignedInt[] = { --- 1121,1129 {p_iseqC, '.', A_PUSH, TPS_InUDecimalFirst, 0, NULL}, {p_iseqC, 'e', A_PUSH, TPS_InMantissaFirst, 0, NULL}, {p_iseqC, 'E', A_PUSH, TPS_InMantissaFirst, 0, NULL}, + {p_iseqC, '-', A_PUSH, TPS_InHostFirstAN, 0, NULL}, + {p_iseqC, '_', A_PUSH, TPS_InHostFirstAN, 0, NULL}, + {p_iseqC, '@', A_PUSH, TPS_InEmail, 0, NULL}, {p_isasclet, 0, A_PUSH, TPS_InHost, 0, NULL}, {p_isalpha, 0, A_NEXT, TPS_InNumWord, 0, NULL}, {p_isspecial, 0, A_NEXT, TPS_InNumWord, 0, NULL}, -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: Generic WAL logical messages
On 23.01.2016 01:22, Petr Jelinek wrote: Hi, here is updated version of this patch, calling the messages logical (decoding) messages consistently everywhere and removing any connection to standby messages. Moving this to it's own module gave me place to write some brief explanation about this so the code documentation has hopefully improved as well. The functionality itself didn't change.

Hello, It seems that you forgot the regression tests for test_decoding. There is an entry in test_decoding/Makefile, but the files sql/messages.sql and expected/messages.out are missing, although they were included in the first version of the patch. -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] plpgsql - DECLARE - cannot to use %TYPE or %ROWTYPE for composite types
It seems all fixes are done. I tested the patch and regression tests passed. On 27.01.2016 20:58, Pavel Stehule wrote: > --- 1681,1687 >* -- >*/ > PLpgSQL_type * > ! plpgsql_parse_wordtype(char *ident, int reftype_mode) > { > PLpgSQL_type *dtype; > PLpgSQL_nsitem *nse; Use the typedef'ed enum, as above. > --- 1699,1721 > switch (nse->itemtype) > { > case PLPGSQL_NSTYPE_VAR: > ! { > ! dtype = ((PLpgSQL_var *) (plpgsql_Datums[nse->itemno]))->datatype; > ! return derive_type(dtype, reftype_mode); > ! } > > ! case PLPGSQL_NSTYPE_ROW: > ! { > ! dtype = ((PLpgSQL_row *) (plpgsql_Datums[nse->itemno]))->datatype; > ! return derive_type(dtype, reftype_mode); > ! } > > + /* > + * XXX perhaps allow REC here? Probably it has not any sense, because > + * in this moment, because PLpgSQL doesn't support rec parameters, so > + * there should not be any rec polymorphic parameter, and any work can > + * be done inside function. > + */ I think you should remove from the "?" onwards in that comment, i.e. just keep what was already in the original comment (minus the ROW) I tried to fix it, not sure if understood well. I think here Alvaro means that you should keep original comment without the ROW. Like this: /* XXX perhaps allow REC here? */ > *** extern bool plpgsql_parse_dblword(char * > *** 961,968 > PLwdatum *wdatum, PLcword *cword); > extern bool plpgsql_parse_tripword(char *word1, char *word2, char *word3, > PLwdatum *wdatum, PLcword *cword); > ! extern PLpgSQL_type *plpgsql_parse_wordtype(char *ident); > ! extern PLpgSQL_type *plpgsql_parse_cwordtype(List *idents); > extern PLpgSQL_type *plpgsql_parse_wordrowtype(char *ident); > extern PLpgSQL_type *plpgsql_parse_cwordrowtype(List *idents); > extern PLpgSQL_type *plpgsql_build_datatype(Oid typeOid, int32 typmod, > --- 973,980 > PLwdatum *wdatum, PLcword *cword); > extern bool plpgsql_parse_tripword(char *word1, char *word2, char *word3, > PLwdatum *wdatum, PLcword *cword); > ! 
extern PLpgSQL_type *plpgsql_parse_wordtype(char *ident, int reftype_mode); > ! extern PLpgSQL_type *plpgsql_parse_cwordtype(List *idents, int reftype_mode); > extern PLpgSQL_type *plpgsql_parse_wordrowtype(char *ident); > extern PLpgSQL_type *plpgsql_parse_cwordrowtype(List *idents); > extern PLpgSQL_type *plpgsql_build_datatype(Oid typeOid, int32 typmod, By the way, these functions are misnamed after this patch. They are called "wordtype" and "cwordtype" originally because they accept "word%TYPE" and "compositeword%TYPE", but after the patch they not only accept TYPE at the right of the percent sign but also ELEMENTTYPE and ARRAYTYPE. Not sure that this is something we want to be too strict about. Understand - used name ***reftype instead type I am not sure, but it seems that new names is a little worse. I think original names are good too. They accept a word and return the PLpgSQL_type structure. Thank you for your comment Attached updated patch Regards Pavel -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services I noticed a little typo in the comment in the derive_type(): /* Return base_type, when it is a array already */ should be: /* Return base_type, when it is an array already */ -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fuzzy substring searching with the pg_trgm extension
On 12.02.2016 20:56, Teodor Sigaev wrote: On Thu, Feb 11, 2016 at 9:56 AM, Teodor Sigaev wrote: 1 - sml_limit to similarity_limit. sml_threshold is difficult to write I think, similarity_limit is more simple. It seems to me that threshold is right word by meaning. sml_threshold is my choice. Why abbreviate it like that? Nobody's going to know that "sml" stands for "similarity" without consulting the documentation, and that sucks. Ok, I don't have an objections. I worked a lot on various similarity modules and sml becomes usual for me. That's why I was asking. Hi! I attached new version of the patch. It fixes names of GUC variables and functions. Now the patch introduces: 1 - functions: - word_similarity() 2 - operators: - %> (and <%) - <->> (and <<->) 3 - GUC variables: - pg_trgm.similarity_threshold - pg_trgm.word_similarity_threshold -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company *** a/contrib/pg_trgm/pg_trgm--1.2.sql --- b/contrib/pg_trgm/pg_trgm--1.2.sql *** *** 3,13 --- 3,15 -- complain if script is sourced in psql, rather than via CREATE EXTENSION \echo Use "CREATE EXTENSION pg_trgm" to load this file. \quit + -- Deprecated function CREATE FUNCTION set_limit(float4) RETURNS float4 AS 'MODULE_PATHNAME' LANGUAGE C STRICT VOLATILE; + -- Deprecated function CREATE FUNCTION show_limit() RETURNS float4 AS 'MODULE_PATHNAME' *** *** 26,32 LANGUAGE C STRICT IMMUTABLE; CREATE FUNCTION similarity_op(text,text) RETURNS bool AS 'MODULE_PATHNAME' ! LANGUAGE C STRICT STABLE; -- stable because depends on trgm_limit CREATE OPERATOR % ( LEFTARG = text, --- 28,34 CREATE FUNCTION similarity_op(text,text) RETURNS bool AS 'MODULE_PATHNAME' ! LANGUAGE C STRICT STABLE; -- stable because depends on pg_trgm.similarity_threshold CREATE OPERATOR % ( LEFTARG = text, *** a/contrib/pg_trgm/trgm.h --- b/contrib/pg_trgm/trgm.h *** *** 105,111 typedef char *BITVECP; typedef struct TrgmPackedGraph TrgmPackedGraph; ! 
extern float4 trgm_limit; extern uint32 trgm2int(trgm *ptr); extern void compact_trigram(trgm *tptr, char *str, int bytelen); --- 105,111 typedef struct TrgmPackedGraph TrgmPackedGraph; ! extern double similarity_threshold; extern uint32 trgm2int(trgm *ptr); extern void compact_trigram(trgm *tptr, char *str, int bytelen); *** a/contrib/pg_trgm/trgm_gin.c --- b/contrib/pg_trgm/trgm_gin.c *** *** 206,212 gin_trgm_consistent(PG_FUNCTION_ARGS) * similarity is just c / len1. * So, independly on DIVUNION the upper bound formula is the same. */ ! res = (nkeys == 0) ? false : ((float4) ntrue) / ((float4) nkeys))) >= trgm_limit) ? true : false); break; case ILikeStrategyNumber: #ifndef IGNORECASE --- 206,214 * similarity is just c / len1. * So, independly on DIVUNION the upper bound formula is the same. */ ! res = (nkeys == 0) ? false : ! ((float4) ntrue) / ((float4) nkeys))) >= similarity_threshold) ! ? true : false); break; case ILikeStrategyNumber: #ifndef IGNORECASE *** *** 283,289 gin_trgm_triconsistent(PG_FUNCTION_ARGS) /* * See comment in gin_trgm_consistent() about * upper bound formula */ ! res = (nkeys == 0) ? GIN_FALSE : (float4) ntrue) / ((float4) nkeys)) >= trgm_limit) ? GIN_MAYBE : GIN_FALSE); break; case ILikeStrategyNumber: #ifndef IGNORECASE --- 285,293 /* * See comment in gin_trgm_consistent() about * upper bound formula */ ! res = (nkeys == 0) ? GIN_FALSE : ! (float4) ntrue) / ((float4) nkeys)) >= similarity_threshold) ! ? GIN_MAYBE : GIN_FALSE); break; case ILikeStrategyNumber: #ifndef IGNORECASE *** a/contrib/pg_trgm/trgm_gist.c --- b/contrib/pg_trgm/trgm_gist.c *** *** 294,300 gtrgm_consistent(PG_FUNCTION_ARGS) float4 tmpsml = cnt_sml(key, qtrg); /* strange bug at freebsd 5.2.1 and gcc 3.3.3 */ ! res = (*(int *) &tmpsml == *(int *) &trgm_limit || tmpsml > trgm_limit) ? true : false; } else if (ISALLTRUE(key)) { /* non-leaf contains signature */ --- 294,301 float4 tmpsml = cnt_sml(key, qtrg); /* strange bug at freebsd 5.2.1 and gcc 3.3.3 */ ! 
res = (*(int *) &tmpsml == *(int *) &similarity_threshold ! || tmpsml > similarity_threshold) ? true : false; } else if (ISALLTRUE(key)) { /* non-leaf contains signature */ *** *** 308,314 gtrgm_consistent(PG_FUNCTION_ARGS) if (len == 0) res = false; else ! res = (float8) count) / ((float8) len))) >= trgm_limit) ? true : false;
Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support
On 16.02.2016 18:14, Artur Zakirov wrote: I attached a new version of the patch. Sorry for noise. I attached new version of the patch. I saw mistakes in DecodeFlag(). This patch fix them. -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company *** a/doc/src/sgml/textsearch.sgml --- b/doc/src/sgml/textsearch.sgml *** *** 2615,2632 SELECT plainto_tsquery('supernova star'); ! To create an Ispell dictionary, use the built-in ! ispell template and specify several parameters: ! ! CREATE TEXT SEARCH DICTIONARY english_ispell ( TEMPLATE = ispell, ! DictFile = english, ! AffFile = english, ! StopWords = english ! ); Here, DictFile, AffFile, and StopWords --- 2615,2655 ! To create an Ispell dictionary perform these steps: ! ! ! ! download dictionary configuration files. OpenOffice ! extension files have the .oxt extension. It is necessary ! to extract .aff and .dic files, change ! extensions to .affix and .dict. For some ! dictionary files it is also needed to convert characters to the UTF-8 ! encoding with commands (for example, for norwegian language dictionary): ! iconv -f ISO_8859-1 -t UTF-8 -o nn_no.affix nn_NO.aff ! iconv -f ISO_8859-1 -t UTF-8 -o nn_no.dict nn_NO.dic ! ! ! ! ! ! copy files to the $SHAREDIR/tsearch_data directory ! ! ! ! ! load files into PostgreSQL with the following command: ! ! CREATE TEXT SEARCH DICTIONARY english_hunspell ( TEMPLATE = ispell, ! DictFile = en_us, ! AffFile = en_us, ! Stopwords = english); + + + Here, DictFile, AffFile, and StopWords *** *** 2643,2648 CREATE TEXT SEARCH DICTIONARY english_ispell ( --- 2666,2721 + The .affix file of Ispell has the following + structure: + + prefixes + flag *A: + . 
> RE # As in enter > reenter + suffixes + flag T: + E > ST # As in late > latest + [^AEIOU]Y > -Y,IEST # As in dirty > dirtiest + [AEIOU]Y> EST # As in gray > grayest + [^EY] > EST # As in small > smallest + + + + And the .dict file has the following structure: + + lapse/ADGRS + lard/DGRS + large/PRTY + lark/MRS + + + + + Format of the .dict file is: + + basic_form/affix_class_name + + + + + In the .affix file every affix flag is described in the + following format: + + condition > [-stripping_letters,] adding_affix + + + + + Here, condition has a format similar to the format of regular expressions. + It can use groupings [...] and [^...]. + For example, [AEIOU]Y means that the last letter of the word + is "y" and the penultimate letter is "a", + "e", "i", "o" or "u". + [^EY] means that the last letter is neither "e" + nor "y". + + + Ispell dictionaries support splitting compound words; a useful feature. Notice that the affix file should specify a special flag using the *** *** 2663,2668 SELECT ts_lexize('norwegian_ispell', 'sjokoladefabrikk'); --- 2736,2800 + + MySpell is very similar to Hunspell. + The .affix file of Hunspell has the following + structure: + + PFX A Y 1 + PFX A 0 re . + SFX T N 4 + SFX T 0 st e + SFX T y iest [^aeiou]y + SFX T 0 est[aeiou]y + SFX T 0 est[^ey] + + + + + The first line of an affix class is the header. Fields of an affix rules are + listed after the header: + + + + + parameter name (PFX or SFX) + + + + + flag (name of the affix class) + + + + + stripping characters from beginning (at prefix) or end (at suffix) of the + word + + + + + adding affix + + + + + condition that has a format similar to the format of regular expressions. + + + + + + The .dict file looks like the .dict file of + Ispell: + + larder/M + lardy/RT + large/RSPMYT + largehearted + + + MySpell does not support compound words. 
*** a/src/backend/tsearch/Makefile --- b/src/backend/tsearch/Makefile *** *** 13,20 include $(top_builddir)/src/Makefile.global DICTDIR=tsearch_data ! DICTFILES=synonym_sample.syn thesaurus_sample.ths hunspell_sample.affix \ ! ispell_sample.affix ispell_sample.dict OBJS = ts_locale.o ts_parse.o wparser.o wparser_def.o dict.o \ dict_simple.o dict_synonym.o dict_thesaurus.o \ --- 13,23 DICTDIR=
Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support
I attached a new version of the patch.

On 10.02.2016 19:46, Teodor Sigaev wrote: I duplicate the patch here. It's a very good thing to update dictionaries to support modern versions. And thank you for improving the documentation. Also I'm impressed by the long description in the spell.c header. Some notes about the code:

1. struct SPELL. Why do you remove union p? You leave a comment about using the d struct instead of the flag field, and as I can see it's a right comment. It increases the size of the SPELL structure.

Fixed.

2. struct AFFIX. I agree with Alvaro that the sum of sizes of bit fields should be less than or equal to the size of an integer. In the opposite case, suppose, we can get undefined behavior. Please split the bitfields into two integers.

Fixed.

3. unsigned char flagval[65000]; Is it forbidden to use 6 number? In any case, decodeFlag() doesn't restrict its return value. I suggest enlarging the array to 1<<16 and adding a limit to the return value of decodeFlag().

The flagval array was enlarged. Added verification of the return value of DecodeFlag() for the various FLAG parameters (FM_LONG, FM_NUM and FM_CHAR).

4. I'd like to see a short comment describing at least the new functions.

Added some comments which describe the new functions and the old functions for loading dictionaries into PostgreSQL. This patch adds new functions and modifies functions which are used for loading dictionaries. At the moment, the comments do not describe the functions used for word normalization. But I can add more comments.

5. Pls, add tests for new code.

Added tests. The old sample dictionary files were moved to the folder "dicts". New sample dictionary files were added:
- hunspell_sample_long.affix
- hunspell_sample_long.dict
- hunspell_sample_num.affix
- hunspell_sample_num.dict

-- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company *** a/doc/src/sgml/textsearch.sgml --- b/doc/src/sgml/textsearch.sgml *** *** 2615,2632 SELECT plainto_tsquery('supernova star'); ! To create an Ispell dictionary, use the built-in !
ispell template and specify several parameters: ! ! CREATE TEXT SEARCH DICTIONARY english_ispell ( TEMPLATE = ispell, ! DictFile = english, ! AffFile = english, ! StopWords = english ! ); Here, DictFile, AffFile, and StopWords --- 2615,2655 ! To create an Ispell dictionary perform these steps: ! ! ! ! download dictionary configuration files. OpenOffice ! extension files have the .oxt extension. It is necessary ! to extract .aff and .dic files, change extensions ! to .affix and .dict. For some dictionary ! files it is also needed to convert characters to the UTF-8 encoding ! with commands (for example, for norwegian language dictionary): ! iconv -f ISO_8859-1 -t UTF-8 -o nn_no.affix nn_NO.aff ! iconv -f ISO_8859-1 -t UTF-8 -o nn_no.dict nn_NO.dic ! ! ! ! ! ! copy files to the $SHAREDIR/tsearch_data directory ! ! ! ! ! load files into PostgreSQL with the following command: ! ! CREATE TEXT SEARCH DICTIONARY english_hunspell ( TEMPLATE = ispell, ! DictFile = en_us, ! AffFile = en_us, ! Stopwords = english); + + + Here, DictFile, AffFile, and StopWords *** *** 2643,2648 CREATE TEXT SEARCH DICTIONARY english_ispell ( --- 2666,2720 + The .affix file of Ispell has the following structure: + + prefixes + flag *A: + . > RE # As in enter > reenter + suffixes + flag T: + E > ST # As in late > latest + [^AEIOU]Y > -Y,IEST # As in dirty > dirtiest + [AEIOU]Y> EST # As in gray > grayest + [^EY] > EST # As in small > smallest + + + + And the .dict file has the following structure: + + lapse/ADGRS + lard/DGRS + large/PRTY + lark/MRS + + + + + Format of the .dict file is: + + basic_form/affix_class_name + + + + + In the .affix file every affix flag is described in the + following format: + + condition > [-stripping_letters,] adding_affix + + + + + Here, condition has a format similar to the format of regular expressions. + It can use groupings [...] and [^...]. 
+ For example, [AEIOU]Y means that the last letter of the word + is "y" and the penultimate letter is "a", + "e", "i", "o" or "u". + [^EY] means that the last letter is neither "e" + nor "y". + + + Ispell dictionaries support splitting compound words; a useful feature. Notice that the affix file should specify a special flag using the *** *** 2663,2668 SELECT ts_
Re: [HACKERS] commitfest problem ?
On 16.02.2016 17:50, Oleg Bartunov wrote: This entry https://commitfest.postgresql.org/8/419/ contains very unrelated patches from another commitfest. I think Oleg No, this is not a commitfest problem. The part of the thread "[PROPOSAL] Improvements of Hunspell dictionaries support" (http://www.postgresql.org/message-id/56aa02ee.6090...@postgrespro.ru) was moved to the thread "[PATCH] we have added support for box type in SP-GiST index" by mistake. I had noticed it too late. I was writing into the wrong thread. And now the commitfest shows the wrong last attachment. -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Re: [HACKERS] IF (NOT) EXISTS in psql-completion
On 05.02.2016 11:09, Kyotaro HORIGUCHI wrote: Hello, I considered how to make tab-completion robust against syntactical noise, in other words, optional words in the syntax. Typically "IF (NOT) EXISTS", UNIQUE and TEMPORARY are words that don't affect further completion. However, the current delimit-matching mechanism is not capable enough (or is complexity-prone) to live with such noise. I have proposed using regular expressions or a simplified form of them for robustness, but it was too complex to be applied. This is another answer to the problem. Removal of such words on-the-fly makes further matching more robust. Next, currently some CREATE xxx subsyntaxes of CREATE SCHEMA are matched using TailMatching, but that makes the options-removal operations, which need forward matching, difficult. So I introduced two things to resolve them in this patch. I did some tests with your patch. But I am not confident in tab-complete.c. And I have some notes: 1 - I execute the git apply command and get the following warning: ../0001-Suggest-IF-NOT-EXISTS-for-tab-completion-of-psql.patch:302: trailing whitespace. /* warning: 1 line adds whitespace errors. This is because of superfluous whitespace, I think. 2 - In psql I write "create table if" and press Tab. psql adds the following: create table IF NOT EXISTS I think psql should continue in lower case if the user wrote the query in lower case: create table if not exists 3 - Same with "IF EXISTS". If I write "alter view if" and press Tab, psql writes: alter view IF EXISTS -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Re: [HACKERS] Fuzzy substring searching with the pg_trgm extension
On 11.02.2016 16:35, Mike Rylander wrote: On Thu, Feb 11, 2016 at 8:11 AM, Teodor Sigaev wrote: I have attached a new version of the patch. It fixes errors in the operators <->> and %>: - operator <->> did not pass the regression test on CentOS 32 bit (gcc 4.4.7 20120313). - operator %> did not pass the regression test on FreeBSD 32 bit (gcc 4.2.1 20070831). It was because of variable optimization by gcc. Fixed with the volatile modifier, right? I'm close to pushing these patches, but I still have doubts about the names, and I'd like to see comments from English speakers: 1 sml_limit GUC variable (options: similarity_limit, sml_threshold) 2 subword_similarity(). Actually, it finds the most similar word (not substring!) in the whole string. word_similarity? word_in_string_similarity? At least for this English speaker, substring_similarity is not confusing even if it's not internally accurate, but English is a strange language. Because I want the bike shed to be blue, how does query_string_similarity sound instead? If that's overly precise, then word_similarity would be fine with me. Thanks, -- Mike Rylander Great. I can change the names: 1 - sml_limit to similarity_limit. sml_threshold is difficult to write, I think; similarity_limit is simpler. 2 - subword_similarity() to word_similarity(). -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Re: [HACKERS] Fuzzy substring searching with the pg_trgm extension
On 11.02.2016 16:11, Teodor Sigaev wrote: I have attached a new version of the patch. It fixes errors in the operators <->> and %>: - operator <->> did not pass the regression test on CentOS 32 bit (gcc 4.4.7 20120313). - operator %> did not pass the regression test on FreeBSD 32 bit (gcc 4.2.1 20070831). It was because of variable optimization by gcc. Fixed with the volatile modifier, right? Yes, it was fixed with the volatile modifier. I'm close to pushing these patches, but I still have doubts about the names, and I'd like to see comments from English speakers: 1 sml_limit GUC variable (options: similarity_limit, sml_threshold) 2 subword_similarity(). Actually, it finds the most similar word (not substring!) in the whole string. word_similarity? word_in_string_similarity? substring_similarity_pos() could be a separate patch. -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support
Thank you for the review. On 10.02.2016 19:46, Teodor Sigaev wrote: I duplicate the patch here. It's a very good thing to update dictionaries to support modern versions. And thank you for improving the documentation. Also I'm impressed by the long description in the spell.c header. Some notes about the code: 1 struct SPELL. Why do you remove union p? You left a comment about using the d struct instead of the flag field, and as I can see it's a correct comment. It increases the size of the SPELL structure. I will fix it. I had misunderstood Alvaro's comment about it. 2 struct AFFIX. I agree with Alvaro that the sum of the sizes of the bit fields should be less than or equal to the size of an integer. Otherwise, I suppose, we can get undefined behavior. Please split the bit fields into two integers. I will fix it. Here I had misunderstood too. 3 unsigned char flagval[65000]; Is it forbidden to use 6 number? In any case, decodeFlag() doesn't restrict its return value. I suggest enlarging the array to 1<<16 and adding a limit to the return value of decodeFlag(). I think it can be done. 4 I'd like to see a short comment describing at least the new functions. Now there are more comments in spell.c. I wanted to send the fixed patch after adding all the comments that I want to add. But I can send the patch now. Also I will merge this commit http://www.postgresql.org/message-id/e1atf9o-0001ga...@gemulon.postgresql.org 5 Please add tests for the new code. I will add them. -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Re: [HACKERS] Mac OS: invalid byte sequence for encoding "UTF8"
On 11.02.2016 03:33, Tom Lane wrote: Artur Zakirov writes: [ tsearch_aff_parse_v1.patch ] I've pushed this with some corrections --- notably, I did not like the lack of buffer-overrun prevention, and it did the wrong thing if a line had more than one trailing space character. We still need to look at other uses of *scanf(), but I think that any other changes that might be needed should be separate patches anyhow. regards, tom lane Thank you! -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Re: [HACKERS] Mac OS: invalid byte sequence for encoding "UTF8"
On 11.02.2016 01:19, Tom Lane wrote: I wrote: Artur Zakirov writes: I think this is not a bug. It is normal behavior. In Mac OS sscanf() with the %s format reads the string one character at a time. The size of the letter 'х' is 2. And sscanf() separates it into two wrong characters. That argument might be convincing if OSX behaved that way for all multibyte characters, but it doesn't seem to be doing that. Why is only 'х' affected? I looked into the OS X sources, and found that indeed you are right: *scanf processes the input a byte at a time, and applies isspace() to each byte separately, even when the locale is such that that's a clearly insane thing to do. Since this code was derived from FreeBSD, FreeBSD has or once had the same issue. (A look at the freebsd project on github says it still does, assuming that's the authoritative repo.) Not sure about other BSDen. I also verified that in UTF8-based locales, isspace() thinks that 0x85 and 0xA0, and no other high-bit-set values, are spaces. Not sure exactly why it thinks that, but that explains why 'х' fails when adjacent code points don't. So apparently the coding rule we have to adopt is "don't use *scanf() on data that might contain multibyte characters". (There might be corner cases where it'd work all right for conversion specifiers other than %s, but probably you might as well just use strtol and friends in such cases.) Ugh. regards, tom lane Yes, I meant this. The second byte divides the word into two wrong pieces. Sorry for my unclear explanation. I should have explained it more clearly. -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Re: [HACKERS] Mac OS: invalid byte sequence for encoding "UTF8"
On 10.02.2016 18:51, Teodor Sigaev wrote: Hmm. Here src/backend/access/transam/xlog.c read_tablespace_map() using %s in scanf looks suspicious. I don't fully understand it, but it looks like it tries to read an OID as a string. So, it should be safe in the usual case. Next, _LoadBlobs() reads a filename (fname) with the help of sscanf(). Could the file name be in UTF-8 encoding here? This function reads the "blobs.toc" file. Its lines have the following format: Where is blob_.dat. Therefore it should be safe too. -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Re: [HACKERS] Fuzzy substring searching with the pg_trgm extension
On 02.02.2016 15:45, Artur Zakirov wrote: On 01.02.2016 20:12, Artur Zakirov wrote: I have changed the patch: 1 - trgm2.data was corrected, duplicates were deleted. 2 - I have added operators <<-> and <->> with GiST index supporting. A regression test will pass only with the patch http://www.postgresql.org/message-id/capphfdt19fwqxaryjkzxb3oxmv-kan3fluzrooare_u3h3c...@mail.gmail.com 3 - the function substring_similarity() was renamed to subword_similarity(). But there is not a function substring_similarity_pos() yet. It is not trivial. Sorry, in the previous patch was a typo. Here is the fixed patch. I have attached a new version of the patch. It fixes error of operators <->> and %>: - operator <->> did not pass the regression test in CentOS 32 bit (gcc 4.4.7 20120313). - operator %> did not pass the regression test in FreeBSD 32 bit (gcc 4.2.1 20070831). It was because of variable optimization by gcc. In this patch pg_trgm documentation was corrected. Now operators were wrote as %> and <->> (not <% and <<->). There is a problem in adding the substring_similarity_pos() function. It can bring additional overhead. Because we need to store characters position including spaces in addition. Spaces between words are lost in current implementation. Does it actually need? In conclusion, this patch introduces: 1 - functions: - subword_similarity() 2 - operators: - %> - <->> 3 - GUC variables: - pg_trgm.sml_limit - pg_trgm.subword_limit -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company *** a/contrib/pg_trgm/pg_trgm--1.2.sql --- b/contrib/pg_trgm/pg_trgm--1.2.sql *** *** 3,13 --- 3,15 -- complain if script is sourced in psql, rather than via CREATE EXTENSION \echo Use "CREATE EXTENSION pg_trgm" to load this file. 
\quit + -- Deprecated function CREATE FUNCTION set_limit(float4) RETURNS float4 AS 'MODULE_PATHNAME' LANGUAGE C STRICT VOLATILE; + -- Deprecated function CREATE FUNCTION show_limit() RETURNS float4 AS 'MODULE_PATHNAME' *** *** 26,32 LANGUAGE C STRICT IMMUTABLE; CREATE FUNCTION similarity_op(text,text) RETURNS bool AS 'MODULE_PATHNAME' ! LANGUAGE C STRICT STABLE; -- stable because depends on trgm_limit CREATE OPERATOR % ( LEFTARG = text, --- 28,34 CREATE FUNCTION similarity_op(text,text) RETURNS bool AS 'MODULE_PATHNAME' ! LANGUAGE C STRICT STABLE; -- stable because depends on pg_trgm.sml_limit CREATE OPERATOR % ( LEFTARG = text, *** a/contrib/pg_trgm/trgm.h --- b/contrib/pg_trgm/trgm.h *** *** 105,111 typedef char *BITVECP; typedef struct TrgmPackedGraph TrgmPackedGraph; ! extern float4 trgm_limit; extern uint32 trgm2int(trgm *ptr); extern void compact_trigram(trgm *tptr, char *str, int bytelen); --- 105,111 typedef struct TrgmPackedGraph TrgmPackedGraph; ! extern double trgm_sml_limit; extern uint32 trgm2int(trgm *ptr); extern void compact_trigram(trgm *tptr, char *str, int bytelen); *** a/contrib/pg_trgm/trgm_gin.c --- b/contrib/pg_trgm/trgm_gin.c *** *** 206,212 gin_trgm_consistent(PG_FUNCTION_ARGS) * similarity is just c / len1. * So, independly on DIVUNION the upper bound formula is the same. */ ! res = (nkeys == 0) ? false : ((float4) ntrue) / ((float4) nkeys))) >= trgm_limit) ? true : false); break; case ILikeStrategyNumber: #ifndef IGNORECASE --- 206,213 * similarity is just c / len1. * So, independly on DIVUNION the upper bound formula is the same. */ ! res = (nkeys == 0) ? false : ! ((float4) ntrue) / ((float4) nkeys))) >= trgm_sml_limit) ? true : false); break; case ILikeStrategyNumber: #ifndef IGNORECASE *** *** 283,289 gin_trgm_triconsistent(PG_FUNCTION_ARGS) /* * See comment in gin_trgm_consistent() about * upper bound formula */ ! res = (nkeys == 0) ? GIN_FALSE : (float4) ntrue) / ((float4) nkeys)) >= trgm_limit) ? 
GIN_MAYBE : GIN_FALSE); break; case ILikeStrategyNumber: #ifndef IGNORECASE --- 284,291 /* * See comment in gin_trgm_consistent() about * upper bound formula */ ! res = (nkeys == 0) ? GIN_FALSE : ! (float4) ntrue) / ((float4) nkeys)) >= trgm_sml_limit) ? GIN_MAYBE : GIN_FALSE); break; case ILikeStrategyNumber: #ifndef IGNORECASE *** a/contrib/pg_trgm/trgm_gist.c --- b/contrib/pg_trgm/trgm_gist.c *** *** 294,300 gtrgm_consistent(PG_FUNCTION_ARGS) float4 tmpsml = cnt_sml(key, qtrg); /* strange bug at freebsd 5.2.1 and gcc 3.3.3 */ ! res = (*(int *) &tmpsml == *(int *) &trgm_limit || tmpsml > trgm_limit) ? true : false; } else if (ISALLTRUE(key))
Re: [HACKERS] Mac OS: invalid byte sequence for encoding "UTF8"
On 09.02.2016 20:13, Tom Lane wrote: I do not like this patch much. It is basically "let's stop using sscanf() because it seems to have a bug on one platform". There are at least two things wrong with that approach: 1. By my count there are about 80 uses of *scanf() in our code. Are we going to replace every one of them with hand-rolled code? If not, why is only this instance vulnerable? How can we know whether future uses will have a problem? It seems that *scanf() with the %s format occurs only here: - check.c - get_bin_version() - server.c - get_major_server_version() - filemap.c - isRelDataFile() - pg_backup_directory.c - _LoadBlobs() - xlog.c - do_pg_stop_backup() - mac.c - macaddr_in() I think sscanf() does not work with UTF-8 characters here. And probably this is a spell.c-only issue. I agree that the previous patch is wrong. Instead of using the new parse_ooaffentry() function, it might be better to use sscanf() with the %ls format. The %ls format is used to read a wide character string. 2. We're not being very good citizens of the software universe if we just install a hack in Postgres rather than nagging Apple to fix the bug at its true source. I think the appropriate next step to take is to dig into the OS X sources (see http://www.opensource.apple.com, I think probably the relevant code is in the Libc package) and identify exactly what is causing the misbehavior. That would both allow an informed answer to point #1 and greatly increase the odds of getting action on a bug report to Apple. Even if we end up applying this patch verbatim, I think we need that information first. regards, tom lane I think this is not a bug. It is normal behavior. In Mac OS sscanf() with the %s format reads the string one character at a time. The size of the letter 'х' is 2. And sscanf() separates it into two wrong characters. In conclusion, I think spell.c should use sscanf() with the %ls format. 
-- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Re: [HACKERS] Fuzzy substring searching with the pg_trgm extension
On 01.02.2016 20:12, Artur Zakirov wrote: I have changed the patch: 1 - trgm2.data was corrected, duplicates were deleted. 2 - I have added operators <<-> and <->> with GiST index supporting. A regression test will pass only with the patch http://www.postgresql.org/message-id/capphfdt19fwqxaryjkzxb3oxmv-kan3fluzrooare_u3h3c...@mail.gmail.com 3 - the function substring_similarity() was renamed to subword_similarity(). But there is not a function substring_similarity_pos() yet. It is not trivial. Sorry, in the previous patch was a typo. Here is the fixed patch. -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company *** a/contrib/pg_trgm/pg_trgm--1.2.sql --- b/contrib/pg_trgm/pg_trgm--1.2.sql *** *** 3,13 --- 3,15 -- complain if script is sourced in psql, rather than via CREATE EXTENSION \echo Use "CREATE EXTENSION pg_trgm" to load this file. \quit + -- Deprecated function CREATE FUNCTION set_limit(float4) RETURNS float4 AS 'MODULE_PATHNAME' LANGUAGE C STRICT VOLATILE; + -- Deprecated function CREATE FUNCTION show_limit() RETURNS float4 AS 'MODULE_PATHNAME' *** *** 26,32 LANGUAGE C STRICT IMMUTABLE; CREATE FUNCTION similarity_op(text,text) RETURNS bool AS 'MODULE_PATHNAME' ! LANGUAGE C STRICT STABLE; -- stable because depends on trgm_limit CREATE OPERATOR % ( LEFTARG = text, --- 28,34 CREATE FUNCTION similarity_op(text,text) RETURNS bool AS 'MODULE_PATHNAME' ! LANGUAGE C STRICT STABLE; -- stable because depends on pg_trgm.sml_limit CREATE OPERATOR % ( LEFTARG = text, *** a/contrib/pg_trgm/trgm.h --- b/contrib/pg_trgm/trgm.h *** *** 105,111 typedef char *BITVECP; typedef struct TrgmPackedGraph TrgmPackedGraph; ! extern float4 trgm_limit; extern uint32 trgm2int(trgm *ptr); extern void compact_trigram(trgm *tptr, char *str, int bytelen); --- 105,111 typedef struct TrgmPackedGraph TrgmPackedGraph; ! 
extern double trgm_sml_limit; extern uint32 trgm2int(trgm *ptr); extern void compact_trigram(trgm *tptr, char *str, int bytelen); *** a/contrib/pg_trgm/trgm_gin.c --- b/contrib/pg_trgm/trgm_gin.c *** *** 206,212 gin_trgm_consistent(PG_FUNCTION_ARGS) * similarity is just c / len1. * So, independly on DIVUNION the upper bound formula is the same. */ ! res = (nkeys == 0) ? false : ((float4) ntrue) / ((float4) nkeys))) >= trgm_limit) ? true : false); break; case ILikeStrategyNumber: #ifndef IGNORECASE --- 206,213 * similarity is just c / len1. * So, independly on DIVUNION the upper bound formula is the same. */ ! res = (nkeys == 0) ? false : ! ((float4) ntrue) / ((float4) nkeys))) >= trgm_sml_limit) ? true : false); break; case ILikeStrategyNumber: #ifndef IGNORECASE *** *** 283,289 gin_trgm_triconsistent(PG_FUNCTION_ARGS) /* * See comment in gin_trgm_consistent() about * upper bound formula */ ! res = (nkeys == 0) ? GIN_FALSE : (float4) ntrue) / ((float4) nkeys)) >= trgm_limit) ? GIN_MAYBE : GIN_FALSE); break; case ILikeStrategyNumber: #ifndef IGNORECASE --- 284,291 /* * See comment in gin_trgm_consistent() about * upper bound formula */ ! res = (nkeys == 0) ? GIN_FALSE : ! (float4) ntrue) / ((float4) nkeys)) >= trgm_sml_limit) ? GIN_MAYBE : GIN_FALSE); break; case ILikeStrategyNumber: #ifndef IGNORECASE *** a/contrib/pg_trgm/trgm_gist.c --- b/contrib/pg_trgm/trgm_gist.c *** *** 294,300 gtrgm_consistent(PG_FUNCTION_ARGS) float4 tmpsml = cnt_sml(key, qtrg); /* strange bug at freebsd 5.2.1 and gcc 3.3.3 */ ! res = (*(int *) &tmpsml == *(int *) &trgm_limit || tmpsml > trgm_limit) ? true : false; } else if (ISALLTRUE(key)) { /* non-leaf contains signature */ --- 294,301 float4 tmpsml = cnt_sml(key, qtrg); /* strange bug at freebsd 5.2.1 and gcc 3.3.3 */ ! res = (*(int *) &tmpsml == *(int *) &trgm_sml_limit ! || tmpsml > trgm_sml_limit) ? 
true : false; } else if (ISALLTRUE(key)) { /* non-leaf contains signature */ *** *** 308,314 gtrgm_consistent(PG_FUNCTION_ARGS) if (len == 0) res = false; else ! res = (float8) count) / ((float8) len))) >= trgm_limit) ? true : false; } break; case ILikeStrategyNumber: --- 309,315 if (len == 0) res = false; else ! res = (float8) count) / ((float8) len))) >= trgm_sml_limit) ? true : false; } break; case ILikeStrategyNumber: *** a/contrib/pg_trgm/trgm_op.c --- b/contrib/pg_trgm/trgm_op.c *** *** 14,20 PG_MODULE_MAGIC; ! float4 trgm_limit = 0.3f; PG_FUNC
Re: [HACKERS] Fuzzy substring searching with the pg_trgm extension
On 29.01.2016 18:58, Artur Zakirov wrote: On 29.01.2016 18:39, Alvaro Herrera wrote: Teodor Sigaev wrote: The behavior of this function is surprising to me. select substring_similarity('dog' , 'hotdogpound') ; substring_similarity -- 0.25 Substring search was desined to search similar word in string: contrib_regression=# select substring_similarity('dog' , 'hot dogpound') ; substring_similarity -- 0.75 contrib_regression=# select substring_similarity('dog' , 'hot dog pound') ; substring_similarity -- 1 Hmm, this behavior looks too much like magic to me. I mean, a substring is a substring -- why are we treating the space as a special character here? I think, I can rename this function to subword_similarity() and correct the documentation. The current behavior is developed to find most similar word in a text. For example, if we will search just substring (not word) then we will get the following result: select substring_similarity('dog', 'dogmatist'); substring_similarity - 1 (1 row) But this is wrong I think. They are completely different words. For searching a similar substring (not word) in a text maybe another function should be added? I have changed the patch: 1 - trgm2.data was corrected, duplicates were deleted. 2 - I have added operators <<-> and <->> with GiST index supporting. A regression test will pass only with the patch http://www.postgresql.org/message-id/capphfdt19fwqxaryjkzxb3oxmv-kan3fluzrooare_u3h3c...@mail.gmail.com 3 - the function substring_similarity() was renamed to subword_similarity(). But there is not a function substring_similarity_pos() yet. It is not trivial. -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company *** a/contrib/pg_trgm/pg_trgm--1.2.sql --- b/contrib/pg_trgm/pg_trgm--1.2.sql *** *** 3,13 --- 3,15 -- complain if script is sourced in psql, rather than via CREATE EXTENSION \echo Use "CREATE EXTENSION pg_trgm" to load this file. 
\quit + -- Deprecated function CREATE FUNCTION set_limit(float4) RETURNS float4 AS 'MODULE_PATHNAME' LANGUAGE C STRICT VOLATILE; + -- Deprecated function CREATE FUNCTION show_limit() RETURNS float4 AS 'MODULE_PATHNAME' *** *** 26,32 LANGUAGE C STRICT IMMUTABLE; CREATE FUNCTION similarity_op(text,text) RETURNS bool AS 'MODULE_PATHNAME' ! LANGUAGE C STRICT STABLE; -- stable because depends on trgm_limit CREATE OPERATOR % ( LEFTARG = text, --- 28,34 CREATE FUNCTION similarity_op(text,text) RETURNS bool AS 'MODULE_PATHNAME' ! LANGUAGE C STRICT STABLE; -- stable because depends on pg_trgm.sml_limit CREATE OPERATOR % ( LEFTARG = text, *** a/contrib/pg_trgm/trgm.h --- b/contrib/pg_trgm/trgm.h *** *** 105,111 typedef char *BITVECP; typedef struct TrgmPackedGraph TrgmPackedGraph; ! extern float4 trgm_limit; extern uint32 trgm2int(trgm *ptr); extern void compact_trigram(trgm *tptr, char *str, int bytelen); --- 105,111 typedef struct TrgmPackedGraph TrgmPackedGraph; ! extern double trgm_sml_limit; extern uint32 trgm2int(trgm *ptr); extern void compact_trigram(trgm *tptr, char *str, int bytelen); *** a/contrib/pg_trgm/trgm_gin.c --- b/contrib/pg_trgm/trgm_gin.c *** *** 206,212 gin_trgm_consistent(PG_FUNCTION_ARGS) * similarity is just c / len1. * So, independly on DIVUNION the upper bound formula is the same. */ ! res = (nkeys == 0) ? false : ((float4) ntrue) / ((float4) nkeys))) >= trgm_limit) ? true : false); break; case ILikeStrategyNumber: #ifndef IGNORECASE --- 206,213 * similarity is just c / len1. * So, independly on DIVUNION the upper bound formula is the same. */ ! res = (nkeys == 0) ? false : ! ((float4) ntrue) / ((float4) nkeys))) >= trgm_sml_limit) ? true : false); break; case ILikeStrategyNumber: #ifndef IGNORECASE *** *** 283,289 gin_trgm_triconsistent(PG_FUNCTION_ARGS) /* * See comment in gin_trgm_consistent() about * upper bound formula */ ! res = (nkeys == 0) ? GIN_FALSE : (float4) ntrue) / ((float4) nkeys)) >= trgm_limit) ? 
GIN_MAYBE : GIN_FALSE); break; case ILikeStrategyNumber: #ifndef IGNORECASE --- 284,291 /* * See comment in gin_trgm_consistent() about * upper bound formula */ ! res = (nkeys == 0) ? GIN_FALSE : ! (float4) ntrue) / ((float4) nkeys)) >= trgm_sml_limit) ? GIN_MAYBE : GIN_FALSE); break; case ILikeStrategyNumber: #ifndef IGNORECASE *** a/contrib/pg_trgm/trgm_gist.c --- b/contr
Re: [HACKERS] Fuzzy substring searching with the pg_trgm extension
On 29.01.2016 18:39, Alvaro Herrera wrote: Teodor Sigaev wrote: The behavior of this function is surprising to me. select substring_similarity('dog' , 'hotdogpound') ; substring_similarity -- 0.25 Substring search was designed to search for a similar word in a string: contrib_regression=# select substring_similarity('dog' , 'hot dogpound') ; substring_similarity -- 0.75 contrib_regression=# select substring_similarity('dog' , 'hot dog pound') ; substring_similarity -- 1 Hmm, this behavior looks too much like magic to me. I mean, a substring is a substring -- why are we treating the space as a special character here? I think I can rename this function to subword_similarity() and correct the documentation. The current behavior is designed to find the most similar word in a text. For example, if we search for just a substring (not a word) then we get the following result: select substring_similarity('dog', 'dogmatist'); substring_similarity - 1 (1 row) But this is wrong, I think. They are completely different words. Maybe another function should be added for searching for a similar substring (not a word) in a text? -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Re: [HACKERS] Fuzzy substring searching with the pg_trgm extension
On 29.01.2016 17:15, Teodor Sigaev wrote: The behavior of this function is surprising to me. select substring_similarity('dog' , 'hotdogpound') ; substring_similarity -- 0.25 Substring search was designed to search for a similar word in a string: contrib_regression=# select substring_similarity('dog' , 'hot dogpound') ; substring_similarity -- 0.75 contrib_regression=# select substring_similarity('dog' , 'hot dog pound') ; substring_similarity -- 1 It seems to me that users search for words in long strings. But I agree that a more detailed explanation is needed and, maybe, we need to change the feature name to fuzzywordsearch or something else, I can't imagine how. Thank you for the review. I will rename the function. Maybe to subword_similarity()? Also, should we have a function which indicates the position in the 2nd string at which the most similar match to the 1st argument occurs? select substring_similarity_pos('dog' , 'hotdogpound') ; answering: 4 Interesting, I think it will be useful in some cases. We could call them <<-> and <->>, where the first corresponds to <% and the second to %> Agreed. I will add them. -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Re: [HACKERS] Mac OS: invalid byte sequence for encoding "UTF8"
On 28.01.2016 17:42, Artur Zakirov wrote: On 27.01.2016 15:28, Artur Zakirov wrote: On 27.01.2016 14:14, Stas Kelvich wrote: Hi. I tried that and confirm the strange behaviour. It seems that the problem is with the small Cyrillic letter ‘х’. (simplest obscene language filter? =) That can be reproduced with a simpler test Stas The test program was corrected. Now it uses the wchar_t type. And it works correctly and gives the right output. I think NIImportOOAffixes() in spell.c should be corrected to avoid this bug. I have attached a patch. It adds new functions parse_ooaffentry() and get_nextentry() and fixes a couple of comments. Now Russian and other supported dictionaries can be used for text search in Mac OS. parse_ooaffentry() parses an affix file entry instead of sscanf(). It has a similar algorithm to the parse_affentry() function. Should I create a new patch to fix this bug (as I did) or should this patch go with the patch http://www.postgresql.org/message-id/56aa02ee.6090...@postgrespro.ru ? I have created a new entry in the commitfest for this patch https://commitfest.postgresql.org/9/496/ -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Re: [HACKERS] Fuzzy substring searching with the pg_trgm extension
On 21.01.2016 00:25, Alvaro Herrera wrote: Artur Zakirov wrote: I don't quite understand why we aren't using a custom GUC variable here. These already have SHOW and SET support ... Added GUC variables: - pg_trgm.limit - pg_trgm.substring_limit I added these variables to the documentation. The show_limit() and set_limit() functions work correctly and they are marked as deprecated. Thanks. I'm willing to commit quickly a small patch that only changes the existing function to GUCs, then have a look at a separate patch that adds the new substring operator. Would you split the patch that way? What is the status of this patch? In the commitfest it is "Needs review". Can this patch get the status "Ready for Committer"? -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Re: [HACKERS] Mac OS: invalid byte sequence for encoding "UTF8"
On 27.01.2016 15:28, Artur Zakirov wrote: On 27.01.2016 14:14, Stas Kelvich wrote: Hi. I tried that and confirm strange behaviour. It seems that problem with small cyrillic letter ‘х’. (simplest obscene language filter? =) That can be reproduced with simpler test Stas The test program was corrected. Now it uses wchar_t type. And it works correctly and gives right output. I think the NIImportOOAffixes() in spell.c should be corrected to avoid this bug. I have attached a patch. It adds new functions parse_ooaffentry() and get_nextentry() and fixes a couple comments. Now russian and other supported dictionaries can be used for text search in Mac OS. parse_ooaffentry() parses an affix file entry instead of sscanf(). It has a similar algorithm to the parse_affentry() function. Should I create a new patch to fix this bug (as I did) or this patch should go with the patch http://www.postgresql.org/message-id/56aa02ee.6090...@postgrespro.ru ? -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company *** a/src/backend/tsearch/spell.c --- b/src/backend/tsearch/spell.c *** *** 458,469 NIAddAffix(IspellDict *Conf, int flag, char flagflags, const char *mask, const c } #define PAE_WAIT_MASK 0 ! #define PAE_INMASK 1 #define PAE_WAIT_FIND 2 ! #define PAE_INFIND 3 #define PAE_WAIT_REPL 4 ! #define PAE_INREPL 5 static bool parse_affentry(char *str, char *mask, char *find, char *repl) { --- 458,579 } #define PAE_WAIT_MASK 0 ! #define PAE_INMASK 1 #define PAE_WAIT_FIND 2 ! #define PAE_INFIND 3 #define PAE_WAIT_REPL 4 ! #define PAE_INREPL 5 ! #define PAE_WAIT_TYPE 6 ! #define PAE_WAIT_FLAG 7 + /* + * Used in parse_ooaffentry() to parse an .affix file entry. 
+ */ + static bool + get_nextentry(char **str, char *next) + { + int state = PAE_WAIT_MASK; + char *pnext = next; + + *next = '\0'; + + while (**str) + { + if (state == PAE_WAIT_MASK) + { + if (t_iseq(*str, '#')) + return false; + else if (!t_isspace(*str)) + { + COPYCHAR(pnext, *str); + pnext += pg_mblen(*str); + state = PAE_INMASK; + } + } + else if (state == PAE_INMASK) + { + if (t_isspace(*str)) + { + *pnext = '\0'; + return true; + } + else + { + COPYCHAR(pnext, *str); + pnext += pg_mblen(*str); + } + } + *str += pg_mblen(*str); + } + + *pnext ='\0'; + + return *next; + } + + /* + * Parses entry of an .affix file of MySpell or Hunspell format. + * + * An .affix file entry has the following format: + * - header + * + * - fields after header: + * + */ + static int + parse_ooaffentry(char *str, char *type, char *flag, char *find, + char *repl, char *mask) + { + int state = PAE_WAIT_TYPE, + next_state = PAE_WAIT_FLAG; + int parse_read = 0; + bool valid = true; + + *type = *flag = *find = *repl = *mask = '\0'; + + while (*str && valid) + { + switch (state) + { + case PAE_WAIT_TYPE: + valid = get_nextentry(&str, type); + break; + case PAE_WAIT_FLAG: + valid = get_nextentry(&str, flag); + next_state = PAE_WAIT_FIND; + break; + case PAE_WAIT_FIND: + valid = get_nextentry(&str, find); + next_state = PAE_WAIT_REPL; + break; + case PAE_WAIT_REPL: + valid = get_nextentry(&str, repl); + next_state = PAE_WAIT_MASK; + break; + case PAE_WAIT_MASK: + get_nextentry(&str, mask); + /* break loop */ + valid = false; + break; + default: + elog(ERROR, "unrecognized state in parse_ooaffentry: %d", state); + } + state = next_state; + if (*str) + str += pg_mblen(str); + + parse_read++; + } + + return parse_read; + } + + /* + * Parses entry of an .affix file of Ispell format + * + * An .affix file entry has the following format: + * > [-,] + */ static bool parse_affentry(char *str, char *mask, char *find, char *repl) { *** *** 618,625 NIImportOOAffixes(IspellDict *Conf, const char 
*filename) int flag = 0; char flagflags = 0; tsearch_readline_state trst; ! int scanread = 0; ! char scanbuf[BUFSIZ]; char *recoded; /* read file to find any flag */ --- 728,734 int flag = 0; char flagflags = 0; tsearch_readline_state trst; ! int parseread = 0; char *recoded; /* read file to find any flag */ *** *** 682,689 NIImportOOAffixes(IspellDict *Conf, const char *filename) } tsearch_readline_end(&trst); - sprintf(scanbuf, "%%6s %%%ds %%%ds %%%ds %%%ds", BUFSIZ / 5, BUFSIZ / 5, BUFSIZ / 5, BUFSIZ / 5); - if (!tsearch_readline_begin(&trst, filename)) ereport(ERROR, (errcode(ERRCODE_CONFIG_FILE_ERROR), --- 791,796 *** *** 695,709 NIImportOOAffixes(IspellDict *Conf, const char *filename) if
Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support
On 28.01.2016 14:19, Alvaro Herrera wrote: Artur Zakirov wrote: I undo the changes and the error will be raised. I will update the patch soon. I don't think you ever did this. I'm closing it now, but it sounds useful stuff so please do resubmit for 2016-03. I'm working on the patch. I wanted to send this changes after all changes. This version of the patch has a top-level comment. Another comments I will provides soon. Also this patch has some changes with ternary operators. > I don't think you ever did this. I'm closing it now, but it sounds > useful stuff so please do resubmit for 2016-03. Moved to next CF. -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company *** a/doc/src/sgml/textsearch.sgml --- b/doc/src/sgml/textsearch.sgml *** *** 2615,2632 SELECT plainto_tsquery('supernova star'); ! To create an Ispell dictionary, use the built-in ! ispell template and specify several parameters: ! ! CREATE TEXT SEARCH DICTIONARY english_ispell ( TEMPLATE = ispell, ! DictFile = english, ! AffFile = english, ! StopWords = english ! ); Here, DictFile, AffFile, and StopWords --- 2615,2655 ! To create an Ispell dictionary perform these steps: ! ! ! ! download dictionary configuration files. OpenOffice ! extension files have the .oxt extension. It is necessary ! to extract .aff and .dic files, change extensions ! to .affix and .dict. For some dictionary ! files it is also needed to convert characters to the UTF-8 encoding ! with commands (for example, for norwegian language dictionary): ! iconv -f ISO_8859-1 -t UTF-8 -o nn_no.affix nn_NO.aff ! iconv -f ISO_8859-1 -t UTF-8 -o nn_no.dict nn_NO.dic ! ! ! ! ! ! copy files to the $SHAREDIR/tsearch_data directory ! ! ! ! ! load files into PostgreSQL with the following command: ! ! CREATE TEXT SEARCH DICTIONARY english_hunspell ( TEMPLATE = ispell, ! DictFile = en_us, ! AffFile = en_us, ! 
Stopwords = english); + + + Here, DictFile, AffFile, and StopWords *** *** 2643,2648 CREATE TEXT SEARCH DICTIONARY english_ispell ( --- 2666,2720 + The .affix file of Ispell has the following structure: + + prefixes + flag *A: + . > RE # As in enter > reenter + suffixes + flag T: + E > ST # As in late > latest + [^AEIOU]Y > -Y,IEST # As in dirty > dirtiest + [AEIOU]Y> EST # As in gray > grayest + [^EY] > EST # As in small > smallest + + + + And the .dict file has the following structure: + + lapse/ADGRS + lard/DGRS + large/PRTY + lark/MRS + + + + + Format of the .dict file is: + + basic_form/affix_class_name + + + + + In the .affix file every affix flag is described in the + following format: + + condition > [-stripping_letters,] adding_affix + + + + + Here, condition has a format similar to the format of regular expressions. + It can use groupings [...] and [^...]. + For example, [AEIOU]Y means that the last letter of the word + is "y" and the penultimate letter is "a", + "e", "i", "o" or "u". + [^EY] means that the last letter is neither "e" + nor "y". + + + Ispell dictionaries support splitting compound words; a useful feature. Notice that the affix file should specify a special flag using the *** *** 2663,2668 SELECT ts_lexize('norwegian_ispell', 'sjokoladefabrikk'); --- 2735,2796 + + MySpell is very similar to Hunspell. + The .affix file of Hunspell has the following structure: + + PFX A Y 1 + PFX A 0 re . + SFX T N 4 + SFX T 0 st e + SFX T y iest [^aeiou]y + SFX T 0 est[aeiou]y + SFX T 0 est[^ey] + + + + + The first line of an affix class is the header. Fields of an affix rules are listed after the header: + + + + + parameter name (PFX or SFX) + + + + + flag (name of the affix class) + + + + + stripping characters from beginning (at prefix) or end (at suffix) of the word + + + + + adding affix + + + + + condition that has a format similar to the format of regular expressions. 
+ + + + + + The .dict file looks like the .dict file of + Ispell: + + larder/M + lardy/RT + large/RSPMYT + largehearted + + + MySpell does not support compound words.
Re: [HACKERS] Mac OS: invalid byte sequence for encoding "UTF8"
On 27.01.2016 14:14, Stas Kelvich wrote: Hi. I tried that and can confirm the strange behaviour. It seems that the problem is with the small Cyrillic letter ‘х’. (simplest obscene language filter? =) That can be reproduced with a simpler test. Stas The test program was corrected. Now it uses the wchar_t type, and it works correctly and gives the right output. I think NIImportOOAffixes() in spell.c should be corrected to avoid this bug. -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company #include <stdio.h> #include <locale.h> #include <wchar.h> char *src = "SFX Y хаться шутся хаться"; int main(int argc, char *argv[]) { wchar_t c1[1024], c2[1024], c3[1024], c4[1024], c5[1024]; setlocale(LC_CTYPE, "ru_RU.UTF-8"); sscanf(src, "%6ls %204ls %204ls %204ls %204ls", c1, c2, c3, c4, c5); printf("%ls/%ls/%ls/%ls/%ls\n", c1, c2, c3, c4, c5); return 0; }
Re: [HACKERS] Mac OS: invalid byte sequence for encoding "UTF8"
On 27.01.2016 13:46, Shulgin, Oleksandr wrote: Not sure why the file uses "SET KOI8-R" directive then? This directive is used only by the Hunspell program. PostgreSQL ignores this directive and assumes that input affix and dictionary files are in the UTF-8 encoding. What error message do you get with this test program? (I don't get any, but I'm not on Mac OS.) -- Alex With this program you will get the wrong output. No error message is raised. You can execute the following commands: > cc test.c -o test > ./test You will get the output: SFX/Y/?/аться/шутся Although the output should be: SFX/Y/хаться/шутся/хаться -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
[HACKERS] Mac OS: invalid byte sequence for encoding "UTF8"
Hello. When a user tries to create a text search dictionary for the Russian language on Mac OS, the following error message is raised: CREATE EXTENSION hunspell_ru_ru; + ERROR: invalid byte sequence for encoding "UTF8": 0xd1 + CONTEXT: line 341 of configuration file "/Users/stas/code/postgrespro2/tmp_install/Users/stas/code/postgrespro2/install/share/tsearch_data/ru_ru.affix": "SFX Y хаться шутсяхаться The Russian dictionary was downloaded from http://extensions.openoffice.org/en/project/slovari-dlya-russkogo-yazyka-dictionaries-russian Affix and dictionary files were extracted from the archive and converted to UTF-8. Also, a converted dictionary can be downloaded from https://github.com/select-artur/hunspell_dicts/tree/master/ru_ru This behavior occurs on: - Mac OS X 10.10 Yosemite and Mac OS X 10.11 El Capitan. - the latest PostgreSQL version from git and PostgreSQL 9.5 (probably also 9.4.5). There is also a test to reproduce this bug in the attachment. Have you encountered this bug? Do you have a solution or a workaround? Thanks in advance. -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company #include <stdio.h> #include <locale.h> char *src = "SFX Y хаться шутся хаться"; int main(int argc, char *argv[]) { char c1[1024], c2[1024], c3[1024], c4[1024], c5[1024]; setlocale(LC_CTYPE, "ru_RU.UTF-8"); sscanf(src, "%6s %204s %204s %204s %204s", c1, c2, c3, c4, c5); printf("%s/%s/%s/%s/%s\n", c1, c2, c3, c4, c5); return 0; }
Re: [HACKERS] easy way of copying regex_t
On 25.01.2016 13:07, Tomas Vondra wrote: Right, it's definitely not thread-safe so there'd need to be some lock protecting the regex_t copy. I was thinking about either using a group of locks, each protecting a small subset of the affixes (thus making it possible to work in parallel to some extent), or simply using a single lock and each process would make a private copy at the beginning. In the end, I've decided to do it differently, and simply parse the affix list from scratch in each process. The affix list is tiny and takes less than a millisecond to parse in most cases, and I don't have to care about the regex stuff at all. The main benefit is from sharing parsed wordlist anyway. This is a nice decision, since the affix list is small. For our task I will change shared_ispell to use this solution. It's an old-school shared segment created by the extension at init time. You're right the size is fixed so it's possible to run out of space by loading too many dictionaries, but that was not a big deal for the type of setups it was designed for - in those cases the list of dictionaries is stable, so it's possible to size the segment accordingly in advance. But I guess we could do better now that we have dynamic shared memory, possibly allocating one segment per dictionary as needed, or something like that. regards Yes, it would be better, as we would not need to define the maximum size of the shared segment in postgresql.conf. -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Re: [HACKERS] easy way of copying regex_t
Hi all, I've been working on an extension that moves the ispell dictionaries to a shared segment. It's almost complete; the last FIXME is about copying the regex_t structure (stored in AFFIX). According to regex.h the structure is fairly complex and not exactly easy to understand, so I'd like to know if anyone here has already implemented that, or something that may serve the same purpose. Any ideas? kind regards Tomas This message is a reply to the message http://www.postgresql.org/message-id/dd02a31fdeffbf5cb24771e34213b40f.squir...@sq.gransy.com Sorry, I can't reply to it directly; I can't get it from the archive. Thank you for your extension shared_ispell. It is very useful. I got it from https://github.com/tvondra/shared_ispell With this message I want to send a patch to your repository with a draft of code which allows shared_ispell to copy regex_t. The main idea of the patch: - we don't need to copy the whole regex_t structure - most fields and structures are used only at compile time - we need to copy the structures guts, colormap, subre and cnfa - from the subre structure we need only cnfa colormap represents a directed acyclic graph. cnfa represents a nondeterministic finite automaton. The patch also does the following: - added regression tests - deleted spell.h and spell.c, since they contained duplicate code - added shared_ispell.h, which declares some structures - fixed an issue where stopFile can be NULL - fixed countCMPDAffixes, since theoretically there could be zero affixes - added copyCMPDAffix A question to hackers: can such a patch be useful as a PostgreSQL patch for full-text search? Is it needed? shared_ispell loads dictionaries into shared memory. The main benefits are: - saving memory. Every dictionary is loaded only once; dictionaries are not loaded per backend. In the current version of PostgreSQL dictionaries are loaded in each backend where they are requested. - saving time. The first load of a dictionary takes much time. 
With this patch dictionaries will be loaded only once. -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company *** a/Makefile --- b/Makefile *** *** 1,18 MODULE_big = shared_ispell ! OBJS = src/shared_ispell.o src/spell.o EXTENSION = shared_ispell ! DATA = sql/shared_ispell--1.0.0.sql ! MODULES = shared_ispell ! CFLAGS=`pg_config --includedir-server` PG_CONFIG = pg_config PGXS := $(shell $(PG_CONFIG) --pgxs) include $(PGXS) ! ! all: shared_ispell.so ! ! shared_ispell.so: $(OBJS) ! ! %.o : src/%.c --- 1,20 + # contrib/shared_ispell/Makefile + MODULE_big = shared_ispell ! OBJS = src/shared_ispell.o EXTENSION = shared_ispell ! DATA = sql/shared_ispell--1.1.0.sql ! REGRESS = shared_ispell + ifdef USE_PGXS PG_CONFIG = pg_config PGXS := $(shell $(PG_CONFIG) --pgxs) include $(PGXS) ! else ! subdir = contrib/shared_ispell ! top_builddir = ../.. ! include $(top_builddir)/src/Makefile.global ! include $(top_srcdir)/contrib/contrib-global.mk ! endif *** a/README.md --- b/README.md *** *** 13,28 If you need just snowball-type dictionaries, this extension is not really interesting for you. But if you really need an ispell dictionary, this may save you a lot of resources. - Warning - --- - The extension does not yet handle affixes that require full regular - expressions (regex_t, implemented in regex.h). This is indicated by - an error when initializing the dictionary. - - Simple affixes and affixes that can be handled by fast regex subset - (as implemented in regis.h) are handled just fine. - - Install --- Installing the extension is quite simple, especially if you're on 9.1. 
--- 13,18 *** /dev/null --- b/expected/shared_ispell.out *** *** 0 --- 1,193 + CREATE EXTENSION shared_ispell; + -- Test ISpell dictionary with ispell affix file + CREATE TEXT SEARCH DICTIONARY shared_ispell ( + Template=shared_ispell, + DictFile=ispell_sample, + AffFile=ispell_sample + ); + SELECT ts_lexize('shared_ispell', 'skies'); + ts_lexize + --- + {sky} + (1 row) + + SELECT ts_lexize('shared_ispell', 'bookings'); +ts_lexize + + {booking,book} + (1 row) + + SELECT ts_lexize('shared_ispell', 'booking'); +ts_lexize + + {booking,book} + (1 row) + + SELECT ts_lexize('shared_ispell', 'foot'); + ts_lexize + --- + {foot} + (1 row) + + SELECT ts_lexize('shared_ispell', 'foots'); + ts_lexize + --- + {foot} + (1 row) + + SELECT ts_lexize('shared_ispell', 'rebookings'); +ts_lexize + + {booking,book} + (1 row) + + SELECT ts_lexize('shared_ispell', 'rebooking'); +
Re: [HACKERS] Fuzzy substring searching with the pg_trgm extension
On 12.01.2016 02:31, Alvaro Herrera wrote: I gave a quick look through the patch and noticed a few minor things while trying to understand it. I think the test corpus isn't particularly interesting for how big it is. I'd rather have (a) a small corpus (say 100 words) with which to do detailed regression testing, and (b) some larger document for more extensive testing. I'm not sure (b) is actually necessary. Overall I think the new functions could stand a lot more commentary. Thank you for the review. I will send a fixed patch in a few days. -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support
On 09.01.2016 05:38, Alvaro Herrera wrote: Artur Zakirov wrote: Now almost all dictionaries are loaded into PostgreSQL. But the da_dk dictionary does not load. I see the following error: ERROR: invalid regular expression: quantifier operand invalid CONTEXT: line 439 of configuration file "/home/artur/progs/pgsql/share/tsearch_data/da_dk.affix": "SFX 55 0 s +GENITIV If you open the affix file in an editor, you can see that the format of affix 55 on line 439 is incorrect (screen1.png): [ another email ] I had also implemented a patch that fixes an error from the e-mail http://www.postgresql.org/message-id/562e1073.8030...@postgrespro.ru That patch just ignores the error. I think it's a bad idea to just ignore these syntax errors. This affix file is effectively corrupt, after all, so it seems a bad idea that we need to cope with it. I think it would be better to raise the error normally and instruct the user to fix the file; obviously it's better if the upstream provider of the file fixes it. Now, if there is proof somewhere that the file is correct, then the code must cope in some reasonable way. But in any case I don't think this change is acceptable ... it can only cause pain, in the long run. This error is raised in the Danish dictionary because of an erroneous entry in the .affix file. I sent a bug report to the developer. He fixed this bug. The corrected dictionary can be downloaded from the LibreOffice site. I will undo the changes and the error will be raised. I will update the patch soon. -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support
Thanks for review. On 09.01.2016 02:04, Alvaro Herrera wrote: Artur Zakirov wrote: --- 74,80 typedef struct aff_struct { ! uint32 flag:16, type:1, flagflags:7, issimple:1, By doing this, you're using 40 bits of a 32-bits-wide field. What does this mean? Are the final 8 bits lost? Does the compiler allocate a second uint32 member for those additional bits? I don't know, but I don't think this is a very clean idea. No, 8 bits are not lost. This 8 bits are used if a dictionary uses double extended ASCII character flag type (Conf->flagMode == FM_LONG) or decimal number flag type (Conf->flagMode == FM_NUM). If a dictionary uses single extended ASCII character flag type (Conf->flagMode == FM_CHAR), then 8 bits lost. You can see it in decodeFlag function. This function is used in NIImportOOAffixes function, decode affix flag from string type and store in flag field (flag:16). typedef struct spell_struct { ! struct { ! int affix; ! int len; ! } d; ! /* !* flag is filled in by NIImportDictionary. After NISortDictionary, d !* is valid and flag is invalid. !*/ ! char *flag; charword[FLEXIBLE_ARRAY_MEMBER]; } SPELL; Here you removed the union, with no rationale for doing so. Why did you do it? Can it be avoided? Because of the comment, I'd imagine that d and flag are valid at different times, so at any time we care about only one of them; but you haven't updated the comment stating the reason for that no longer to be the case. I suspect you need to keep flag valid after NISortDictionary has been called, but if so why? If "flag" is invalid as the comment says, what's the reason for keeping it? Union was removed because the flag field need to be dynamically sized. It had 16 size in the previous version. In this field flag set are stored. For example, if .dict file has the entry: mitigate/NDSGny Then the "NDSGny" is stored in the flag field. But in some cases a flag set can have size bigger than 16. 
I added this changes after this message http://www.postgresql.org/message-id/CAE2gYzwom3=11u9g8zxmt5plkzrwb12bwzxh4db3hud89fo...@mail.gmail.com In that Turkish dictionary there are can be large flag set. For example: özek/2240,852,749,5026,2242,4455,2594,2597,4963,1608,494,2409 This flag set has 56 size. This "flag" is valid all the time. It is used in NISortDictionary and it is not used after NISortDictionary function has been called. Maybe you right and there are no reason for keeping it, and it is necessary to store all flags in separate variable, that will be deleted after NISortDictionary has been called. The routines decodeFlag and isAffixFlagInUse could do with more comments. Your patch adds zero. Actually the whole file has not nearly enough comments; adding some more would be very good. Actually, after some more reading, I think this code is pretty terrible. I have a hard time figuring out how the original works, which makes it even more difficult to figure out whether your changes make sense. I would have to take your patch on faith, which doesn't sound so great an idea. palloc / cpalloc / tmpalloc make the whole mess even more confusing. Why does this file have three ways to allocate memory? Not sure what's a good way to go about this. I am certainly not going to commit this as is, because if I do whatever bugs you have are going to become my problem; and with the severe lack of documentation and given how fiddly this stuff is, I bet there are going to be a bunch of bugs. I suspect most committers are going to be in the same position. I think you should start by adding a few comments here and there on top of the original to explain how it works, then your patch on top. I suppose it's going to be a lot of work for you but I don't see any other way. A top-level overview about it would be good, too. The current comment at top of file states: * spell.c * Normalizing word with ISpell which is, err, somewhat laconic. 
I will add comments explaining how it works. Maybe I will also add some explanation of the dictionary structure. I will update the patch soon. -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Re: [HACKERS] plpgsql - DECLARE - cannot to use %TYPE or %ROWTYPE for composite types
Hi. I have tried to do some review of this patch. Below are my comments. Introduction: This patch fixes and adds the following functionality: - %TYPE - can now be used for composite types. - %ARRAYTYPE - new functionality, provides the array type from a variable or table column. - %ELEMENTTYPE - new functionality, provides the element type of a given array. New regression tests are included in the patch. Changes to the documentation are not provided. Initial Run: The patch applies correctly to HEAD. Regression tests pass successfully, without errors. It seems that the patch works as expected. Performance: The patch does not seem to introduce performance issues for the existing functionality. Coding: The style looks fine. I attached a patch that makes some corrections in code and documentation. I have corrected indentation in pl_comp.c and the "read_datatype" function in pl_gram.y. I think the changes in the "read_datatype" function are better for avoiding code duplication. But I could be wrong, of course. Conclusion: The patch could be applied to master with documentation corrections. But I'm not sure that your task can be resolved only by adding %ARRAYTYPE and %ELEMENTTYPE. Could you give some examples? -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company *** a/doc/src/sgml/plpgsql.sgml --- b/doc/src/sgml/plpgsql.sgml *** *** 710,715 SELECT merge_fields(t.*) FROM table1 t WHERE ... ; --- 710,756 + +Array Types + + + variable%ARRAYTYPE + + + + %ARRAYTYPE provides the array type from a variable or + table column. You can use this to declare array variables that will + hold database values. To declare an array variable that will hold + values from users.user_id you can write: + + user_id users.user_id%ARRAYTYPE; + + + + + By using %ARRAYTYPE you don't need to know the data + type of elements you are referencing. 
+ + + + +Array Element Types + + + variable%ELEMENTTYPE + + + + %ELEMENTTYPE provides the element type of a given + array. To declare a variable with the same data type as + users array element you can write: + + user_id users%ELEMENTTYPE; + + + + + Collation of PL/pgSQL Variables *** a/src/pl/plpgsql/src/pl_comp.c --- b/src/pl/plpgsql/src/pl_comp.c *** *** 1619,1625 plpgsql_parse_tripword(char *word1, char *word2, char *word3, /* * Derive type from ny base type controlled by reftype_mode - * */ static PLpgSQL_type * derive_type(PLpgSQL_type *base_type, int reftype_mode) --- 1619,1624 *** *** 1661,1667 derive_type(PLpgSQL_type *base_type, int reftype_mode) ereport(ERROR, (errcode(ERRCODE_DATATYPE_MISMATCH), errmsg("there are not array type for type %s", ! format_type_be(base_type->typoid; return plpgsql_build_datatype(typoid, -1, plpgsql_curr_compile->fn_input_collation); --- 1660,1666 ereport(ERROR, (errcode(ERRCODE_DATATYPE_MISMATCH), errmsg("there are not array type for type %s", ! format_type_be(base_type->typoid; return plpgsql_build_datatype(typoid, -1, plpgsql_curr_compile->fn_input_collation); *** a/src/pl/plpgsql/src/pl_gram.y --- b/src/pl/plpgsql/src/pl_gram.y *** *** 2694,2700 read_datatype(int tok) StringInfoData ds; char *type_name; int startlocation; ! PLpgSQL_type *result; int parenlevel = 0; /* Should only be called while parsing DECLARE sections */ --- 2694,2700 StringInfoData ds; char *type_name; int startlocation; ! PLpgSQL_type *result = 0; int parenlevel = 0; /* Should only be called while parsing DECLARE sections */ *** *** 2720,2751 read_datatype(int tok) tok = yylex(); if (tok_is_keyword(tok, &yylval, K_TYPE, "type")) - { result = plpgsql_parse_wordtype(dtname, PLPGSQL_REFTYPE_TYPE); ! if (result) ! return result; ! } ! if (tok_is_keyword(tok, &yylval, ! K_ELEMENTTYPE, "elementtype")) ! { result = plpgsql_parse_wordtype(dtname, PLPGSQL_REFTYPE_ELEMENT); ! if (result) ! return result; ! } ! if (tok_is_keyword(tok, &yylval, ! 
K_ARRAYTYPE, "arraytype")) ! { result = plpgsql_parse_wordtype(dtname, PLPGSQL_REFTYPE_ARRAY); - if (result) - return result; - } else if (tok_is_keyword(tok, &yylval, K_ROWTYPE, "rowtype")) - { result = plpgsql_parse_wordrowtype(dtname); ! if (result) ! return result; ! } } } else if (plpgsql_token_is_unreserved_keyword(tok)) --- 2720,2737 tok = yylex(); if (tok_is_keyword(tok, &
Re: [HACKERS] Fuzzy substring searching with the pg_trgm extension
On 18.12.2015 22:43, Artur Zakirov wrote: Hello. PostgreSQL has a contrib module named pg_trgm. It is used for fuzzy text search. It provides some functions and operators for determining the similarity of the given texts using trigram matching. Sorry, I forgot to mark the previous message with [PROPOSAL]. I registered the patch in the commitfest: https://commitfest.postgresql.org/8/448/ -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Re: [HACKERS] Allow replication roles to use file access functions
On 03.09.2015 05:40, Michael Paquier wrote: Ah, OK. I thought that you were referring to a protocol where the caller sends a single LSN, from which it gets a differential backup that needs to scan all the relation files of the source cluster to get the data blocks with an LSN newer than the one sent, and then sends them back to the caller. I guess that what you are suggesting instead is an approach where the caller sends something like that through the replication protocol with a relation OID and a block list: BLOCK_DIFF relation_oid BLOCK_LIST m,n,[o, ...] Which is close to what pg_read_binary_file does now for a superuser. We would also need to extend BASE_BACKUP so that it does not include relation files for this use case. Regards, Hi, we need to run pg_rewind without using a superuser role too. What if a new parameter, EXCLUDE_DATA_FILES, were added to the BASE_BACKUP command? This parameter would force the BASE_BACKUP command not to include relation files. And pg_rewind could execute, for example, the following command: BASE_BACKUP LABEL 'pg_rewind base backup' WAL EXCLUDE_DATA_FILES This command would be executed if the --source-server parameter is defined. Are there any pitfalls in this approach? -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support
On 16.11.2015 15:51, Artur Zakirov wrote: On 10.11.2015 13:23, Artur Zakirov wrote: Link to patch in commitfest: https://commitfest.postgresql.org/8/420/ Link to regression tests: https://dl.dropboxusercontent.com/u/15423817/HunspellDictTest.tar.gz I have made some changes to the documentation in section "12.6. Dictionaries". I have added a description of how to load Ispell and Hunspell dictionaries and a description of the Ispell and Hunspell formats. Patches for the documentation and for the code are attached separately. -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company *** a/src/backend/tsearch/spell.c --- b/src/backend/tsearch/spell.c *** *** 153,159 cmpspell(const void *s1, const void *s2) static int cmpspellaffix(const void *s1, const void *s2) { ! return (strncmp((*(SPELL *const *) s1)->p.flag, (*(SPELL *const *) s2)->p.flag, MAXFLAGLEN)); } static char * --- 153,159 static int cmpspellaffix(const void *s1, const void *s2) { ! return (strncmp((*(SPELL *const *) s1)->flag, (*(SPELL *const *) s2)->flag, MAXFLAGLEN)); } static char * *** *** 237,242 cmpaffix(const void *s1, const void *s2) --- 237,309 (const unsigned char *) a2->repl); } + static unsigned short + decodeFlag(IspellDict *Conf, char *sflag, char **sflagnext) + { + unsigned short s; + char *next; + + switch (Conf->flagMode) + { + case FM_LONG: + s = (int)sflag[0] << 8 | (int)sflag[1]; + if (sflagnext) + *sflagnext = sflag + 2; + break; + case FM_NUM: + s = (unsigned short) strtol(sflag, &next, 10); + if (sflagnext) + { + if (next) + { + *sflagnext = next; + while (**sflagnext) + { + if (**sflagnext == ',') + { + *sflagnext = *sflagnext + 1; + break; + } + *sflagnext = *sflagnext + 1; + } + } + else + *sflagnext = 0; + } + break; + default: + s = (unsigned short) *((unsigned char *)sflag); + if (sflagnext) + *sflagnext = sflag + 1; + } + + return s; + } + + static bool + isAffixFlagInUse(IspellDict *Conf, int affix, unsigned short affixflag) + { + char
*flagcur; + char *flagnext = 0; + + if (affixflag == 0) + return true; + + flagcur = Conf->AffixData[affix]; + + while (*flagcur) + { + if (decodeFlag(Conf, flagcur, &flagnext) == affixflag) + return true; + if (flagnext) + flagcur = flagnext; + else + break; + } + + return false; + } + static void NIAddSpell(IspellDict *Conf, const char *word, const char *flag) { *** *** 255,261 NIAddSpell(IspellDict *Conf, const char *word, const char *flag) } Conf->Spell[Conf->nspell] = (SPELL *) tmpalloc(SPELLHDRSZ + strlen(word) + 1); strcpy(Conf->Spell[Conf->nspell]->word, word); ! strlcpy(Conf->Spell[Conf->nspell]->p.flag, flag, MAXFLAGLEN); Conf->nspell++; } --- 322,328 } Conf->Spell[Conf->nspell] = (SPELL *) tmpalloc(SPELLHDRSZ + strlen(word) + 1); strcpy(Conf->Spell[Conf->nspell]->word, word); ! Conf->Spell[Conf->nspell]->flag = (*flag != '\0') ? cpstrdup(Conf, flag) : VoidString; Conf->nspell++; } *** *** 355,361 FindWord(IspellDict *Conf, const char *word, int affixflag, int flag) else if ((flag & StopMiddle->compoundflag) == 0) return 0; ! if ((affixflag == 0) || (strchr(Conf->AffixData[StopMiddle->affix], affixflag) != NULL)) return 1; } node = StopMiddle->node; --- 422,428 else if ((flag & StopMiddle->compoundflag) == 0) return 0; ! if (isAffixFlagInUse(Conf, StopMiddle->affix, affixflag)) return 1; } node = StopMiddle->node; *** *** 394,400 NIAddAffix(IspellDict *Conf, int flag, char flagflags, const char *mask, const c Affix = Conf->Affix + Conf->naffixes; ! if (strcmp(mask, ".") == 0) { Affix->issimple = 1; Affix->isregis = 0; --- 461,467 Affix = Conf->Affix + Conf->naffixes; ! if (strcmp(mask, ".") == 0 || *mask == '\0') { Affix->issimple = 1; Affix->isregis = 0; *** *** 429,443 NIAddAffix(IspellDict *Conf, int flag, char flagflags, const char *mask, const c err = pg_regcomp(&(Affix->reg.regex), wmask, wmasklen, REG_ADVANCED | REG_NOSUB, DEFAULT_COLLATION_OID); if (err) ! { ! char errstr[100]; ! ! pg_regerror(err, &(Affix->reg.regex), errstr, sizeof(errstr)); ! 
ereport(ERROR, ! (errcode(ERRCODE_INVALID_REGULAR_EXPRESSION), ! errmsg("invalid regular expression: %s", errstr))); ! } } Affix->flagflags = flagflags; --- 496,504 err = pg_regcomp
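For reference, the flag decoding performed by the patch's decodeFlag() can be modeled roughly as follows. This is an illustrative Python sketch, not the C code above; the mode names come from the patch, but the list-returning interface is invented for the example:

```python
def decode_flags(s, flag_mode):
    """Decode an affix flag string into numeric flag values, mirroring
    decodeFlag(): FM_LONG packs two characters into one 16-bit value,
    FM_NUM reads comma-separated decimal numbers, and the default mode
    treats each single character as one flag."""
    if flag_mode == "FM_LONG":
        # Two consecutive characters form one flag value.
        return [ord(s[i]) << 8 | ord(s[i + 1]) for i in range(0, len(s) - 1, 2)]
    if flag_mode == "FM_NUM":
        # Comma-separated decimal numbers, e.g. "123,456".
        return [int(part) for part in s.split(",") if part]
    # Default: every single character is its own flag.
    return [ord(c) for c in s]
```

With FM_LONG the field "ABCD" yields two flags and with FM_NUM "123,456" yields 123 and 456, which is what lets a dictionary address far more than 256 distinct affix classes.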
Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support
On 10.11.2015 13:23, Artur Zakirov wrote: Link to patch in commitfest: https://commitfest.postgresql.org/8/420/ Link to regression tests: https://dl.dropboxusercontent.com/u/15423817/HunspellDictTest.tar.gz Hello! Do you have any remarks or comments about my patch? -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support
08.11.2015 14:23, Artur Zakirov wrote: Thanks for the reply. This was because of the size of the flag field of the SPELL struct: long flags were being truncated in the .dict file. I attached a new patch. It is a temporary patch, not final; it can be done better. I have updated the patch and attached it. Now dynamic memory allocation is used for the flag field of the SPELL struct. I have measured dictionary loading time and memory usage with the new patch. A dictionary is loaded at the first reference to it, for example when we execute the ts_lexize function, so the first ts_lexize call takes more time than the second. The following table shows the performance of some dictionaries before and after the patch on my computer.

-------------------------------------------------
|       | loading time, ms  | memory, MB        |
|       | before  | after   | before  | after   |
-------------------------------------------------
| ar    |  700    |  300    |  23.7   |  15.7   |
| br_fr |  410    |  450    |  27.4   |  27.5   |
| ca    |  248    |  245    |  14.7   |  15.4   |
| en_us |  100    |  100    |   5.4   |   6.2   |
| fr    |  160    |  178    |  13.7   |  14.1   |
| gl_es |  160    |  150    |   9.0   |   9.4   |
| is    |  260    |  202    |  16.1   |  16.3   |
-------------------------------------------------

As you can see, loading time and memory usage are essentially the same before and after the patch. Link to patch in commitfest: https://commitfest.postgresql.org/8/420/ Link to regression tests: https://dl.dropboxusercontent.com/u/15423817/HunspellDictTest.tar.gz -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company *** a/src/backend/tsearch/spell.c --- b/src/backend/tsearch/spell.c *** *** 153,159 cmpspell(const void *s1, const void *s2) static int cmpspellaffix(const void *s1, const void *s2) { ! return (strncmp((*(SPELL *const *) s1)->p.flag, (*(SPELL *const *) s2)->p.flag, MAXFLAGLEN)); } static char * --- 153,159 static int cmpspellaffix(const void *s1, const void *s2) { !
return (strncmp((*(SPELL *const *) s1)->flag, (*(SPELL *const *) s2)->flag, MAXFLAGLEN)); } static char * *** *** 237,242 cmpaffix(const void *s1, const void *s2) --- 237,309 (const unsigned char *) a2->repl); } + static unsigned short + decodeFlag(IspellDict *Conf, char *sflag, char **sflagnext) + { + unsigned short s; + char *next; + + switch (Conf->flagMode) + { + case FM_LONG: + s = (int)sflag[0] << 8 | (int)sflag[1]; + if (sflagnext) + *sflagnext = sflag + 2; + break; + case FM_NUM: + s = (unsigned short) strtol(sflag, &next, 10); + if (sflagnext) + { + if (next) + { + *sflagnext = next; + while (**sflagnext) + { + if (**sflagnext == ',') + { + *sflagnext = *sflagnext + 1; + break; + } + *sflagnext = *sflagnext + 1; + } + } + else + *sflagnext = 0; + } + break; + default: + s = (unsigned short) *((unsigned char *)sflag); + if (sflagnext) + *sflagnext = sflag + 1; + } + + return s; + } + + static bool + isAffixFlagInUse(IspellDict *Conf, int affix, unsigned short affixflag) + { + char *flagcur; + char *flagnext = 0; + + if (affixflag == 0) + return true; + + flagcur = Conf->AffixData[affix]; + + while (*flagcur) + { + if (decodeFlag(Conf, flagcur, &flagnext) == affixflag) + return true; + if (flagnext) + flagcur = flagnext; + else + break; + } + + return false; + } + static void NIAddSpell(IspellDict *Conf, const char *word, const char *flag) { *** *** 255,261 NIAddSpell(IspellDict *Conf, const char *word, const char *flag) } Conf->Spell[Conf->nspell] = (SPELL *) tmpalloc(SPELLHDRSZ + strlen(word) + 1); strcpy(Conf->Spell[Conf->nspell]->word, word); ! strlcpy(Conf->Spell[Conf->nspell]->p.flag, flag, MAXFLAGLEN); Conf->nspell++; } --- 322,328 } Conf->Spell[Conf->nspell] = (SPELL *) tmpalloc(SPELLHDRSZ + strlen(word) + 1); strcpy(Conf->Spell[Conf->nspell]->word, word); ! Conf->Spell[Conf->nspell]->flag = (*flag != '\0') ? 
cpstrdup(Conf, flag) : VoidString; Conf->nspell++; } *** *** 355,361 FindWord(IspellDict *Conf, const char *word, int affixflag, int flag) else if ((flag & StopMiddle->compoundflag) == 0) return 0; ! if ((affixflag == 0) || (strchr(Conf->AffixData[StopMiddle->affix], affixflag) != NULL)) return 1; } node = StopMiddle->node; --- 422,428 else if ((flag & StopMiddle->compoundflag) == 0) return 0; ! if (isAffixFlagInUse(Conf, StopMiddle->affix, affixflag)) return 1; } node = StopMiddle->node;
Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support
On 07.11.2015 17:20, Emre Hasegeli wrote: It seems to have something to do with the order of the affixes. It works if I move affix 2646 to the beginning of the list. [1] https://tr-spell.googlecode.com/files/dict_aff_5000_suffix_113_words.zip Thanks for the reply. This was because of the size of the flag field of the SPELL struct: long flags were being truncated in the .dict file. I attached a new patch. It is a temporary patch, not final; it can be done better. -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company *** a/src/backend/tsearch/spell.c --- b/src/backend/tsearch/spell.c *** *** 153,159 cmpspell(const void *s1, const void *s2) static int cmpspellaffix(const void *s1, const void *s2) { ! return (strncmp((*(SPELL *const *) s1)->p.flag, (*(SPELL *const *) s2)->p.flag, MAXFLAGLEN)); } static char * --- 153,159 static int cmpspellaffix(const void *s1, const void *s2) { ! return (strcmp((*(SPELL *const *) s1)->flag, (*(SPELL *const *) s2)->flag)); } static char * *** *** 237,242 cmpaffix(const void *s1, const void *s2) --- 237,309 (const unsigned char *) a2->repl); } + static unsigned short + decodeFlag(IspellDict *Conf, char *sflag, char **sflagnext) + { + unsigned short s; + char *next; + + switch (Conf->flagMode) + { + case FM_LONG: + s = (int)sflag[0] << 8 | (int)sflag[1]; + if (sflagnext) + *sflagnext = sflag + 2; + break; + case FM_NUM: + s = (unsigned short) strtol(sflag, &next, 10); + if (sflagnext) + { + if (next) + { + *sflagnext = next; + while (**sflagnext) + { + if (**sflagnext == ',') + { + *sflagnext = *sflagnext + 1; + break; + } + *sflagnext = *sflagnext + 1; + } + } + else + *sflagnext = 0; + } + break; + default: + s = (unsigned short) *((unsigned char *)sflag); + if (sflagnext) + *sflagnext = sflag + 1; + } + + return s; + } + + static bool + isAffixFlagInUse(IspellDict *Conf, int affix, unsigned short affixflag) + { + char *flagcur; + char *flagnext = 0; + + if (affixflag == 0) + return true; + + flagcur =
Conf->AffixData[affix]; + + while (*flagcur) + { + if (decodeFlag(Conf, flagcur, &flagnext) == affixflag) + return true; + if (flagnext) + flagcur = flagnext; + else + break; + } + + return false; + } + static void NIAddSpell(IspellDict *Conf, const char *word, const char *flag) { *** *** 255,261 NIAddSpell(IspellDict *Conf, const char *word, const char *flag) } Conf->Spell[Conf->nspell] = (SPELL *) tmpalloc(SPELLHDRSZ + strlen(word) + 1); strcpy(Conf->Spell[Conf->nspell]->word, word); ! strlcpy(Conf->Spell[Conf->nspell]->p.flag, flag, MAXFLAGLEN); Conf->nspell++; } --- 322,328 } Conf->Spell[Conf->nspell] = (SPELL *) tmpalloc(SPELLHDRSZ + strlen(word) + 1); strcpy(Conf->Spell[Conf->nspell]->word, word); ! Conf->Spell[Conf->nspell]->flag = cpstrdup(Conf, flag); Conf->nspell++; } *** *** 355,361 FindWord(IspellDict *Conf, const char *word, int affixflag, int flag) else if ((flag & StopMiddle->compoundflag) == 0) return 0; ! if ((affixflag == 0) || (strchr(Conf->AffixData[StopMiddle->affix], affixflag) != NULL)) return 1; } node = StopMiddle->node; --- 422,428 else if ((flag & StopMiddle->compoundflag) == 0) return 0; ! if (isAffixFlagInUse(Conf, StopMiddle->affix, affixflag)) return 1; } node = StopMiddle->node; *** *** 394,400 NIAddAffix(IspellDict *Conf, int flag, char flagflags, const char *mask, const c Affix = Conf->Affix + Conf->naffixes; ! if (strcmp(mask, ".") == 0) { Affix->issimple = 1; Affix->isregis = 0; --- 461,467 Affix = Conf->Affix + Conf->naffixes; ! if (strcmp(mask, ".") == 0 || *mask == '\0') { Affix->issimple = 1; Affix->isregis = 0; *** *** 429,443 NIAddAffix(IspellDict *Conf, int flag, char flagflags, const char *mask, const c err = pg_regcomp(&(Affix->reg.regex), wmask, wmasklen, REG_ADVANCED | REG_NOSUB, DEFAULT_COLLATION_OID); if (err) ! { ! char errstr[100]; ! ! pg_regerror(err, &(Affix->reg.regex), errstr, sizeof(errstr)); ! ereport(ERROR, ! (errcode(ERRCODE_INVALID_REGULAR_EXPRESSION), ! 
errmsg("invalid regular expression: %s", errstr))); ! } } Affix->flagflags = flagflags; --- 496,504 err = pg_regcomp(&(Affix->reg.regex), wmask, wmasklen, REG_ADVANCED | REG_NOSUB, DEFAULT_COLLATION_OID); + /* Ignore regul
Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support
06.11.2015 12:33, Artur Zakirov wrote: Hello again! Patches === Link to commitfest: https://commitfest.postgresql.org/8/420/ -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support
Hello again! Patches === I have implemented support for the FLAG long, FLAG num and AF parameters. I attached a patch to this e-mail (hunspell-dict.patch). This patch allows using the Hunspell dictionaries listed in the previous e-mail: ar, br_fr, ca, ca_valencia, en_ca, en_gb, en_us, fr, gl_es, hu_hu, is, ne_np, nl_nl, si_lk. Most of the changes are in spell.c, in the affix file parsing code. The dictionary structure changes are the following: - the useFlagAliases and flagMode fields have been added to the IspellDict struct; - the flagval array size has been increased from 256 to 65000; - the flag field of the AFFIX struct has also been enlarged. I also implemented a patch that fixes an error from the e-mail http://www.postgresql.org/message-id/562e1073.8030...@postgrespro.ru This patch just ignores that error. Tests = Test dictionaries for loading into PostgreSQL and for normalizing with the ts_lexize function can be downloaded from https://dl.dropboxusercontent.com/u/15423817/HunspellDictTest.tar.gz It would be nice if somebody could additionally test dictionaries of well-known languages, because I do not know many of them. Other Improvements == There are also some parameters for compound words, but I am not sure that we want to use these parameters.
-- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company *** a/src/backend/tsearch/spell.c --- b/src/backend/tsearch/spell.c *** *** 237,242 cmpaffix(const void *s1, const void *s2) --- 237,309 (const unsigned char *) a2->repl); } + static unsigned short + decodeFlag(IspellDict *Conf, char *sflag, char **sflagnext) + { + unsigned short s; + char *next; + + switch (Conf->flagMode) + { + case FM_LONG: + s = (int)sflag[0] << 8 | (int)sflag[1]; + if (sflagnext) + *sflagnext = sflag + 2; + break; + case FM_NUM: + s = (unsigned short) strtol(sflag, &next, 10); + if (sflagnext) + { + if (next) + { + *sflagnext = next; + while (**sflagnext) + { + if (**sflagnext == ',') + { + *sflagnext = *sflagnext + 1; + break; + } + *sflagnext = *sflagnext + 1; + } + } + else + *sflagnext = 0; + } + break; + default: + s = (unsigned short) *((unsigned char *)sflag); + if (sflagnext) + *sflagnext = sflag + 1; + } + + return s; + } + + static bool + isAffixFlagInUse(IspellDict *Conf, int affix, unsigned short affixflag) + { + char *flagcur; + char *flagnext = 0; + + if (affixflag == 0) + return true; + + flagcur = Conf->AffixData[affix]; + + while (*flagcur) + { + if (decodeFlag(Conf, flagcur, &flagnext) == affixflag) + return true; + if (flagnext) + flagcur = flagnext; + else + break; + } + + return false; + } + static void NIAddSpell(IspellDict *Conf, const char *word, const char *flag) { *** *** 355,361 FindWord(IspellDict *Conf, const char *word, int affixflag, int flag) else if ((flag & StopMiddle->compoundflag) == 0) return 0; ! if ((affixflag == 0) || (strchr(Conf->AffixData[StopMiddle->affix], affixflag) != NULL)) return 1; } node = StopMiddle->node; --- 422,428 else if ((flag & StopMiddle->compoundflag) == 0) return 0; ! 
if (isAffixFlagInUse(Conf, StopMiddle->affix, affixflag)) return 1; } node = StopMiddle->node; *** *** 394,400 NIAddAffix(IspellDict *Conf, int flag, char flagflags, const char *mask, const c Affix = Conf->Affix + Conf->naffixes; ! if (strcmp(mask, ".") == 0) { Affix->issimple = 1; Affix->isregis = 0; --- 461,467 Affix = Conf->Affix + Conf->naffixes; ! if (strcmp(mask, ".") == 0 || *mask == '\0') { Affix->issimple = 1; Affix->isregis = 0; *** *** 595,604 addFlagValue(IspellDict *Conf, char *s, uint32 val) (errcode(ERRCODE_CONFIG_FILE_ERROR), errmsg("multibyte flag character is not allowed"))); ! Conf->flagval[*(unsigned char *) s] = (unsigned char) val; Conf->usecompound = true; } /* * Import an affix file that follows MySpell or Hunspell format */ --- 662,719 (errcode(ERRCODE_CONFIG_FILE_ERROR), errmsg("multibyte flag character is not allowed"))); ! Conf->flagval[decodeFlag(Conf, s, (char **)NULL)] = (unsigned char) val; Conf->usecompound = true; } + static int + getFlagValues(IspellDict *Conf, char *s) + { + uint32 flag = 0; + char *flagcur; + char *flagnext = 0; + + flagcur = s; + while (*flagcur) + { + flag |= Conf->flagval[decodeFlag(Conf, flagcur, &flagnext)]; + if (flagnext) + flagcur = flagnext; + else + break; + } + + return flag; + } + + /* + * Ge
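The getFlagValues() fragment above ORs together the compound-flag bits registered for each flag of a string. For the FM_NUM case it can be modeled like this (a hypothetical sketch; the flag numbers and bit values in the usage below are illustrative, not the actual FF_* constants from spell.h):

```python
def get_flag_values(flag_string, flagval):
    """OR together the bit values registered for every flag in a FM_NUM
    flag string (comma-separated numbers), in the spirit of getFlagValues()
    walking the flagval table that addFlagValue() fills in."""
    result = 0
    for part in flag_string.split(","):
        if part:
            result |= flagval.get(int(part), 0)  # unknown flags contribute nothing
    return result
```

For example, with flags 101 and 102 registered as compound-flag bits 1 and 2, the affix field "101,102" yields the combined mask 3.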
Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support
20.10.2015 17:00, Artur Zakirov wrote: These flag types are used in the affix files of such dictionaries as ar, br_fr, ca, ca_valencia, da_dk, en_ca, en_gb, en_us, fr, gl_es, is, ne_np, nl_nl, si_lk (from http://cgit.freedesktop.org/libreoffice/dictionaries/tree/). Now almost all dictionaries are loaded into PostgreSQL. But the da_dk dictionary does not load. I see the following error: ERROR: invalid regular expression: quantifier operand invalid CONTEXT: line 439 of configuration file "/home/artur/progs/pgsql/share/tsearch_data/da_dk.affix": "SFX 55 0 s +GENITIV If you open the affix file in an editor, you can see that the format of affix 55 on line 439 is incorrect (screen1.png): SFX 55 0 s +GENITIV An SFX entry should have 5 fields. There is no field between the "0" digit and the "s" symbol. "+GENITIV" is the optional morphological field and is ignored by PostgreSQL. I think it is an error in the affix file. I wrote an e-mail to i...@stavekontrolden.dk, the dictionary authors, about this error. What is the right decision in this case? Should PostgreSQL ignore this error and not report it? -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
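A mechanical check for the da_dk problem could look like this. This is a hypothetical helper, not PostgreSQL code; it treats a '+'-prefixed fifth token as a morphological annotation standing where the condition field should be, which is an assumption drawn from the +GENITIV example rather than the general Hunspell format:

```python
def is_valid_sfx_entry(line):
    """Check that an SFX entry line carries its five mandatory fields
    (SFX, flag, strip, affix, condition).  A fifth field starting with
    '+' is a morphological annotation, meaning the condition is missing."""
    fields = line.split()
    return (len(fields) >= 5
            and fields[0] == "SFX"
            and not fields[4].startswith("+"))
```

A well-formed entry such as "SFX 55 0 s ." passes, while the malformed "SFX 55 0 s +GENITIV" is flagged before its pseudo-condition ever reaches the regex compiler.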