I have attached a patch that emits the component parts of host, URL,
email and file tokens in addition to the tokens themselves. It also
makes sure that a host/url/email/file token and its first part-token
are at the same position in the tsvector.
The two major changes are:
1. Tokenization changes: The patch uses the special handlers in the
text parser to reset the parser position to the start of a
host/url/email/file token once that token has been recognized, so
that the same text is scanned again and emitted as parts. Special
handlers are already used to extract the host and url_path from a
full URL, so this is an extension of the same idea.
2. Position changes: The position is not advanced when a
host/url/email/file token is encountered. As a result, the first part
of that token gets the same position as the token itself.
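The two changes can be illustrated with a toy model in C. This is only a sketch under simplifying assumptions, not the state machine in wparser_def.c: it recognizes just dotted host names, labels every alphanumeric run as type 1, and the names Tok, host_len and toy_parse are made up for the illustration. The type ids 1, 6 and 12 are the ones printed in the attached regression output, and positions start at 0 as in the sample output.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Type ids as printed in the attached regression output. */
enum { T_WORD = 1, T_HOST = 6, T_BLANK = 12 };

typedef struct
{
    int    type;
    size_t start;   /* byte offset into the input */
    size_t len;     /* length in bytes */
    int    pos;     /* tsvector-style position, -1 for blanks */
} Tok;

static int is_alnum(char c) { return isalnum((unsigned char) c) != 0; }

/* Length of a host-like run ("foo.bar.com") at s, or 0 if there is none. */
static size_t host_len(const char *s)
{
    size_t i = 0;
    int    dots = 0;

    for (;;)
    {
        size_t label = i;

        while (is_alnum(s[i]))
            i++;
        if (i == label)
            return 0;               /* does not start with a label */
        if (s[i] != '.' || !is_alnum(s[i + 1]))
            break;                  /* no further label follows */
        dots++;
        i++;
    }
    return dots > 0 ? i : 0;
}

/* Emit the whole token first, then rewind and emit its parts. */
static size_t toy_parse(const char *s, Tok *out, size_t max)
{
    size_t n = 0, i = 0, partstop = 0;
    int    pos = 0;

    assert(s != NULL && out != NULL);
    while (s[i] != '\0' && n < max)
    {
        size_t len;

        if (partstop == 0 && (len = host_len(s + i)) > 0)
        {
            /* whole token: takes the current position, no advance */
            out[n++] = (Tok) {T_HOST, i, len, pos};
            partstop = i + len;     /* i is left alone: this is the rewind */
            continue;
        }
        len = 0;
        if (is_alnum(s[i]))
        {
            while (is_alnum(s[i + len]))
                len++;
            out[n++] = (Tok) {T_WORD, i, len, pos++};
        }
        else
        {
            while (s[i + len] != '\0' && !is_alnum(s[i + len]))
                len++;
            out[n++] = (Tok) {T_BLANK, i, len, -1};
        }
        i += len;
        if (partstop != 0 && i >= partstop)
            partstop = 0;           /* all parts of the token emitted */
    }
    return n;
}
```

Calling toy_parse("foo.bar.com", ...) yields the host token at position 0 followed by foo at 0, bar at 1 and com at 2, with the dots as blanks, which is the shape of the HOST sample in tokens_output.txt.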
Attachments:
tokens_output.txt: sample queries and results with the patch
token_v1.patch: patch against CVS HEAD
Currently, the patch outputs the parts as normal tokens such as
WORD, NUMWORD, etc. Tom argued earlier that this would break backward
compatibility, and that the parts should instead be output as parts of
the respective tokens. If there is agreement on Tom's point, the
current patch can be modified to output subtokens as parts. However,
before I complicate the patch with that, I wanted to get feedback on
any other major problems with the patch.
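For reference, the alternative of outputting subtokens as parts could look roughly like the sketch below. It is only an assumption about one possible shape, not something the attached patch does: the *_PART ids and the function part_type are hypothetical and do not exist in the parser. The ids 4, 6, 18 and 19 are the email, host, url_path and file ids seen in the regression output.

```c
#include <assert.h>

/* Existing token type ids, as printed in the regression output. */
enum { TOK_EMAIL = 4, TOK_HOST = 6, TOK_URLPATH = 18, TOK_FILE = 19 };

/* Hypothetical ids: no such token types exist in the parser today. */
enum
{
    TOK_EMAIL_PART = 101,
    TOK_HOST_PART = 102,
    TOK_URLPATH_PART = 103,
    TOK_FILE_PART = 104
};

/*
 * Map a part found while re-scanning a token of type parent_type to a
 * dedicated part type.  Outside of any such token (or for a parent we
 * do not know), the plain type chosen by the parser is kept.
 */
static int
part_type(int parent_type, int plain_type)
{
    assert(plain_type > 0);
    switch (parent_type)
    {
        case TOK_EMAIL:
            return TOK_EMAIL_PART;
        case TOK_HOST:
            return TOK_HOST_PART;
        case TOK_URLPATH:
            return TOK_URLPATH_PART;
        case TOK_FILE:
            return TOK_FILE_PART;
        default:
            return plain_type;
    }
}
```

With dedicated part types, existing configurations would see no new WORD or NUMWORD tokens, and indexing the parts would be a matter of mapping the part types to a dictionary.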
-Sushant.
On Mon, 2010-08-02 at 10:20 -0400, Tom Lane wrote:
> Sushant Sinha writes:
> >> This would needlessly increase the number of tokens. Instead you'd
> >> better make it work like compound word support, having just "wikipedia"
> >> and "org" as tokens.
>
> > The current text parser already returns url and url_path. That already
> > increases the number of unique tokens. I am only asking for adding of
> > normal english words as well so that if someone types only "wikipedia"
> > he gets a match.
>
> The suggestion to make it work like compound words is still a good one,
> ie given wikipedia.org you'd get back
>
> host wikipedia.org
> host-part wikipedia
> host-part org
>
> not just the "host" token as at present.
>
> Then the user could decide whether he needed to index hostname
> components or not, by choosing whether to forward hostname-part
> tokens to a dictionary or just discard them.
>
> If you submit a patch that tries to force the issue by classifying
> hostname parts as plain words, it'll probably get rejected out of
> hand on backwards-compatibility grounds.
>
> regards, tom lane
1. FILEPATH
testdb=# SELECT ts_debug('/stuff/index.html');
ts_debug
----------------------------------------------------------------------------------
(file,"File or path name",/stuff/index.html,{simple},simple,{/stuff/index.html})
(blank,"Space symbols",/,{},,)
(asciiword,"Word, all ASCII",stuff,{english_stem},english_stem,{stuff})
(blank,"Space symbols",/,{},,)
(asciiword,"Word, all ASCII",index,{english_stem},english_stem,{index})
(blank,"Space symbols",.,{},,)
(asciiword,"Word, all ASCII",html,{english_stem},english_stem,{html})
testdb=# SELECT to_tsvector('english', '/stuff/index.html');
to_tsvector
----------------------------------------------------
'/stuff/index.html':0 'html':2 'index':1 'stuff':0
(1 row)
2. URL
testdb=# SELECT ts_debug('http://example.com/stuff/index.html');
ts_debug
---------------------------------------------------------------------------------------
(protocol,"Protocol head",http://,{},,)
(url,URL,example.com/stuff/index.html,{simple},simple,{example.com/stuff/index.html})
(host,Host,example.com,{simple},simple,{example.com})
(asciiword,"Word, all ASCII",example,{english_stem},english_stem,{exampl})
(blank,"Space symbols",.,{},,)
(asciiword,"Word, all ASCII",com,{english_stem},english_stem,{com})
(url_path,"URL path",/stuff/index.html,{simple},simple,{/stuff/index.html})
(blank,"Space symbols",/,{},,)
(asciiword,"Word, all ASCII",stuff,{english_stem},english_stem,{stuff})
(blank,"Space symbols",/,{},,)
(asciiword,"Word, all ASCII",index,{english_stem},english_stem,{index})
(blank,"Space symbols",.,{},,)
(asciiword,"Word, all ASCII",html,{english_stem},english_stem,{html})
(13 rows)
testdb=# SELECT to_tsvector('english', 'http://example.com/stuff/index.html');
to_tsvector
------------------------------------------------------------------------------------------------------------------------
'/stuff/index.html':2 'com':1 'exampl':0 'example.com':0 'example.com/stuff/index.html':0 'html':4 'index':3 'stuff':2
3. EMAIL
testdb=# SELECT ts_debug('sushant@foo.bar');
ts_debug
-----------------------------------------------------------------------------
(email,"Email address",sushant@foo.bar,{simple},simple,{sushant@foo.bar})
(asciiword,"Word, all ASCII",sushant,{english_stem},english_stem,{sushant})
(blank,"Space symbols",@,{},,)
(asciiword,"Word, all ASCII",foo,{english_stem},english_stem,{foo})
(blank,"Space symbols",.,{},,)
(asciiword,"Word, all ASCII",bar,{english_stem},english_stem,{bar})
testdb=# SELECT to_tsvector('english', 'sushant@foo.bar');
to_tsvector
-------------------------------------------------
'bar':2 'foo':1 'sushant':0 'sushant@foo.bar':0
4. HOST
testdb=# SELECT ts_debug('foo.bar.com');
ts_debug
---------------------------------------------------------------------
(host,Host,foo.bar.com,{simple},simple,{foo.bar.com})
(asciiword,"Word, all ASCII",foo,{english_stem},english_stem,{foo})
(blank,"Space symbols",.,{},,)
(asciiword,"Word, all ASCII",bar,{english_stem},english_stem,{bar})
(blank,"Space symbols",.,{},,)
(asciiword,"Word, all ASCII",com,{english_stem},english_stem,{com})
testdb=# SELECT to_tsvector('english', 'foo.bar.com');
to_tsvector
-----------------------------------------
'bar':1 'com':2 'foo':0 'foo.bar.com':0
Index: src/backend/tsearch/ts_parse.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/tsearch/ts_parse.c,v
retrieving revision 1.17
diff -u -r1.17 ts_parse.c
--- src/backend/tsearch/ts_parse.c 26 Feb 2010 02:01:05 -0000 1.17
+++ src/backend/tsearch/ts_parse.c 1 Sep 2010 05:59:35 -0000
@@ -19,7 +19,7 @@
#include "tsearch/ts_utils.h"
#define IGNORE_LONGLEXEME 1
-
+#define COMPLEX_TOKEN(x) ((x) == 4 || (x) == 5 || (x) == 6 || (x) == 17 || (x) == 18 || (x) == 19)
/*
* Lexize subsystem
*/
@@ -407,8 +407,6 @@
{
TSLexeme *ptr = norms;
- prs->pos++; /* set pos */
-
while (ptr->lexeme)
{
if (prs->curwords == prs->lenwords)
@@ -429,6 +427,10 @@
prs->curwords++;
}
pfree(norms);
+
+ if (!COMPLEX_TOKEN(type))
+ prs->pos++; /* set pos */
+
}
} while (type > 0);
Index: src/backend/tsearch/wparser_def.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/tsearch/wparser_def.c,v
retrieving revision 1.33
diff -u -r1.33 wparser_def.c
--- src/backend/tsearch/wparser_def.c 19 Aug 2010 05:57:34 -0000 1.33
+++ src/backend/tsearch/wparser_def.c 1 Sep 2010 05:59:36 -0000
@@ -23,7 +23,7 @@
/* Define me to enable tracing of parser behavior */
-/* #define WPARSER_TRACE */
+#define WPARSER_TRACE
/* Output token categories */
@@ -249,7 +249,8 @@
TParserPosition *state;
bool ignore;
bool wanthost;
-
+ int partstop;
+ TParserState afterpart;
/* silly char */
char c;
@@ -617,7 +618,32 @@
}
return 1;
}
+static int
+p_ispartbingo(TParser *prs)
+{
+ int ret = 0;
+ if (prs->partstop > 0)
+ {
+ ret = 1;
+ if (prs->partstop <= prs->state->posbyte)
+ {
+ prs->state->state = prs->afterpart;
+ prs->partstop = 0;
+ }
+ else
+ prs->state->state = TPS_Base;
+ }
+ return ret;
+}
+static int
+p_ispart(TParser *prs)
+{
+ if (prs->partstop > 0)
+ return 1;
+ else
+ return 0;
+}
/* deliberately suppress unused-function complaints for the above */
void _make_compiler_happy(void);
@@ -688,6 +714,21 @@
}
static void
+SpecialPart(TParser *prs)
+{
+ prs->partstop = prs->state->posbyte;
+ prs->state->posbyte -= prs->state->lenbytetoken;
+ prs->state->poschar -= prs->state->lenchartoken;
+ prs->afterpart = TPS_Base;
+}
+static void
+SpecialUrlPart(TParser *prs)
+{
+ SpecialPart(prs);
+ prs->afterpart = TPS_InURLPathStart;
+}
+
+static void
SpecialVerVersion(TParser *prs)
{
prs->state->posbyte -= prs->state->lenbytetoken;
@@ -1057,6 +1098,7 @@
{p_iseqC, '-', A_PUSH, TPS_InSignedIntFirst, 0, NULL},
{p_iseqC, '+', A_PUSH, TPS_InSignedIntFirst, 0, NULL},
{p_iseqC, '&', A_PUSH, TPS_InXMLEntityFirst, 0, NULL},
+ {p_ispart, 0, A_NEXT, TPS_InSpace, 0, NULL},
{p_iseqC, '~', A_PUSH, TPS_InFileTwiddle, 0, NULL},
{p_iseqC, '/', A_PUSH, TPS_InFileFirst, 0, NULL},
{p_iseqC, '.', A_PUSH, TPS_InPathFirstFirst, 0, NULL},
@@ -1068,6 +1110,7 @@
{p_isEOF, 0, A_BINGO, TPS_Base, NUMWORD, NULL},
{p_isalnum, 0, A_NEXT, TPS_InNumWord, 0, NULL},
{p_isspecial, 0, A_NEXT, TPS_InNumWord, 0, NULL},
+ {p_ispartbingo, 0, A_BINGO, TPS_Null, NUMWORD, NULL},
{p_iseqC, '@', A_PUSH, TPS_InEmail, 0, NULL},
{p_iseqC, '/', A_PUSH, TPS_InFileFirst, 0, NULL},
{p_iseqC, '.', A_PUSH, TPS_InFileNext, 0, NULL},
@@ -1078,6 +1121,7 @@
static const TParserStateActionItem actionTPS_InAsciiWord[] = {
{p_isEOF, 0, A_BINGO, TPS_Base, ASCIIWORD, NULL},
{p_isasclet, 0, A_NEXT, TPS_Null, 0, NULL},
+ {p_ispartbingo, 0, A_BINGO, TPS_Null, ASCIIWORD, NULL},
{p_iseqC, '.', A_PUSH, TPS_InHostFirstDomain, 0, NULL},
{p_iseqC, '.', A_PUSH, TPS_InFileNext, 0, NULL},
{p_iseqC, '-', A_PUSH, TPS_InHostFirstAN, 0, NULL},
@@ -1105,13 +1149,14 @@
static const TParserStateActionItem actionTPS_InUnsignedInt[] = {
{p_isEOF, 0, A_BINGO, TPS_Base, UNSIGNEDINT, NULL},
{p_isdigit, 0, A_NEXT, TPS_Null, 0, NULL},
+ {p_isasclet, 0, A_PUSH, TPS_InHost, 0, NULL},
+ {p_isalpha, 0, A_NEXT, TPS_InNumWord, 0, NULL},
+ {p_isspecial, 0, A_NEXT, TPS_InNumWord, 0, NULL},
+ {p_ispartbingo, 0, A_BINGO, TPS_Null, UNSIGNEDINT, NULL},
{p_iseqC, '.', A_PUSH, TPS_InHostFirstDomain, 0, NULL},
{p_iseqC, '.', A_PUSH, TPS_InUDecimalFirst, 0, NULL},
{p_iseqC, 'e', A_PUSH, TPS_InMantissaFirst, 0, NULL},
{p_iseqC, 'E', A_PUSH, TPS_InMantissaFirst, 0, NULL},
- {p_isasclet, 0, A_PUSH, TPS_InHost, 0, NULL},
- {p_isalpha, 0, A_NEXT, TPS_InNumWord, 0, NULL},
- {p_isspecial, 0, A_NEXT, TPS_InNumWord, 0, NULL},
{p_iseqC, '/', A_PUSH, TPS_InFileFirst, 0, NULL},
{NULL, 0, A_BINGO, TPS_Base, UNSIGNEDINT, NULL}
};
@@ -1418,7 +1463,7 @@
};
static const TParserStateActionItem actionTPS_InHostDomain[] = {
- {p_isEOF, 0, A_BINGO | A_CLRALL, TPS_Base, HOST, NULL},
+ {p_isEOF, 0, A_BINGO | A_CLRALL, TPS_Base, HOST, SpecialPart},
{p_isasclet, 0, A_NEXT, TPS_InHostDomain, 0, NULL},
{p_isdigit, 0, A_PUSH, TPS_InHost, 0, NULL},
{p_iseqC, ':', A_PUSH, TPS_InPortFirst, 0, NULL},
@@ -1427,9 +1472,9 @@
{p_iseqC, '.', A_PUSH, TPS_InHostFirstDomain, 0, NULL},
{p_iseqC, '@', A_PUSH, TPS_InEmail, 0, NULL},
{p_isdigit, 0, A_POP, TPS_Null, 0, NULL},
- {p_isstophost, 0, A_BINGO | A_CLRALL, TPS_InURLPathStart, HOST, NULL},
+ {p_isstophost, 0, A_BINGO | A_CLRALL, TPS_Base, HOST, SpecialUrlPart},
{p_iseqC, '/', A_PUSH, TPS_InFURL, 0, NULL},
- {NULL, 0, A_BINGO | A_CLRALL, TPS_Base, HOST, NULL}
+ {NULL, 0, A_BINGO | A_CLRALL, TPS_Base, HOST, SpecialPart}
};
static const TParserStateActionItem actionTPS_InPortFirst[] = {
@@ -1439,11 +1484,11 @@
};
static const TParserStateActionItem actionTPS_InPort[] = {
- {p_isEOF, 0, A_BINGO | A_CLRALL, TPS_Base, HOST, NULL},
+ {p_isEOF, 0, A_BINGO | A_CLRALL, TPS_Base, HOST, SpecialPart},
{p_isdigit, 0, A_NEXT, TPS_InPort, 0, NULL},
- {p_isstophost, 0, A_BINGO | A_CLRALL, TPS_InURLPathStart, HOST, NULL},
+ {p_isstophost, 0, A_BINGO | A_CLRALL, TPS_Base, HOST, SpecialUrlPart},
{p_iseqC, '/', A_PUSH, TPS_InFURL, 0, NULL},
- {NULL, 0, A_BINGO | A_CLRALL, TPS_Base, HOST, NULL}
+ {NULL, 0, A_BINGO | A_CLRALL, TPS_Base, HOST, SpecialPart}
};
static const TParserStateActionItem actionTPS_InHostFirstAN[] = {
@@ -1457,6 +1502,7 @@
{p_isEOF, 0, A_POP, TPS_Null, 0, NULL},
{p_isdigit, 0, A_NEXT, TPS_InHost, 0, NULL},
{p_isasclet, 0, A_NEXT, TPS_InHost, 0, NULL},
+ {p_ispartbingo, 0, A_BINGO | A_CLRALL, TPS_Null, WORD_T, NULL},
{p_iseqC, '@', A_PUSH, TPS_InEmail, 0, NULL},
{p_iseqC, '.', A_PUSH, TPS_InHostFirstDomain, 0, NULL},
{p_iseqC, '-', A_PUSH, TPS_InHostFirstAN, 0, NULL},
@@ -1466,7 +1512,7 @@
static const TParserStateActionItem actionTPS_InEmail[] = {
{p_isstophost, 0, A_POP, TPS_Null, 0, NULL},
- {p_ishost, 0, A_BINGO | A_CLRALL, TPS_Base, EMAIL, NULL},
+ {p_ishost, 0, A_BINGO | A_CLRALL, TPS_Base, EMAIL, SpecialPart},
{NULL, 0, A_POP, TPS_Null, 0, NULL}
};
@@ -1507,22 +1553,22 @@
};
static const TParserStateActionItem actionTPS_InPathSecond[] = {
- {p_isEOF, 0, A_BINGO | A_CLEAR, TPS_Base, FILEPATH, NULL},
+ {p_isEOF, 0, A_BINGO | A_CLEAR, TPS_Base, FILEPATH, SpecialPart},
{p_iseqC, '/', A_NEXT | A_PUSH, TPS_InFileFirst, 0, NULL},
- {p_iseqC, '/', A_BINGO | A_CLEAR, TPS_Base, FILEPATH, NULL},
- {p_isspace, 0, A_BINGO | A_CLEAR, TPS_Base, FILEPATH, NULL},
+ {p_iseqC, '/', A_BINGO | A_CLEAR, TPS_Base, FILEPATH, SpecialPart},
+ {p_isspace, 0, A_BINGO | A_CLEAR, TPS_Base, FILEPATH, SpecialPart},
{NULL, 0, A_POP, TPS_Null, 0, NULL}
};
static const TParserStateActionItem actionTPS_InFile[] = {
- {p_isEOF, 0, A_BINGO, TPS_Base, FILEPATH, NULL},
+ {p_isEOF, 0, A_BINGO, TPS_Base, FILEPATH, SpecialPart},
{p_isasclet, 0, A_NEXT, TPS_InFile, 0, NULL},
{p_isdigit, 0, A_NEXT, TPS_InFile, 0, NULL},
{p_iseqC, '.', A_PUSH, TPS_InFileNext, 0, NULL},
{p_iseqC, '_', A_NEXT, TPS_InFile, 0, NULL},
{p_iseqC, '-', A_NEXT, TPS_InFile, 0, NULL},
{p_iseqC, '/', A_PUSH, TPS_InFileFirst, 0, NULL},
- {NULL, 0, A_BINGO, TPS_Base, FILEPATH, NULL}
+ {NULL, 0, A_BINGO, TPS_Base, FILEPATH, SpecialPart}
};
static const TParserStateActionItem actionTPS_InFileNext[] = {
@@ -1544,9 +1590,9 @@
};
static const TParserStateActionItem actionTPS_InURLPath[] = {
- {p_isEOF, 0, A_BINGO, TPS_Base, URLPATH, NULL},
+ {p_isEOF, 0, A_BINGO, TPS_Base, URLPATH, SpecialPart},
{p_isurlchar, 0, A_NEXT, TPS_InURLPath, 0, NULL},
- {NULL, 0, A_BINGO, TPS_Base, URLPATH, NULL}
+ {NULL, 0, A_BINGO, TPS_Base, URLPATH, SpecialPart}
};
static const TParserStateActionItem actionTPS_InFURL[] = {
Index: src/test/regress/expected/tsdicts.out
===================================================================
RCS file: /projects/cvsroot/pgsql/src/test/regress/expected/tsdicts.out,v
retrieving revision 1.6
diff -u -r1.6 tsdicts.out
--- src/test/regress/expected/tsdicts.out 14 Aug 2009 14:53:20 -0000 1.6
+++ src/test/regress/expected/tsdicts.out 1 Sep 2010 05:59:37 -0000
@@ -236,9 +236,9 @@
word, numword, asciiword, hword, numhword, asciihword, hword_part, hword_numpart, hword_asciipart
WITH ispell, english_stem;
SELECT to_tsvector('ispell_tst', 'Booking the skies after rebookings for footballklubber from a foot');
- to_tsvector
-----------------------------------------------------------------------------------------------------
- 'ball':7 'book':1,5 'booking':1,5 'foot':7,10 'football':7 'footballklubber':7 'klubber':7 'sky':3
+ to_tsvector
+---------------------------------------------------------------------------------------------------
+ 'ball':6 'book':0,4 'booking':0,4 'foot':6,9 'football':6 'footballklubber':6 'klubber':6 'sky':2
(1 row)
SELECT to_tsquery('ispell_tst', 'footballklubber');
@@ -260,9 +260,9 @@
ALTER TEXT SEARCH CONFIGURATION hunspell_tst ALTER MAPPING
REPLACE ispell WITH hunspell;
SELECT to_tsvector('hunspell_tst', 'Booking the skies after rebookings for footballklubber from a foot');
- to_tsvector
-----------------------------------------------------------------------------------------------------
- 'ball':7 'book':1,5 'booking':1,5 'foot':7,10 'football':7 'footballklubber':7 'klubber':7 'sky':3
+ to_tsvector
+---------------------------------------------------------------------------------------------------
+ 'ball':6 'book':0,4 'booking':0,4 'foot':6,9 'football':6 'footballklubber':6 'klubber':6 'sky':2
(1 row)
SELECT to_tsquery('hunspell_tst', 'footballklubber');
@@ -285,21 +285,21 @@
asciiword, hword_asciipart, asciihword
WITH synonym, english_stem;
SELECT to_tsvector('synonym_tst', 'Postgresql is often called as postgres or pgsql and pronounced as postgre');
- to_tsvector
----------------------------------------------------
- 'call':4 'often':3 'pgsql':1,6,8,12 'pronounc':10
+ to_tsvector
+--------------------------------------------------
+ 'call':3 'often':2 'pgsql':0,5,7,11 'pronounc':9
(1 row)
SELECT to_tsvector('synonym_tst', 'Most common mistake is to write Gogle instead of Google');
- to_tsvector
-----------------------------------------------------------
- 'common':2 'googl':7,10 'instead':8 'mistak':3 'write':6
+ to_tsvector
+---------------------------------------------------------
+ 'common':1 'googl':6,9 'instead':7 'mistak':2 'write':5
(1 row)
SELECT to_tsvector('synonym_tst', 'Indexes or indices - Which is right plural form of index?');
- to_tsvector
-----------------------------------------------
- 'form':8 'index':1,3,10 'plural':7 'right':6
+ to_tsvector
+---------------------------------------------
+ 'form':7 'index':0,2,9 'plural':6 'right':5
(1 row)
SELECT to_tsquery('synonym_tst', 'Index & indices');
@@ -319,18 +319,18 @@
SELECT to_tsvector('thesaurus_tst', 'one postgres one two one two three one');
to_tsvector
----------------------------------
- '1':1,5 '12':3 '123':4 'pgsql':2
+ '1':0,4 '12':2 '123':3 'pgsql':1
(1 row)
SELECT to_tsvector('thesaurus_tst', 'Supernovae star is very new star and usually called supernovae (abbrevation SN)');
- to_tsvector
--------------------------------------------------------------
- 'abbrev':10 'call':8 'new':4 'sn':1,9,11 'star':5 'usual':7
+ to_tsvector
+------------------------------------------------------------
+ 'abbrev':9 'call':7 'new':3 'sn':0,8,10 'star':4 'usual':6
(1 row)
SELECT to_tsvector('thesaurus_tst', 'Booking tickets is looking like a booking a tickets');
- to_tsvector
--------------------------------------------------------
- 'card':3,10 'invit':2,9 'like':6 'look':5 'order':1,8
+ to_tsvector
+------------------------------------------------------
+ 'card':2,9 'invit':1,8 'like':5 'look':4 'order':0,7
(1 row)
Index: src/test/regress/expected/tsearch.out
===================================================================
RCS file: /projects/cvsroot/pgsql/src/test/regress/expected/tsearch.out,v
retrieving revision 1.18
diff -u -r1.18 tsearch.out
--- src/test/regress/expected/tsearch.out 28 Apr 2010 02:04:16 -0000 1.18
+++ src/test/regress/expected/tsearch.out 1 Sep 2010 05:59:37 -0000
@@ -263,34 +263,90 @@
1 | qwe
12 | @
19 | efd.r
+ 1 | efd
+ 12 | .
+ 1 | r
12 | '
14 | http://
6 | www.com
+ 1 | www
+ 12 | .
+ 1 | com
12 | /
14 | http://
5 | aew.werc.ewr/?ad=qwe&dw
6 | aew.werc.ewr
+ 1 | aew
+ 12 | .
+ 1 | werc
+ 12 | .
+ 1 | ewr
18 | /?ad=qwe&dw
+ 12 | /?
+ 1 | ad
+ 12 | =
+ 1 | qwe
+ 12 | &
+ 1 | dw
12 |
5 | 1aew.werc.ewr/?ad=qwe&dw
6 | 1aew.werc.ewr
+ 2 | 1aew
+ 12 | .
+ 1 | werc
+ 12 | .
+ 1 | ewr
18 | /?ad=qwe&dw
+ 12 | /?
+ 1 | ad
+ 12 | =
+ 1 | qwe
+ 12 | &
+ 1 | dw
12 |
6 | 2aew.werc.ewr
+ 2 | 2aew
+ 12 | .
+ 1 | werc
+ 12 | .
+ 1 | ewr
12 |
14 | http://
5 | 3aew.werc.ewr/?ad=qwe&dw
6 | 3aew.werc.ewr
+ 2 | 3aew
+ 12 | .
+ 1 | werc
+ 12 | .
+ 1 | ewr
18 | /?ad=qwe&dw
+ 12 | /?
+ 1 | ad
+ 12 | =
+ 1 | qwe
+ 12 | &
+ 1 | dw
12 |
14 | http://
6 | 4aew.werc.ewr
+ 2 | 4aew
+ 12 | .
+ 1 | werc
+ 12 | .
+ 1 | ewr
12 |
14 | http://
5 | 5aew.werc.ewr:8100/?
6 | 5aew.werc.ewr:8100
+ 2 | 5aew
+ 12 | .
+ 1 | werc
+ 12 | .
+ 1 | ewr
+ 12 | :
+ 22 | 8100
18 | /?
- 12 |
+ 12 | /?
1 | ad
12 | =
1 | qwe
@@ -299,11 +355,41 @@
12 |
5 | 6aew.werc.ewr:8100/?ad=qwe&dw
6 | 6aew.werc.ewr:8100
+ 2 | 6aew
+ 12 | .
+ 1 | werc
+ 12 | .
+ 1 | ewr
+ 12 | :
+ 22 | 8100
18 | /?ad=qwe&dw
+ 12 | /?
+ 1 | ad
+ 12 | =
+ 1 | qwe
+ 12 | &
+ 1 | dw
12 |
5 | 7aew.werc.ewr:8100/?ad=qwe&dw=%20%32
6 | 7aew.werc.ewr:8100
+ 2 | 7aew
+ 12 | .
+ 1 | werc
+ 12 | .
+ 1 | ewr
+ 12 | :
+ 22 | 8100
18 | /?ad=qwe&dw=%20%32
+ 12 | /?
+ 1 | ad
+ 12 | =
+ 1 | qwe
+ 12 | &
+ 1 | dw
+ 12 | =%
+ 22 | 20
+ 12 | %
+ 22 | 32
12 |
7 | +4.0e-10
12 |
@@ -320,6 +406,11 @@
20 | 5.005
12 |
4 | teodor@stack.net
+ 1 | teodor
+ 12 | @
+ 1 | stack
+ 12 | .
+ 1 | net
12 |
16 | qwe-wer
11 | qwe
@@ -349,20 +440,51 @@
12 | +
|
19 | /usr/local/fff
+ 12 | /
+ 1 | usr
+ 12 | /
+ 1 | local
+ 12 | /
+ 1 | fff
12 |
19 | /awdf/dwqe/4325
+ 12 | /
+ 1 | awdf
+ 12 | /
+ 1 | dwqe
+ 12 | /
+ 22 | 4325
12 |
19 | rewt/ewr
+ 1 | rewt
+ 12 | /
+ 1 | ewr
12 |
1 | wefjn
12 |
19 | /wqe-324/ewr
+ 12 | /
+ 1 | wqe
+ 21 | -324
+ 12 | /
+ 1 | ewr
12 |
19 | gist.h
+ 1 | gist
+ 12 | .
+ 1 | h
12 |
19 | gist.h.c
+ 1 | gist
+ 12 | .
+ 1 | h
+ 12 | .
+ 1 | c
12 |
19 | gist.c
+ 1 | gist
+ 12 | .
+ 1 | c
12 | .
1 | readline
12 |
@@ -393,14 +515,14 @@
12 |
12 | <>
1 | qwerty
-(133 rows)
+(255 rows)
SELECT to_tsvector('english', '345 qwe@efd.r '' http://www.com/ http://aew.werc.ewr/?ad=qwe&dw 1aew.werc.ewr/?ad=qwe&dw 2aew.werc.ewr http://3aew.werc.ewr/?ad=qwe&dw http://4aew.werc.ewr http://5aew.werc.ewr:8100/? ad=qwe&dw 6aew.werc.ewr:8100/?ad=qwe&dw 7aew.werc.ewr:8100/?ad=qwe&dw=%20%32 +4.0e-10 qwe qwe qwqwe 234.435 455 5.005 teodor@stack.net qwe-wer asdf <fr>qwer jf sdjk<we hjwer <werrwe> ewr1> ewri2 <a href="qwe<qwe>">
/usr/local/fff /awdf/dwqe/4325 rewt/ewr wefjn /wqe-324/ewr gist.h gist.h.c gist.c. readline 4.2 4.2. 4.2, readline-4.2 readline-4.2. 234
<i <b> wow < jqw <> qwerty');
- to_tsvector
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
- '+4.0e-10':28 '-4.2':60,62 '/?':18 '/?ad=qwe&dw':7,10,14,24 '/?ad=qwe&dw=%20%32':27 '/awdf/dwqe/4325':48 '/usr/local/fff':47 '/wqe-324/ewr':51 '1aew.werc.ewr':9 '1aew.werc.ewr/?ad=qwe&dw':8 '234':63 '234.435':32 '2aew.werc.ewr':11 '345':1 '3aew.werc.ewr':13 '3aew.werc.ewr/?ad=qwe&dw':12 '4.2':56,57,58 '455':33 '4aew.werc.ewr':15 '5.005':34 '5aew.werc.ewr:8100':17 '5aew.werc.ewr:8100/?':16 '6aew.werc.ewr:8100':23 '6aew.werc.ewr:8100/?ad=qwe&dw':22 '7aew.werc.ewr:8100':26 '7aew.werc.ewr:8100/?ad=qwe&dw=%20%32':25 'ad':19 'aew.werc.ewr':6 'aew.werc.ewr/?ad=qwe&dw':5 'asdf':39 'dw':21 'efd.r':3 'ewr1':45 'ewri2':46 'gist.c':54 'gist.h':52 'gist.h.c':53 'hjwer':44 'jf':41 'jqw':66 'qwe':2,20,29,30,37 'qwe-wer':36 'qwer':40 'qwerti':67 'qwqwe':31 'readlin':55,59,61 'rewt/ewr':49 'sdjk':42 'teodor@stack.net':35 'wefjn':50 'wer':38 'wow':65 'www.com':4
+ to_tsvector
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ '+4.0e-10':53 '-324':84 '-4.2':98,100 '/?':34 '/?ad=qwe&dw':9,15,24,41 '/?ad=qwe&dw=%20%32':48 '/awdf/dwqe/4325':77 '/usr/local/fff':74 '/wqe-324/ewr':83 '1aew':12 '1aew.werc.ewr':12 '1aew.werc.ewr/?ad=qwe&dw':12 '20':51 '234':101 '234.435':57 '2aew':18 '2aew.werc.ewr':18 '32':52 '345':0 '3aew':21 '3aew.werc.ewr':21 '3aew.werc.ewr/?ad=qwe&dw':21 '4.2':94,95,96 '4325':79 '455':58 '4aew':27 '4aew.werc.ewr':27 '5.005':59 '5aew':30 '5aew.werc.ewr:8100':30 '5aew.werc.ewr:8100/?':30 '6aew':37 '6aew.werc.ewr:8100':37 '6aew.werc.ewr:8100/?ad=qwe&dw':37 '7aew':44 '7aew.werc.ewr:8100':44 '7aew.werc.ewr:8100/?ad=qwe&dw=%20%32':44 '8100':33,40,47 'ad':9,15,24,34,41,48 'aew':6 'aew.werc.ewr':6 'aew.werc.ewr/?ad=qwe&dw':6 'asdf':66 'awdf':77 'c':90,92 'com':5 'dw':11,17,26,36,43,50 'dwqe':78 'efd':2 'efd.r':2 'ewr':8,14,20,23,29,32,39,46,81,85 'ewr1':72 'ewri2':73 'fff':76 'gist':86,88,91 'gist.c':91 'gist.h':86 'gist.h.c':88 'h':87,89 'hjwer':71 'jf':68 'jqw':104 'local':75 'net':62 'qwe':1,10,16,25,35,42,49,54,55,64 'qwe-wer':63 'qwer':67 'qwerti':105 'qwqwe':56 'r':3 'readlin':93,97,99 'rewt':80 'rewt/ewr':80 'sdjk':69 'stack':61 'teodor':60 'teodor@stack.net':60 'usr':74 'wefjn':82 'wer':65 'werc':7,13,19,22,28,31,38,45 'wow':103 'wqe':83 'www':4 'www.com':4
(1 row)
SELECT length(to_tsvector('english', '345 qwe@efd.r '' http://www.com/ http://aew.werc.ewr/?ad=qwe&dw 1aew.werc.ewr/?ad=qwe&dw 2aew.werc.ewr http://3aew.werc.ewr/?ad=qwe&dw http://4aew.werc.ewr http://5aew.werc.ewr:8100/? ad=qwe&dw 6aew.werc.ewr:8100/?ad=qwe&dw 7aew.werc.ewr:8100/?ad=qwe&dw=%20%32 +4.0e-10 qwe qwe qwqwe 234.435 455 5.005 teodor@stack.net qwe-wer asdf <fr>qwer jf sdjk<we hjwer <werrwe> ewr1> ewri2 <a href="qwe<qwe>">
@@ -408,7 +530,7 @@
<i <b> wow < jqw <> qwerty'));
length
--------
- 53
+ 85
(1 row)
-- ts_debug
@@ -428,41 +550,83 @@
-- check parsing of URLs
SELECT * from ts_debug('english', 'http://www.harewoodsolutions.co.uk/press.aspx</span>');
- alias | description | token | dictionaries | dictionary | lexemes
-----------+---------------+----------------------------------------+--------------+------------+------------------------------------------
- protocol | Protocol head | http:// | {} | |
- url | URL | www.harewoodsolutions.co.uk/press.aspx | {simple} | simple | {www.harewoodsolutions.co.uk/press.aspx}
- host | Host | www.harewoodsolutions.co.uk | {simple} | simple | {www.harewoodsolutions.co.uk}
- url_path | URL path | /press.aspx | {simple} | simple | {/press.aspx}
- tag | XML tag | </span> | {} | |
-(5 rows)
+ alias | description | token | dictionaries | dictionary | lexemes
+-----------+-----------------+----------------------------------------+----------------+--------------+------------------------------------------
+ protocol | Protocol head | http:// | {} | |
+ url | URL | www.harewoodsolutions.co.uk/press.aspx | {simple} | simple | {www.harewoodsolutions.co.uk/press.aspx}
+ host | Host | www.harewoodsolutions.co.uk | {simple} | simple | {www.harewoodsolutions.co.uk}
+ asciiword | Word, all ASCII | www | {english_stem} | english_stem | {www}
+ blank | Space symbols | . | {} | |
+ asciiword | Word, all ASCII | harewoodsolutions | {english_stem} | english_stem | {harewoodsolut}
+ blank | Space symbols | . | {} | |
+ asciiword | Word, all ASCII | co | {english_stem} | english_stem | {co}
+ blank | Space symbols | . | {} | |
+ asciiword | Word, all ASCII | uk | {english_stem} | english_stem | {uk}
+ url_path | URL path | /press.aspx | {simple} | simple | {/press.aspx}
+ blank | Space symbols | / | {} | |
+ asciiword | Word, all ASCII | press | {english_stem} | english_stem | {press}
+ blank | Space symbols | . | {} | |
+ asciiword | Word, all ASCII | aspx | {english_stem} | english_stem | {aspx}
+ tag | XML tag | </span> | {} | |
+(16 rows)
SELECT * from ts_debug('english', 'http://aew.wer0c.ewr/id?ad=qwe&dw<span>');
- alias | description | token | dictionaries | dictionary | lexemes
-----------+---------------+----------------------------+--------------+------------+------------------------------
- protocol | Protocol head | http:// | {} | |
- url | URL | aew.wer0c.ewr/id?ad=qwe&dw | {simple} | simple | {aew.wer0c.ewr/id?ad=qwe&dw}
- host | Host | aew.wer0c.ewr | {simple} | simple | {aew.wer0c.ewr}
- url_path | URL path | /id?ad=qwe&dw | {simple} | simple | {/id?ad=qwe&dw}
- tag | XML tag | <span> | {} | |
-(5 rows)
+ alias | description | token | dictionaries | dictionary | lexemes
+-----------+-------------------+----------------------------+----------------+--------------+------------------------------
+ protocol | Protocol head | http:// | {} | |
+ url | URL | aew.wer0c.ewr/id?ad=qwe&dw | {simple} | simple | {aew.wer0c.ewr/id?ad=qwe&dw}
+ host | Host | aew.wer0c.ewr | {simple} | simple | {aew.wer0c.ewr}
+ asciiword | Word, all ASCII | aew | {english_stem} | english_stem | {aew}
+ blank | Space symbols | . | {} | |
+ asciiword | Word, all ASCII | wer | {english_stem} | english_stem | {wer}
+ word | Word, all letters | 0c | {english_stem} | english_stem | {0c}
+ blank | Space symbols | . | {} | |
+ asciiword | Word, all ASCII | ewr | {english_stem} | english_stem | {ewr}
+ url_path | URL path | /id?ad=qwe&dw | {simple} | simple | {/id?ad=qwe&dw}
+ blank | Space symbols | / | {} | |
+ asciiword | Word, all ASCII | id | {english_stem} | english_stem | {id}
+ blank | Space symbols | ? | {} | |
+ asciiword | Word, all ASCII | ad | {english_stem} | english_stem | {ad}
+ blank | Space symbols | = | {} | |
+ asciiword | Word, all ASCII | qwe | {english_stem} | english_stem | {qwe}
+ blank | Space symbols | & | {} | |
+ asciiword | Word, all ASCII | dw | {english_stem} | english_stem | {dw}
+ tag | XML tag | <span> | {} | |
+(19 rows)
SELECT * from ts_debug('english', 'http://5aew.werc.ewr:8100/?');
- alias | description | token | dictionaries | dictionary | lexemes
-----------+---------------+----------------------+--------------+------------+------------------------
- protocol | Protocol head | http:// | {} | |
- url | URL | 5aew.werc.ewr:8100/? | {simple} | simple | {5aew.werc.ewr:8100/?}
- host | Host | 5aew.werc.ewr:8100 | {simple} | simple | {5aew.werc.ewr:8100}
- url_path | URL path | /? | {simple} | simple | {/?}
-(4 rows)
+ alias | description | token | dictionaries | dictionary | lexemes
+-----------+-------------------+----------------------+----------------+--------------+------------------------
+ protocol | Protocol head | http:// | {} | |
+ url | URL | 5aew.werc.ewr:8100/? | {simple} | simple | {5aew.werc.ewr:8100/?}
+ host | Host | 5aew.werc.ewr:8100 | {simple} | simple | {5aew.werc.ewr:8100}
+ word | Word, all letters | 5aew | {english_stem} | english_stem | {5aew}
+ blank | Space symbols | . | {} | |
+ asciiword | Word, all ASCII | werc | {english_stem} | english_stem | {werc}
+ blank | Space symbols | . | {} | |
+ asciiword | Word, all ASCII | ewr | {english_stem} | english_stem | {ewr}
+ blank | Space symbols | : | {} | |
+ uint | Unsigned integer | 8100 | {simple} | simple | {8100}
+ url_path | URL path | /? | {simple} | simple | {/?}
+ blank | Space symbols | /? | {} | |
+(12 rows)
SELECT * from ts_debug('english', '5aew.werc.ewr:8100/?xx');
- alias | description | token | dictionaries | dictionary | lexemes
-----------+-------------+------------------------+--------------+------------+--------------------------
- url | URL | 5aew.werc.ewr:8100/?xx | {simple} | simple | {5aew.werc.ewr:8100/?xx}
- host | Host | 5aew.werc.ewr:8100 | {simple} | simple | {5aew.werc.ewr:8100}
- url_path | URL path | /?xx | {simple} | simple | {/?xx}
-(3 rows)
+ alias | description | token | dictionaries | dictionary | lexemes
+-----------+-------------------+------------------------+----------------+--------------+--------------------------
+ url | URL | 5aew.werc.ewr:8100/?xx | {simple} | simple | {5aew.werc.ewr:8100/?xx}
+ host | Host | 5aew.werc.ewr:8100 | {simple} | simple | {5aew.werc.ewr:8100}
+ word | Word, all letters | 5aew | {english_stem} | english_stem | {5aew}
+ blank | Space symbols | . | {} | |
+ asciiword | Word, all ASCII | werc | {english_stem} | english_stem | {werc}
+ blank | Space symbols | . | {} | |
+ asciiword | Word, all ASCII | ewr | {english_stem} | english_stem | {ewr}
+ blank | Space symbols | : | {} | |
+ uint | Unsigned integer | 8100 | {simple} | simple | {8100}
+ url_path | URL path | /?xx | {simple} | simple | {/?xx}
+ blank | Space symbols | /? | {} | |
+ asciiword | Word, all ASCII | xx | {english_stem} | english_stem | {xx}
+(12 rows)
-- to_tsquery
SELECT to_tsquery('english', 'qwe & sKies ');
@@ -1002,7 +1166,7 @@
SELECT to_tsvector('SKIES My booKs');
to_tsvector
----------------------------
- 'books':3 'my':2 'skies':1
+ 'books':2 'my':1 'skies':0
(1 row)
SELECT plainto_tsquery('SKIES My booKs');
@@ -1021,7 +1185,7 @@
SELECT to_tsvector('SKIES My booKs');
to_tsvector
------------------
- 'book':3 'sky':1
+ 'book':2 'sky':0
(1 row)
SELECT plainto_tsquery('SKIES My booKs');
@@ -1075,20 +1239,20 @@
select * from pendtest where 'ipsu:*'::tsquery @@ ts;
ts
--------------------
- 'ipsum':2 'lore':1
+ 'ipsum':1 'lore':0
(1 row)
select * from pendtest where 'ipsa:*'::tsquery @@ ts;
ts
--------------------
- 'ipsam':2 'lore':1
+ 'ipsam':1 'lore':0
(1 row)
select * from pendtest where 'ips:*'::tsquery @@ ts;
ts
--------------------
- 'ipsam':2 'lore':1
- 'ipsum':2 'lore':1
+ 'ipsam':1 'lore':0
+ 'ipsum':1 'lore':0
(2 rows)
select * from pendtest where 'ipt:*'::tsquery @@ ts;