Which way do we choose? Or did I miss some variant?
I would like to go with option 3)... Comments?
Teodor Sigaev E-mail: [EMAIL PROTECTED]
WWW: http://www.sigaev
To support this sanely though wouldn't you need to know which language rule a
tsvector was generated with? Like, have a byte in the tsvector tagging it with
the language rule forever more?
No. As a corner case, a dictionary might return just a number or a hash value.
What I'm wondering about is
it. I'm inclined to make the code in parse_type.c take either integer
And modify ArrayGetTypmods() to ArrayGetIntegerTypmods()
Teodor Sigaev E-mail: [EMAIL PROTECTED]
WWW: http:
Is it worth providing an ArrayGetStringTypmods in core, when it won't
be used by any existing core datatypes?
I don't think so - cstring[] is a set of strings itself. I don't believe that we
could suggest something commonly useful without some real-world examples.
So, added to my plan
n) single encoded files. That will touch snowball, ispell, synonym, thesaurus
and simple dictionaries
n+1) use encoding names instead of locale's names in configuration
Tom Lane wrote:
Teodor S
think this
might be wrong?
No. I believe that pgsql doesn't support any encoding that cannot be recoded from
UTF8, at least for non-hieroglyphic languages.
Teodor Sigaev E-mail: [EMAIL PROTECTED]
r and stop-file should be used.
For the Russian language with UTF8 encoding, lword should use the English stemmer,
but for the Italian language it should use the Italian stemmer. A Russian word
cannot contain any ASCII characters, but an Italian word may consist only of ASCII.
Teodor Sigaev
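A minimal sketch of the rule described above, not code from tsearch2: the
helper names is_all_ascii and choose_stemmer, and the use of plain language
names as stemmer identifiers, are assumptions made for illustration only.

```c
#include <string.h>

/* Returns 1 if every byte of word is 7-bit ASCII, 0 otherwise. */
static int
is_all_ascii(const char *word)
{
    const unsigned char *p = (const unsigned char *) word;

    for (; *p != '\0'; p++)
        if (*p > 0x7f)
            return 0;
    return 1;
}

/*
 * Hypothetical helper: pick a stemmer name for one word.
 * config_lang is an assumed configuration language name.
 */
static const char *
choose_stemmer(const char *config_lang, const char *word)
{
    /* A Russian word never consists of ASCII only, so such a word is Latin. */
    if (strcmp(config_lang, "russian") == 0 && is_all_ascii(word))
        return "english";

    /* Italian (and similar) words may be pure ASCII: keep the own stemmer. */
    return config_lang;
}
```

With these assumptions an all-ASCII word under a "russian" configuration goes
to the English stemmer, while the same word under "italian" stays with the
Italian one.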
self, the
same is to compatibility between two dictionaries, list of dictionaries.
In practice, we haven't seen any disasters after changes in configuration: until
reindexing, search just becomes less precise.
Teodor Sigaev
st WHERE to_tsvector(mytextcol) @@ query.
Teodor Sigaev E-mail: [EMAIL PROTECTED]
WWW: http://www.sigaev.ru/
---(end of broadcast)---
TIP 1: if posting/reading
. But it's an incompatible change and causes some difficulties
for the DBA. If the server locale is ISO (or KOI8 or any other) and the file is in
UTF8, then text editors/tools might be confused.
Teodor Sigaev E-mail: [EMAIL
I'd suggest allowing either full names ("swedish") or the standard
two-letter abbreviations ("sv"). But let's stay away from locale names.
We can use database's encoding name (the same names used in initdb -E)
Teodor Sigaev
Where will the index store the GUC used at index creation?
So we create a pg_catalog full text configuration named UTF8.en-US, and
some others like ru_RU.UTF-8.
Teodor Sigaev E-mail: [EMAIL PROTECTED]
danish, dutch, finnish, french, german, hungarian, italian, norwegian,
portuguese, spanish, swedish, russian and english
Albe Laurenz wrote:
Tom Lane wrote:
Teodor Sigaev <[EMAIL PROTECTED]> writes:
So, it's needed to change dictinitoption format of snowball
dictionaries to
) simplifies life a lot in most cases.
Teodor Sigaev E-mail: [EMAIL PROTECTED]
WWW: http://www.sigaev.ru/
h completing the
dictionary support functions to go with this infrastructure.
How will we synchronize our changes to the patch?
Teodor Sigaev E-mail: [EMAIL PROTECTED]
WWW: http://www.sigaev.ru/
It should be. Instances of ispell (and synonym, thesaurus) dictionaries
differ only in the dict_initoption part, so there will be only one entry in
pg_ts_dict_template and several in pg_ts_dict.
No, I was thinking of still having just one pg_ts_dict catalog (no template)
but removing its d
french, german, hungarian, italian, norwegian, portuguese, spanish,
swedish, russian and english languages.
Teodor Sigaev E-mail: [EMAIL PROTECTED]
WWW: http://www.sigaev.ru/
But there are reasons to allow users to register new templates, and in fact
we know people/projects with application-dependent dictionaries. How
could they dump/reload their dictionaries?
The same way as pg_am does.
Teodor Sigaev E-mail: [EMAIL PROTECTED
Tom Lane wrote:
Teodor Sigaev <[EMAIL PROTECTED]> writes:
The reason to keep an SQL-ish interface to dictionaries is simplicity of
configuration. Snowball's stemmers are useful as is, but an ispell dictionary
requires some configuration before use.
Yeah. I had been wond
or cause backend crash. But usage of such tsvector column could be limited - not
all words will be searchable.
This sounds sort of analogous to the issues collations bring up.
Teodor Sigaev E-mail: [
can we hard-code into the backend, and just update for every major
release like we do for encodings?
Sorry, none of them :(. We know projects which introduce a new parser or a new
dictionary. Config and map are changed very often.
Teodor Sigaev E-mail: [EMAIL
complex task: experienced users could use several configurations
simultaneously. For example, indexing uses a configuration which doesn't reject
stop-words, but default searching uses a configuration which rejects
stop-words. BTW, the same effect may be produced by dictionary
urce tree (in src/snowball) only
three directories from snowball_code.tgz:
- /compiler - compiler from *.sbl to *.c
- /runtime - common code for all stemmers
- /algorithms - *.sbl files
and use pgsql's makefile infrastructure to compile the stemmers.
Comments, objections?
Teodor Sigaev
I access the original intarray that is being referenced by this
The index doesn't store the original int[] at all. From a GiST support function
there is no way to get access to the table's value :(.
Teodor Sigaev
with XLogInsert. That's not safe, MarkBufferDirty needs to be called
before XLogInsert to avoid a race condition in checkpoint, see comments
in SyncOneBuffer in bufmgr.c for an explanation.
Ugh, thank you, fixed. It's a leftover from my misunderstanding of WriteBuffer().
p://snowball.tartarus.org/dist/snowball_code.tgz tarball.
2 Snowball's build infrastructure doesn't support a Windows target.
I agree with simplifying the support process but, IMHO, it's much simpler to do
it with C sources and pgsql's build infrastructure
And where should it be
Teodor Sigaev E-mail: [EMAIL PROTECTED]
WWW: http://www.sigaev.ru/
Patch is synced with current CVS HEAD and synced with bugfixes in
Teodor Sigaev E-mail: [EMAIL PROTECTED]
WWW: http://www.sigaev.ru
nd words
( German, Norwegian ). So, please, test it - we don't know those languages at
2 added recent fixes of contrib/tsearch2
3 fix usage of fopen/fclose
Teodor Sigaev E-mail: [EMAIL PROTECTED]
on in tsearch_core patch
I find the direct use of malloc/realloc/strdup to be poor style as well
--- backend code that is not using palloc needs to have *very* good
reason to do so, and I see none here.
Already in tsearch_core patch.
Teodor Sigaev E-mail: [
r more than two bytes...
It doesn't matter significantly: the file is read once per backend lifetime.
Teodor Sigaev E-mail: [EMAIL PROTECTED]
WWW: http://www.sigaev.ru/
FWIW, it looks like it failed to reject stopwords. Is it possible you
I suppose the problem is with '\r\n'... Try attached patch.
Teodor Sigaev E-mail: [EMAIL PROTECTED]
WWW: http://www
if it's worth making user-created fts configurations visible ahead of
system configurations in pg_catalog, if pg_catalog was not *explicitly*
specified in the search_path?
Teodor Sigaev E-mail: [EMAIL PROTECTED]
t I'm afraid that will happen
often for fulltext configurations. How can we avoid such situations?
Forbid changes on built-in objects?
'alter operator class .. owner to ...' isn't dumped either.
Teodor Sigaev E-mail: [EMAIL PROTECT
. (In other words, it's an optimisation rather than a big
I like your suggestion - it's useful for GiST/GIN fulltext indexing.
Teodor Sigaev E-mail: [EMAIL PROTECTED]
WWW: http://www
mn on which index was created,
so gistgettuple can not return tuple in original form - but sometimes
gistgettuple may be sure that recheck isn't needed.
That would completely replace the current RECHECK-option we have, right?
Yeah, this is possible.
stored in index as
is, but a large value is compressed with lossy techniques. So, GiST might return a
tuple which does not need a recheck.
Teodor Sigaev E-mail: [EMAIL PROTECTED]
WWW: http://www
at? New opclass layout, new opfamily table - users don't notice those
changes at all.
Teodor Sigaev E-mail: [EMAIL PROTECTED]
WWW: http://www.sigaev.ru/
l API. Putting
tsearch in core removes a lot of such problems. For example, who notices changes
in the pg_am table from release to release? Really only developers/hackers, not
ordinary users
Teodor Sigaev E-mail: [EMAIL PROT
SING FULLTEXT (textcolumn)
Teodor Sigaev E-mail: [EMAIL PROTECTED]
WWW: http://www.sigaev.ru/
OP FULLTEXT MAPPING ON pg FOR email, url, sfloat, uri, float;
Patch is http://www.sigaev.ru/misc/tsearch_core-0.38.1.gz
A comparison of that syntax with current tsearch2 is placed at
So, which is
left and right buffers?
Real writes are performed by the bgwriter process; in the backend we should just
mark the buffer as dirty with a MarkBufferDirty call.
Teodor Sigaev E-mail: [EMAIL PROTECTED]
WWW: http:
e you searched our list archives?
Teodor Sigaev E-mail: [EMAIL PROTECTED]
WWW: http://www.sigaev.ru/
I've already started, don't worry about that. Cube has been broken since TOAST
was implemented :)
Gregory Stark wrote:
"Teodor Sigaev" <[EMAIL PROTECTED]> writes:
input value. As I remember, only R-Tree emulation over boxes, contrib/seg and
contrib/cube have simple compress
r has been toasted.
Should I commit it, or will you include it in your patch?
Teodor Sigaev E-mail: [EMAIL PROTECTED]
WWW: http://www.sigaev.ru/
*** ./contrib/intarray.orig/./_int_gist.c Tue Mar
is three similar functions/macros:
gistentryinit, gistcentryinit and gistdentryinit :) Those names were chosen by
the authors who initially developed GiST in pgsql.
Well if you're doing everything in short-lived memory contexts then we don't
even need this.
Teodor Sigaev
tum came from originally and how it ended up
stored in packed format.
Can you provide your patch (in current state) and test suite? Or backtrace at
Teodor Sigaev E-mail: [EMAIL PROTECTED]
sing DatumGetPointer and PG_GETARG_POINTER and having to manually cast
everywhere, no? It seems like there's a lot of extra pain to maintain the code
in the present style with all the manual casts.
Of course, I agree. It's just that PG_FREE_IF_COPY is an extra call in support m
level function.
Try to play with SP-GiST (http://www.cs.purdue.edu/spgist/). SP-GiST is a
modification of GiST for Space Partitioning Trees. But their patch will not work
with 8.2 and up because of its lack of concurrency. 8.2 doesn't support indexes
without concurrency.
GiST code works in separate memory context to prevent memory leaks. See
Teodor Sigaev E-mail: [EMAIL PROTECTED]
WWW: http://www.sigaev.ru/
index should be decompressed by GiST decompress
support method.
Other places:
- compress might get the original value when a new one is inserted; in all other
cases it gets the value produced by the decompress method.
- query in consistent method
Teodor Sigaev E
Oops, sorry, I missed the checkpoint keyword
Teodor Sigaev wrote:
What am I missing?
Seems, it's about that
Teodor Sigaev E-mail: [EMAIL PROT
_fulltext_mapping(cfgname, '{lexemetypename[, ...]}'::text[],
'{dictname1[, ...]}'::text[]);
Seems rather ugly to me...
And a function interface does not provide autocompletion and online help in psql.
\df shows only the types of the arguments, not their meaning.
Teodor Sigaev
IT init_function ]
[ OPT opt_text ];
[ { LEXIZE lexize_function | INIT init_function | OPT opt_text } [...] ]
LIKE template_dictname;
Teodor Sigaev E-mail: [EMAIL PROTECTED]
Precise definition for "latin" in C locale please. Are you saying that
single byte encoding with range 0-7f? is "latin"? If so, it seems they
are exactly the same as ASCII.
p_islatin returns true for ASCII alpha characters.
Teodor Sigaev
leave unchanged en_stem for any latin word.
Teodor Sigaev E-mail: [EMAIL PROTECTED]
WWW: http://www.sigaev.ru/
The GiST implementation supports only unordered trees (btree_gist is a
kind of emulation) and it cannot guarantee that equal keys will be close in the
index. That depends on the picksplit and gistpenalty methods (problem/optimization)
and on the data set.
Teodor Sigaev
with UTF8 take two bytes.
Teodor Sigaev E-mail: [EMAIL PROTECTED]
WWW: http://www.sigaev.ru/
Implementing the "replace these TIDs" operation atomically would be
simple, except for the new bitmap index am. It should be possible there
That isn't simple (maybe not even possible) for GIN.
Teodor Sigaev E-mail: [
earch2) always means empty
result and fast operation of GIN, so tsearch2's users will not face the error 'GIN
index does not support search with void query'
Comments, objections, suggestions?
Teodor Sigaev E-mail: [EMAIL PROTECTED]
This might be a good idea, but it's hardly transparent; it can be
counted on to break the applications of just about everyone using those
modules today.
Hmm, can we make a separate schema for all contrib modules and include it in the
default search_path? It will not affect most users.
Teodor Sigaev E-mail: [EMAIL PROTECTED]
WWW: http://www.sigaev.ru/
found texts. GIST performance may be decreased too - GIST indexing of
tsvector is lossy.
Teodor Sigaev E-mail: [EMAIL PROTECTED]
WWW: http://www.sigaev.ru/
ands. Next, functions have no autocomplete feature or built-in
quick help - if you don't remember the exact kind/type of a function's argument(s)
then you have to read the docs.
Teodor Sigaev E-mail: [EMAIL PROTECTED]
2 it requires the keywords MATCH & AGAINST, which cannot be function names
without quoting.
Internal API changed sometimes (not every release), but I don't see a problem
here: all other internal API's in postgres are often ch
8000 )
and before committing change them to the lowest possible.
When HEAD is under heavy development, oids change quickly, so you will need to
rearrange your oids for each snapshot.
Teodor Sigaev E-mail: [EMAIL PROTECTED]
similar way as the character conversion
library does.
If there are no objections then we plan to commit the patch tomorrow or the day after.
Before committing, I'll change oids from 5000+ to lower values to prevent holes
in oids. And after that, I'll remove tsearch2 cont
nction may be
defined as:
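static int
pg_wc_isalpha(pg_wchar c)
{
    /* single-byte range: the plain C classifier is enough */
    if (c >= 0 && c <= UCHAR_MAX)
        return isalpha((unsigned char) c);
    /* wider characters: use the wide classifier in a UTF8 database */
    else if (GetDatabaseEncoding() == PG_UTF8)
        return iswalpha((wint_t) c);
    return 0;   /* assumed fallback; the excerpt is cut off here */
}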
I would like to suggest patches for OR-clause optimization and using index for
searching NULLs.
Teodor Sigaev E-mail: [EMAIL PROTECTED]
WWW: http://www.sigaev.ru/
o tsearch2 locale part.
Teodor Sigaev E-mail: [EMAIL PROTECTED]
WWW: http://www.sigaev.ru/
set client_encoding='KOI8';
SELECT 'Ä' ~* '[[:alpha:]]' as "true";
using Append node - without
reintroducing multi-pass indexscan. Moreover, it allows sorting OR clauses to
match the sort order.
Teodor Sigaev E-mail: [EMAIL PROTECTED]
WWW: http://www.siga
Note that a bitmap scan or multi-pass indexscan (OR clause scan) has NIL
pathkeys since we can say nothing about the overall order of its result.
It seems to me that multi-pass indexscan was removed after introducing
Teodor Sigaev E
u to test under Windows? Thank you.
Teodor Sigaev E-mail: [EMAIL PROTECTED]
WWW: http://www.sigaev.ru/
diff -c -r -N ../tsearch2.orig/ts_locale.c ./ts_locale.c
*** ../tsearch2.orig/ts_locale.c	Fri Jan 1
TParser *prs) {
! if (lc_ctype_is_c())
! {
! if (c > 0x7f)
! return 1;
I have some doubts that any character greater than 0x7f is an alpha symbol.
Is it a simple assumption or a workaround?
Teodor Sigaev
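The doubt above can be checked against the standard classifier. In the "C"
locale the C standard restricts isalpha() to the ASCII letters, so treating
every byte above 0x7f as alpha is a choice made by the patch, not something
the locale itself reports. The sketch below is standalone and is not part of
the patch; the function name is made up for illustration.

```c
#include <ctype.h>
#include <locale.h>

/*
 * Count the byte values in 0x80..0xff that isalpha() accepts
 * under the "C" locale.
 */
static int
count_high_alpha_in_c_locale(void)
{
    int c;
    int count = 0;

    setlocale(LC_CTYPE, "C");
    for (c = 0x80; c <= 0xff; c++)
        if (isalpha(c))
            count++;
    return count;
}
```

On a conforming C library the count is zero, which is why the patched code has
to special-case c > 0x7f if it wants such bytes to be treated as letters.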
Nice, thanks a lot.
Tom Lane wrote:
Teodor Sigaev <[EMAIL PROTECTED]> writes:
Just a refresh so that it applies cleanly..
Applied with some revisions, and pg_dump support and regression tests
regards, to
Sorry for the delay, I was on holidays :)
Did you test the patch on the Windows platform?
Tatsuo Ishii wrote:
I have tested with locale-enabled environment and found a bug. Included
is the new version of patches.
Teodor, Oleg, what do you think about these patches?
If ok, shall I commit to CVS head
This is not responding to my concern. What you presented was an
> Sorry, I see your point now.
Is that test enough? Or should I do more?
Teodor Sigaev E-mail: [EMAIL PROTECTED]
WWW: http://www.sigaev
Just a refresh so that it applies cleanly..
Are there any objections to committing?
Teodor Sigaev E-mail: [EMAIL PROTECTED]
WWW: http://www.sigaev.ru
patch: http://www.sigaev.ru/misc/tsearch_core-0.27.gz
new version, because of XML commit - old patch doesn't apply cleanly.
Teodor Sigaev E-mail: [EMAIL PROT
0.9 doesn't apply cleanly after Peter's changes, so, new version
Teodor Sigaev wrote:
>> Perhaps an array of int4 would be better? How much
The patch needs mor
TTERN] list fulltext dictionaries (add "+" for more detail)
\dFp [PATTERN] list fulltext parsers (add "+" for more detail)
Teodor Sigaev E-mail: [EMAIL PROTECTED]
WWW: http://www.sigaev.ru/
>> Perhaps an array of int4 would be better? How much
The patch needs more cleanup before applying, too, eg make comments
match code, get rid of unused keywords added to gram.y.
gards, tom lane
Teodor Sigaev E-mail: [EMAIL PROTECTED]
WWW: http://www.sigaev.ru/
different storage types.
> I don't have any idea whether opclass groups would be useful for GiST or
> GIN indexes, but maybe Oleg and Teodor can think of applications.
I'm afraid it isn't useful for GiST/GIN - strategies are not defined at all and
planner can't know
Hmm, IMHO, it's needed for a consistent interface: nobody adds a new column to a
table by editing pg_class & pg_attribute, nobody looks up the description of a
table by selecting values from a system table.
Tom Lane wrote:
Teodor Sigaev <[EMAIL PROTECTED]> writes:
Now we (Oleg and me)
dF PATTERN - describe configuration with used parser and lexeme's mapping
\dFd - list of dictionaries
\dFd PATTERN - describe dictionary
\dFp - parser's list
\dFp PATTERN - describe parser
Teodor Sigaev E-ma
Sorry for noise - it's mentioned in README.tsearch2
Teodor Sigaev E-mail: [EMAIL PROTECTED]
WWW: http://www.sigaev.ru/
(1 row)
':' is the separator between a lexeme and its position information
Teodor Sigaev E-mail: [EMAIL PROTECTED]
WWW: http://www.sigaev.ru/
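As a rough illustration of the ':' separator mentioned above, the following
sketch splits one "lexeme:position" token. It is not tsearch2 code: the helper
name is hypothetical, it reads only the first position after the colon, and it
does not strip the quotes that the real tsvector output puts around lexemes.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: split one "lexeme:position" token at the last ':'.
 * Copies the lexeme into lexeme_out and returns the first position, or
 * -1 if the token has no ':' or the lexeme does not fit in the buffer.
 */
static long
split_lexeme_position(const char *token, char *lexeme_out, size_t out_size)
{
    const char *colon = strrchr(token, ':');
    size_t      len;

    if (colon == NULL)
        return -1;

    len = (size_t) (colon - token);
    if (len >= out_size)
        return -1;

    memcpy(lexeme_out, token, len);
    lexeme_out[len] = '\0';

    /* parse the decimal position that follows the separator */
    return strtol(colon + 1, NULL, 10);
}
```

For a token such as cat:3 this yields the lexeme "cat" and position 3; a token
without a colon is reported as -1.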
is limit will be replaced
by 16383.
13.4.2 Only 256 positions per lexeme.
Some useful articles
k words in a document?
I don't see any problem. tsvector size should not be greater than 1Mb however.
Teodor Sigaev E-mail: [EMAIL PROTECTED]
WWW: http://www.sigaev.ru/
for speedup searches, linguistic part is still in
It's possible to use tsearch2 without any indexes at all. GiST and GIN are a way
to speed up searches.
Of course, you can develop another framework for full text search, and that
framework may use GIN as it wishes
Teodor Sigaev E-mail: [
g_to_array seems to eat several Gig bytes of
memory. ~70k array elements means there are same number of words in a
document which is not too big in a large text IMO.
Do you mean 70k unique lexemes? Ugh.
Why don't you use the tsearch framework?
Teodor Sigaev E-ma
There's been work on it. Teodor cleaned it up for HEAD and looked at
adding GiST support. I believe he's waiting for 8.2 to release.
Yep, I have a bundle of patches and I'm waiting for the 8.2 branch to be split off from HEAD.
Teodor Sigaev E-mail: [
Search time in GIN depends only weakly on the number of words (unlike
tsearch2/GiST), but insertion of a row may be rather slow
Teodor Sigaev E-mail: [EMAIL PROTECTED]
WWW: http://www.sigaev.ru/
The problem as I remember it is pg_trgm not tsearch2 directly, I've sent a
self contained test case directly to Teodor which shows the error.
'ERROR: index row requires 8792 bytes, maximum size is 8191'
Uh, I see. But I'm really surprised why do you use pg_trgm on b
om build?
Can you send exact error message?
Teodor Sigaev E-mail: [EMAIL PROTECTED]
WWW: http://www.sigaev.ru/
'psql DB < btree_gist.sql'.
Teodor Sigaev E-mail: [EMAIL PROTECTED]
WWW: http://www.sigaev.ru/