Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread db
>> Try the sequence below. Then, try to dump and then reload the database. >> When you try to reload it, you will get an error: >> >> ERROR: invalid byte sequence for encoding "UTF8": 0xbd > > I know this could be a problem (like chr() with invalid byte pattern). And that's enough of a problem al
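
A minimal sketch, not the sequence from the thread (which is cut off in this preview), of why a byte such as 0xbd on its own is rejected as UTF-8: it has the bit pattern of a continuation byte and is only valid after a suitable lead byte. The helper name `is_valid_utf8` is an assumption made for illustration, and Python's strict decoder stands in for the server's validation.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8, False otherwise."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# 0xbd is 0b10111101: a continuation byte with no lead byte before it.
lone_continuation = b"\xbd"

# The same byte is fine as the second byte of U+00BD (VULGAR FRACTION ONE HALF).
half_sign = "\u00bd".encode("utf-8")

print(is_valid_utf8(lone_continuation))
print(is_valid_utf8(half_sign))
```

A text value that holds the lone byte can be written out by a dump but fails this kind of check when it is read back in, which is the failure mode the quoted error message describes.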

Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Jeff Davis
On Tue, 2007-09-11 at 14:50 +0900, Tatsuo Ishii wrote: > > > On Tue, 2007-09-11 at 12:29 +0900, Tatsuo Ishii wrote: > > > Please show me concrete examples how I could introduce a > vulnerability > > > using this kind of convert() usage. > > > > Try the sequence below. Then, try to dump and then r

Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Martijn van Oosterhout
On Tue, Sep 11, 2007 at 11:27:50AM +0900, Tatsuo Ishii wrote: > SELECT * FROM japanese_table ORDER BY convert(japanese_text using > utf8_to_euc_jp); > > Without using convert(), he will get random order of data. This is > because Kanji characters are in random order in UTF-8, while Kanji > charac

Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Tatsuo Ishii
> On Tue, 2007-09-11 at 12:29 +0900, Tatsuo Ishii wrote: > > Please show me concrete examples how I could introduce a vulnerability > > using this kind of convert() usage. > > Try the sequence below. Then, try to dump and then reload the database. > When you try to reload it, you will get an err

Re: [HACKERS] Ts_rank internals

2007-09-10 Thread Teodor Sigaev
I tried to understand how ts_rank works, but I failed. What does Cover function do? How does it work? What is the DocRepresentation data structure like? I can see the definition of the struct, and the get_docrep function to convert to that format, but by reading those I can't figure out what the r

Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Jeff Davis
On Tue, 2007-09-11 at 12:29 +0900, Tatsuo Ishii wrote: > Please show me concrete examples how I could introduce a vulnerability > using this kind of convert() usage. Try the sequence below. Then, try to dump and then reload the database. When you try to reload it, you will get an error: ERROR: i

[HACKERS] Testing 8.3 LDC vs. 8.2.4 with aggressive BGW

2007-09-10 Thread Greg Smith
Renaming the old thread to more appropriately address the topic: On Wed, 5 Sep 2007, Kevin Grittner wrote: Then I would test the new background writer with synchronous commits under the 8.3 beta, using various settings. The 0.5, 0.7 and 0.9 settings you recommended for a test are how far from

Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Jeff Davis
On Tue, 2007-09-11 at 11:53 +0900, Tatsuo Ishii wrote: > > Isn't the collation a locale issue, not an encoding issue? Is there a > > ja_JP.UTF-8 that defines the proper order? > > I don't think it helps. The point is, he needs different language's > collation, while PostgreSQL allows only one coll

[HACKERS] What is happening on buildfarm member dugong?

2007-09-10 Thread Tom Lane
dugong has been failing contribcheck repeatably for the last day or so, with a very interesting symptom: CREATE DATABASE is failing with ERROR: could not fsync segment 0 of relation 1663/40960/41403: No such file or directory ERROR: checkpoint request failed HINT: Consult recent messages in th

Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Tatsuo Ishii
> Tatsuo Ishii <[EMAIL PROTECTED]> writes: > >> BTW, it strikes me that there is another hole that we need to plug in > >> this area, and that's the convert() function. Being able to create > >> a value of type text that is not in the database encoding is simply > >> broken. Perhaps we could make

Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Tom Lane
Tatsuo Ishii <[EMAIL PROTECTED]> writes: >> BTW, it strikes me that there is another hole that we need to plug in >> this area, and that's the convert() function. Being able to create >> a value of type text that is not in the database encoding is simply >> broken. Perhaps we could make it work o

Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Tom Lane
Andrew Dunstan <[EMAIL PROTECTED]> writes: > I'm not sure we are going to be able to catch every path by which > invalid data can get into the database in one release. I suspect we > might need two or three goes at this. (I'm just wondering if the > routines that return cstrings are a possible v

Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Andrew Dunstan
Jeff Davis wrote: On Tue, 2007-09-11 at 11:27 +0900, Tatsuo Ishii wrote: BTW, it strikes me that there is another hole that we need to plug in this area, and that's the convert() function. Being able to create a value of type text that is not in the database encoding is simply broken. Per

Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Andrew Dunstan
Tatsuo Ishii wrote: BTW, it strikes me that there is another hole that we need to plug in this area, and that's the convert() function. Being able to create a value of type text that is not in the database encoding is simply broken. Perhaps we could make it work on bytea instead (providing a

Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Tatsuo Ishii
> On Tue, 2007-09-11 at 11:27 +0900, Tatsuo Ishii wrote: > > > BTW, it strikes me that there is another hole that we need to plug in > > > this area, and that's the convert() function. Being able to create > > > a value of type text that is not in the database encoding is simply > > > broken. Per

Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Jeff Davis
On Tue, 2007-09-11 at 11:27 +0900, Tatsuo Ishii wrote: > > BTW, it strikes me that there is another hole that we need to plug in > > this area, and that's the convert() function. Being able to create > > a value of type text that is not in the database encoding is simply > > broken. Perhaps we co

Re: [HACKERS] "txn" in pg_stat_activity

2007-09-10 Thread Tom Lane
Neil Conway <[EMAIL PROTECTED]> writes: > I personally find "xact" to be a less intuitive abbreviation of > "transaction" than "txn", but for the sake of consistency, I agree it is > probably better to use "xact_start". Barring other objections, I'll go make this happen. r

Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Tatsuo Ishii
> Tatsuo Ishii <[EMAIL PROTECTED]> writes: > > If you regard the unicode code point as simply a number, why not > > regard the multibyte characters as a number too? > > Because there's a standard specifying the Unicode code points *as > numbers*. The mapping from those numbers to UTF8 strings (an

Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Andrew Dunstan
Tatsuo Ishii wrote: If you regard the unicode code point as simply a number, why not regard the multibyte characters as a number too? I mean, since 0xC2A9 = 49833, "select chr(49833)" should work fine no? No. The number corresponding to a given byte pattern depends on the endianness of
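
A small sketch of the point being argued here, assuming Python's `int.from_bytes` as a stand-in for "reading a byte pattern as a number". The code point U+00A9 is a single well-defined number, while the integer obtained from its UTF-8 bytes C2 A9 depends on the byte order chosen.

```python
code_point = 0xA9  # U+00A9 COPYRIGHT SIGN, defined by Unicode as a number

# The UTF-8 encoding of that code point is the byte sequence C2 A9.
utf8_bytes = chr(code_point).encode("utf-8")

# Treating the byte sequence as an integer requires picking a byte order.
as_big_endian = int.from_bytes(utf8_bytes, "big")        # 0xC2A9
as_little_endian = int.from_bytes(utf8_bytes, "little")  # 0xA9C2

print(hex(code_point), utf8_bytes.hex(), as_big_endian, as_little_endian)
```

The value 49833 from the quoted question only appears under the big-endian reading; the code point itself needs no such choice.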

Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Tom Lane
Tatsuo Ishii <[EMAIL PROTECTED]> writes: > If you regard the unicode code point as simply a number, why not > regard the multibyte characters as a number too? Because there's a standard specifying the Unicode code points *as numbers*. The mapping from those numbers to UTF8 strings (and other repr

Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Tatsuo Ishii
> Tatsuo Ishii wrote: > > > > I don't understand whole discussion. > > > > Why do you think that employing the Unicode code point as the chr() > > argument could avoid endianness issues? Are you going to represent > > Unicode code point as UCS-4? Then you have to specify the endianness > > anyway.

Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Tatsuo Ishii
> Tatsuo Ishii wrote: > > > > I don't understand whole discussion. > > > > Why do you think that employing the Unicode code point as the chr() > > argument could avoid endianness issues? Are you going to represent > > Unicode code point as UCS-4? Then you have to specify the endianness > > anyway.

Re: [HACKERS] "txn" in pg_stat_activity

2007-09-10 Thread Neil Conway
On Mon, 2007-09-10 at 21:04 -0400, Tom Lane wrote: > I have just noticed that a column "txn_start" has appeared in > pg_stat_activity since 8.2. It's a good idea, but who chose the name? Me. > I'm inclined to rename it to "xact_start", which is an abbreviation > that we *do* use in the code, and

[HACKERS] "txn" in pg_stat_activity

2007-09-10 Thread Tom Lane
I have just noticed that a column "txn_start" has appeared in pg_stat_activity since 8.2. It's a good idea, but who chose the name? We do not use that abbreviation for "transaction" anywhere else in Postgres, certainly not in any user-exposed places. I'm inclined to rename it to "xact_start", whi

Re: [HACKERS] [ADMIN] reindexdb hangs

2007-09-10 Thread Tom Lane
Alvaro Herrera <[EMAIL PROTECTED]> writes: > I am unsure if I should backpatch to 8.1: the code in cluster.c has > changed, and while it is relatively easy to modify the patch, this is a > rare bug and nobody has reported it in CLUSTER (not many people clusters > temp tables, it seems). Should I p

Re: [HACKERS] A Silly Idea for Vertically-Oriented Databases

2007-09-10 Thread Alvaro Herrera
Avery Payne wrote: > >I thought maybe we can call it COAST, Column-oriented attribute storage > technique, :-) > > I like it. :-) I > just wish I would have read this before applying for a project name > at pgfoundry, the current proposal is given as "pg-cstore".

Re: [HACKERS] [ADMIN] reindexdb hangs

2007-09-10 Thread Alvaro Herrera
Tom Lane wrote: > Alvaro Herrera <[EMAIL PROTECTED]> writes: > > I'm not sure I follow. Are you suggesting adding a new function, > > similar to pg_class_ownercheck, which additionally checks for temp-ness? > > No, I was just suggesting adding the check for temp-ness in cluster() > and cluster_re

Re: [HACKERS] A Silly Idea for Vertically-Oriented Databases

2007-09-10 Thread Avery Payne
>ISTM we would be able to do this fairly well if we implemented >Index-only columns. i.e. columns that don't exist in the heap, only in >an index. >Taken to the extreme, all columns could be removed from the heap and >placed in an index(es). Only the visibility information would remain on >

Re: [HACKERS] ispell dictionary broken in CVS HEAD ?

2007-09-10 Thread Tom Lane
Teodor Sigaev <[EMAIL PROTECTED]> writes: >> Is anyone working on providing basic regression tests for the different >> dictionary types? Seems like the main stumbling block is providing > I make some small tests (http://www.sigaev.ru/misc/ispell_samples.tgz). So, > what is better practice to b

Re: [HACKERS] ispell dictionary broken in CVS HEAD ?

2007-09-10 Thread Teodor Sigaev
Is anyone working on providing basic regression tests for the different dictionary types? Seems like the main stumbling block is providing I make some small tests (http://www.sigaev.ru/misc/ispell_samples.tgz). So, what is better practice to builtin it? Make it installable with regular proc

Re: [HACKERS] GucContext of log_autovacuum

2007-09-10 Thread Bruce Momjian
FYI, this has been committed by Tom. --- ITAGAKI Takahiro wrote: > The GucContext of log_autovacuum is PGC_BACKEND in the CVS HEAD, > but should it be PGC_SIGHUP? We cannot modify the variable on-the-fly > because the parame

Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Martijn van Oosterhout
On Mon, Sep 10, 2007 at 11:48:29AM -0400, Tom Lane wrote: > BTW, I'm sure this was discussed but I forgot the conclusion: should > chr(0) throw an error? If we're trying to get rid of embedded-null > problems, seems it must. It is pointed out on Wikipedia that Java sometimes uses the byte pair C0

Re: [HACKERS] Maybe some more low-hanging fruit in the latestCompletedXid patch.

2007-09-10 Thread Florian G. Pflug
Tom Lane wrote: "Florian G. Pflug" <[EMAIL PROTECTED]> writes: Currently, we do not assume that either the childXids array, nor the xid cache in the proc array are sorted by ascending xid order. I believe that we could simplify the code, further reduce the locking requirements, and enabled a tra

Re: [HACKERS] Are we done with sync-commit-defaults-to-off patch?

2007-09-10 Thread Andrew Dunstan
Tom Lane wrote: Alvaro Herrera <[EMAIL PROTECTED]> writes: Maybe we should lower the autovac naptime too, just to make it do some more stuff (and to see if it breaks something else just because of being running). Well, Andrew has committed the pg_regress extension to allow buildfarm

Re: [HACKERS] Maybe some more low-hanging fruit in the latestCompletedXid patch.

2007-09-10 Thread Tom Lane
"Florian G. Pflug" <[EMAIL PROTECTED]> writes: > Currently, we do not assume that either the childXids array, nor > the xid cache in the proc array are sorted by ascending xid order. > I believe that we could simplify the code, further reduce the locking > requirements, and enabled a transaction to

Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Andrew Dunstan
Tom Lane wrote: OK. Looking back, there was also some mention of changing chr's argument to bigint, but I'd counsel against doing that. We should not need it since we only support 4-byte UTF8, hence code points only up to 21 bits (and indeed even 6-byte UTF8 can only have 31-bit code points,
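
The bit counts quoted above can be checked with a few lines of arithmetic. This is a sketch under the usual UTF-8 layout (a lead byte with n leading one bits and a zero, followed by continuation bytes carrying 6 payload bits each); the function name `payload_bits` is an assumption for illustration.

```python
MAX_CODE_POINT = 0x10FFFF  # highest Unicode code point


def payload_bits(n_bytes: int) -> int:
    """Payload bits carried by an n-byte UTF-8-style sequence (2 <= n <= 6)."""
    lead_bits = 7 - n_bytes          # lead byte: n one bits, a zero, then payload
    return lead_bits + 6 * (n_bytes - 1)


print(MAX_CODE_POINT.bit_length())   # bits needed for any Unicode code point
print(payload_bits(4))               # capacity of a 4-byte sequence
print(payload_bits(6))               # capacity of the old 6-byte form
print(2**31 - 1 >= 2**payload_bits(6) - 1)  # does it fit a signed 32-bit int?
```

Both figures stay within a signed 32-bit integer, which is consistent with the advice against widening the argument to bigint.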

[HACKERS] Ts_rank internals

2007-09-10 Thread Heikki Linnakangas
Hi, I tried to understand how ts_rank works, but I failed. What does Cover function do? How does it work? What is the DocRepresentation data structure like? I can see the definition of the struct, and the get_docrep function to convert to that format, but by reading those I can't figure out what t

Re: [HACKERS] integrated tsearch doesn't work with non utf8 database

2007-09-10 Thread Tom Lane
Teodor Sigaev <[EMAIL PROTECTED]> writes: >> I think Teodor's solution is wrong as it stands, because if the subquery >> finds matches for mapcfg and maptokentype, but none of those rows >> produce a non-null ts_lexize result, it will instead emit one row with a >> null result, which is not what sh

Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Tom Lane
Andrew Dunstan <[EMAIL PROTECTED]> writes: > Tom Lane wrote: >> BTW, I'm sure this was discussed but I forgot the conclusion: should >> chr(0) throw an error? > I think it should, yes. OK. Looking back, there was also some mention of changing chr's argument to bigint, but I'd counsel against doi

Re: [HACKERS] integrated tsearch doesn't work with non utf8 database

2007-09-10 Thread Teodor Sigaev
I think Teodor's solution is wrong as it stands, because if the subquery finds matches for mapcfg and maptokentype, but none of those rows produce a non-null ts_lexize result, it will instead emit one row with a null result, which is not what should happen. But concatenation with NULL will have r

Re: [HACKERS] integrated tsearch doesn't work with non utf8 database

2007-09-10 Thread Tom Lane
"Heikki Linnakangas" <[EMAIL PROTECTED]> writes: > Tom Lane wrote: >> Uh, how will that help? AFAICS it still has to call ts_lexize with >> every dictionary. > No, ts_lexize is no longer in the seq scan filter, but in the sort key > that's calculated only for those rows that match the filter 'map

Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Andrew Dunstan
Tom Lane wrote: BTW, I'm sure this was discussed but I forgot the conclusion: should chr(0) throw an error? If we're trying to get rid of embedded-null problems, seems it must. I think it should, yes. cheers andrew ---(end of broadcast)

Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Andrew Dunstan
Tatsuo Ishii wrote: I don't understand whole discussion. Why do you think that employing the Unicode code point as the chr() argument could avoid endianness issues? Are you going to represent Unicode code point as UCS-4? Then you have to specify the endianness anyway. (see the UCS-4 standard

Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Martijn van Oosterhout
On Tue, Sep 11, 2007 at 12:30:51AM +0900, Tatsuo Ishii wrote: > Why do you think that employing the Unicode code point as the chr() > argument could avoid endianness issues? Are you going to represent > Unicode code point as UCS-4? Then you have to specify the endianness > anyway. (see the UCS-4 s

Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Tom Lane
BTW, I'm sure this was discussed but I forgot the conclusion: should chr(0) throw an error? If we're trying to get rid of embedded-null problems, seems it must. regards, tom lane

Re: [HACKERS] Include Lists for Text Search

2007-09-10 Thread Simon Riggs
On Mon, 2007-09-10 at 10:21 -0400, Tom Lane wrote: > Oleg Bartunov <[EMAIL PROTECTED]> writes: > > On Mon, 10 Sep 2007, Simon Riggs wrote: > >> Can we include that functionality now? > > > This could be realized very easily using dict_strict, which returns > > only known words, and mapping contain

Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Tatsuo Ishii
> Andrew Dunstan <[EMAIL PROTECTED]> writes: > > The reason we are prepared to make an exception for Unicode is precisely > > because the code point maps to an encoding pattern independently of > > architecture, ISTM. > > Right --- there is a well-defined standard for the numerical value of > ea

Re: [HACKERS] Hash index todo list item

2007-09-10 Thread Martijn van Oosterhout
On Sat, Sep 08, 2007 at 06:56:23PM -0400, Mark Mielke wrote: > I think that if the case of >1 entry per hash becomes common enough to > be significant, and the key is stored in the hash, that a btree will > perform equal or better, and there is no point in pursuing such a hash > index model. Th

Re: [HACKERS] A Silly Idea for Vertically-Oriented Databases

2007-09-10 Thread Alvaro Herrera
Mark Mielke wrote: > Simon Riggs wrote: >> ISTM we would be able to do this fairly well if we implemented >> Index-only columns. i.e. columns that don't exist in the heap, only in >> an index. >> Taken to the extreme, all columns could be removed from the heap and >> placed in an index(es). Only t

Re: [HACKERS] integrated tsearch doesn't work with non utf8 database

2007-09-10 Thread Heikki Linnakangas
Tom Lane wrote: > Teodor Sigaev <[EMAIL PROTECTED]> writes: >>> Note the Seq Scan on pg_ts_config_map, with filter on ts_lexize(mapdict, >>> $1). That means that it will call ts_lexize on every dictionary, which >>> will try to load every dictionary. And loading danish_stem dictionary >>> fails in

Re: [HACKERS] A Silly Idea for Vertically-Oriented Databases

2007-09-10 Thread Mark Mielke
Simon Riggs wrote: ISTM we would be able to do this fairly well if we implemented Index-only columns. i.e. columns that don't exist in the heap, only in an index. Taken to the extreme, all columns could be removed from the heap and placed in an index(es). Only the visibility information would

Re: [HACKERS] integrated tsearch doesn't work with non utf8 database

2007-09-10 Thread Tom Lane
Teodor Sigaev <[EMAIL PROTECTED]> writes: >> Note the Seq Scan on pg_ts_config_map, with filter on ts_lexize(mapdict, >> $1). That means that it will call ts_lexize on every dictionary, which >> will try to load every dictionary. And loading danish_stem dictionary >> fails in latin2 encoding, becau

Re: [HACKERS] Hash index todo list item

2007-09-10 Thread Mark Mielke
Kenneth Marshall wrote: On Sun, Sep 02, 2007 at 10:41:22PM -0400, Tom Lane wrote: Kenneth Marshall <[EMAIL PROTECTED]> writes: ... This is the rough plan. Does anyone see anything critical that is missing at this point? Sounds pretty good. Let me brain-dump one item on you: one

Re: [HACKERS] Hash index todo list item

2007-09-10 Thread Jens-Wolfhard Schicke
More random thoughts: - Hash-Indices are best for unique keys, but every table needs a new hash key, which means one more random page access. Is there any way to build multi-_table_ indices? A join might then fetch all table rows with a given unique key after one page fetch for the combined in

Re: [HACKERS] Include Lists for Text Search

2007-09-10 Thread Oleg Bartunov
On Mon, 10 Sep 2007, Tom Lane wrote: Oleg Bartunov <[EMAIL PROTECTED]> writes: On Mon, 10 Sep 2007, Simon Riggs wrote: Can we include that functionality now? This could be realized very easily using dict_strict, which returns only known words, and mapping contains only this dictionary. So,

Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Andrew Dunstan
Tom Lane wrote: Andrew Dunstan <[EMAIL PROTECTED]> writes: Perhaps we're talking at cross purposes. The problem with doing encoding validation in scan.l is that it lacks context. Null bytes are only the tip of the bytea iceberg, since any arbitrary sequence of bytes can be valid

Re: [HACKERS] Include Lists for Text Search

2007-09-10 Thread Tom Lane
Oleg Bartunov <[EMAIL PROTECTED]> writes: > On Mon, 10 Sep 2007, Simon Riggs wrote: >> Can we include that functionality now? > This could be realized very easily using dict_strict, which returns > only known words, and mapping contains only this dictionary. So, > feel free to write it and submit

Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Tom Lane
Andrew Dunstan <[EMAIL PROTECTED]> writes: > Tom Lane wrote: >> Those should be checked already --- if not, the right fix is still to >> fix it there, not in per-datatype code. I think we are OK though, >> eg see "need_transcoding" logic in copy.c. > Well, a little experimentation shows that we c

Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Tom Lane
Andrew Dunstan <[EMAIL PROTECTED]> writes: > The reason we are prepared to make an exception for Unicode is precisely > because the code point maps to an encoding pattern independently of > architecture, ISTM. Right --- there is a well-defined standard for the numerical value of each character i

Re: [HACKERS] Include Lists for Text Search

2007-09-10 Thread Oleg Bartunov
On Mon, 10 Sep 2007, Simon Riggs wrote: On Mon, 2007-09-10 at 16:35 +0400, Oleg Bartunov wrote: On Mon, 10 Sep 2007, Simon Riggs wrote: On Mon, 2007-09-10 at 16:10 +0400, Oleg Bartunov wrote: On Mon, 10 Sep 2007, Simon Riggs wrote: It seems possible to write your own functions to support v

Re: [HACKERS] Include Lists for Text Search

2007-09-10 Thread Oleg Bartunov
On Mon, 10 Sep 2007, Simon Riggs wrote: On Mon, 2007-09-10 at 16:48 +0400, Teodor Sigaev wrote: There are clear indications that indexing too many words is a problem for both GIN and GIST. If people already know what they'll be looking GIN is great, sorry if that sounded negative. GIN doesn

Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Tom Lane
Andrew Dunstan <[EMAIL PROTECTED]> writes: > Perhaps we're talking at cross purposes. > The problem with doing encoding validation in scan.l is that it lacks > context. Null bytes are only the tip of the bytea iceberg, since any > arbitrary sequence of bytes can be valid for a bytea. If you thi

Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Andrew Dunstan
Albe Laurenz wrote: I'd like to repeat my suggestion for chr() and ascii(). Instead of the code point, I'd prefer the actual encoding of the character as argument to chr() and return value of ascii(). [snip] Of course, if it is generally perceived that the code point is more useful than

Re: [HACKERS] Include Lists for Text Search

2007-09-10 Thread Simon Riggs
On Mon, 2007-09-10 at 16:48 +0400, Teodor Sigaev wrote: > > There are clear indications that indexing too many words is a problem > > for both GIN and GIST. If people already know what they'll be looking GIN is great, sorry if that sounded negative. > GIN doesn't depend strongly on number of word

Re: [HACKERS] Include Lists for Text Search

2007-09-10 Thread Simon Riggs
On Mon, 2007-09-10 at 16:35 +0400, Oleg Bartunov wrote: > On Mon, 10 Sep 2007, Simon Riggs wrote: > > > On Mon, 2007-09-10 at 16:10 +0400, Oleg Bartunov wrote: > >> On Mon, 10 Sep 2007, Simon Riggs wrote: > >> > >>> It seems possible to write your own functions to support various > >>> possibiliti

Re: [HACKERS] ispell dictionary broken in CVS HEAD ?

2007-09-10 Thread Heikki Linnakangas
Tom Lane wrote: > I don't know enough about ispell to > understand what its config files look like. (There's a problem of > missing documentation here, too...) Yeah :(. The file format that ispell accepts is kind of ad hoc. It accepts hunspell and ispell and myspell variants, but only a subset of

Re: [HACKERS] Include Lists for Text Search

2007-09-10 Thread Teodor Sigaev
There are clear indications that indexing too many words is a problem for both GIN and GIST. If people already know what they'll be looking GIN doesn't depend strongly on number of words. It has log(N) behaviour for numbers of words because of using B-Tree over words. -- Teodor Sigaev

Re: [HACKERS] Include Lists for Text Search

2007-09-10 Thread Simon Riggs
On Mon, 2007-09-10 at 16:10 +0400, Oleg Bartunov wrote: > On Mon, 10 Sep 2007, Simon Riggs wrote: > > > It seems possible to write your own functions to support various > > possibilities with text search. > > > > One of the more common thoughts is to have a list of words that you > > would like to

Re: [HACKERS] Include Lists for Text Search

2007-09-10 Thread Teodor Sigaev
How does that allow me to limit the number of words to a known list? If all dictionaries return NULL for a token, then this token will not be indexed at all. -- Teodor Sigaev E-mail: [EMAIL PROTECTED] WWW: http:
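
A toy model of the behaviour described in this reply, written in Python rather than against the real text search API. It assumes a dictionary is a function that returns a list of lexemes for a known token and None (standing in for NULL) otherwise; the names `make_include_dict` and `index_terms` are hypothetical. A token for which every dictionary returns None contributes nothing to the index, so a mapping that contains only a known-words dictionary acts as an include list.

```python
from typing import Callable, Iterable, List, Optional

# A dictionary maps a token to its lexemes, or returns None for an unknown token.
Dictionary = Callable[[str], Optional[List[str]]]


def make_include_dict(known_words: Iterable[str]) -> Dictionary:
    """Build a dictionary that recognizes only the given words (hypothetical helper)."""
    known = {word.lower() for word in known_words}

    def lookup(token: str) -> Optional[List[str]]:
        lowered = token.lower()
        return [lowered] if lowered in known else None

    return lookup


def index_terms(tokens: Iterable[str], dictionaries: List[Dictionary]) -> List[str]:
    """Collect lexemes; tokens that no dictionary recognizes are skipped."""
    terms: List[str] = []
    for token in tokens:
        for dictionary in dictionaries:
            lexemes = dictionary(token)
            if lexemes is not None:
                terms.extend(lexemes)
                break  # first dictionary that recognizes the token wins
    return terms


include_list = make_include_dict(["postgres", "database"])
print(index_terms(["Postgres", "is", "a", "database"], [include_list]))
```

Only the two listed words survive in this model; "is" and "a" are dropped because no dictionary in the mapping recognizes them.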

Re: [HACKERS] ispell dictionary broken in CVS HEAD ?

2007-09-10 Thread Teodor Sigaev
Is anyone working on providing basic regression tests for the different dictionary types? Seems like the main stumbling block is providing I'll do some tests for dictionaries, but it will be synthetic dictionary. Original ispell files is rather big, so I'll make rather simple and small one.

Re: [HACKERS] Include Lists for Text Search

2007-09-10 Thread Oleg Bartunov
On Mon, 10 Sep 2007, Simon Riggs wrote: On Mon, 2007-09-10 at 16:10 +0400, Oleg Bartunov wrote: On Mon, 10 Sep 2007, Simon Riggs wrote: It seems possible to write your own functions to support various possibilities with text search. One of the more common thoughts is to have a list of words

Re: [HACKERS] Include Lists for Text Search

2007-09-10 Thread Oleg Bartunov
On Mon, 10 Sep 2007, Simon Riggs wrote: On Mon, 2007-09-10 at 12:58 +0100, Heikki Linnakangas wrote: Simon Riggs wrote: It seems possible to write your own functions to support various possibilities with text search. One of the more common thoughts is to have a list of words that you would li

Re: [HACKERS] Include Lists for Text Search

2007-09-10 Thread Simon Riggs
On Mon, 2007-09-10 at 12:58 +0100, Heikki Linnakangas wrote: > Simon Riggs wrote: > > It seems possible to write your own functions to support various > > possibilities with text search. > > > > One of the more common thoughts is to have a list of words that you > > would like to include, i.e. the

Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Andrew Dunstan
Tom Lane wrote: Andrew Dunstan <[EMAIL PROTECTED]> writes: Tom Lane wrote: In the short run it might be best to do it in scan.l after all. I have not come up with a way of doing that and handling the bytea case. AFAICS we have no realistic choice other than to rej

Re: [HACKERS] integrated tsearch doesn't work with non utf8 database

2007-09-10 Thread Teodor Sigaev
Note the Seq Scan on pg_ts_config_map, with filter on ts_lexize(mapdict, $1). That means that it will call ts_lexize on every dictionary, which will try to load every dictionary. And loading danish_stem dictionary fails in latin2 encoding, because of the problem with the stopword file. Attached

Re: [HACKERS] Include Lists for Text Search

2007-09-10 Thread Oleg Bartunov
On Mon, 10 Sep 2007, Simon Riggs wrote: It seems possible to write your own functions to support various possibilities with text search. One of the more common thoughts is to have a list of words that you would like to include, i.e. the opposite of a stop word list. There are clear indications

Re: [HACKERS] Include Lists for Text Search

2007-09-10 Thread Heikki Linnakangas
Simon Riggs wrote: > It seems possible to write your own functions to support various > possibilities with text search. > > One of the more common thoughts is to have a list of words that you > would like to include, i.e. the opposite of a stop word list. > > There are clear indications that ind

Re: [HACKERS] A Silly Idea for Vertically-Oriented Databases

2007-09-10 Thread Simon Riggs
On Fri, 2007-09-07 at 13:52 -0700, Avery Payne wrote: > So I've been seeing/hearing all of the hoopla over vertical databases > (column stores), and how they'll not only slice bread but also make > toast, etc. I've done some quick searches for past articles on > "C-Store", "Vertica", "Column S

[HACKERS] Include Lists for Text Search

2007-09-10 Thread Simon Riggs
It seems possible to write your own functions to support various possibilities with text search. One of the more common thoughts is to have a list of words that you would like to include, i.e. the opposite of a stop word list. There are clear indications that indexing too many words is a problem

Re: [HACKERS] Hash index todo list item

2007-09-10 Thread Jens-Wolfhard Schicke
--On Samstag, September 08, 2007 18:56:23 -0400 Mark Mielke <[EMAIL PROTECTED]> wrote: Kenneth Marshall wrote: Along with the hypothetical performance wins, the hash index space efficiency would be improved by a similar factor. Obviously, all of these ideas would need to be tested in various work

Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread db
>>> I think the concern is when they use only one slash, like: >>> E'\377\000\377'::bytea >>> which, as I mentioned before, is not correct anyway. > > Wait, why would this be wrong? How would you enter the three byte bytea of > consisting of those three bytes described above? Either as E'\\377\

Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Gregory Stark
"Tom Lane" <[EMAIL PROTECTED]> writes: > Jeff Davis <[EMAIL PROTECTED]> writes: > >> I think the concern is when they use only one slash, like: >> E'\377\000\377'::bytea >> which, as I mentioned before, is not correct anyway. Wait, why would this be wrong? How would you enter the three byte by

Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Albe Laurenz
Tom Lane wrote: >> . for chr() under UTF8, it seems to be generally agreed >> that the argument should represent the codepoint and the >> function should return the correspondingly encoded character. >> If so, possible the argument should be a bigint to >> accommodate the full range of possible cod