Tom Lane wrote:
... for chr() under UTF8, it seems to be generally agreed
that the argument should represent the code point and the
function should return the correspondingly encoded character.
If so, possibly the argument should be a bigint to
accommodate the full range of possible code points.
Tom Lane [EMAIL PROTECTED] writes:
Jeff Davis [EMAIL PROTECTED] writes:
I think the concern is when they use only one slash, like:
E'\377\000\377'::bytea
which, as I mentioned before, is not correct anyway.
Wait, why would this be wrong? How would you enter the three-byte bytea
consisting of those three bytes described above?
Either as
E'\\377\\000\\377'
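The difference between the single- and double-backslash forms can be sketched outside the database (illustrative Python only; a simplified model of the two escape layers, not PostgreSQL's actual scanner or bytea input code):

```python
import re

# Two escape layers in E'\\377\\000\\377'::bytea (simplified model):
# 1. the E'' string-literal scanner turns each doubled backslash into one
# 2. the bytea input function turns each \ooo octal escape into one byte
literal = r"\\377\\000\\377"                      # text between the quotes
after_scanner = literal.replace("\\\\", "\\")     # -> \377\000\377
data = re.sub(r"\\([0-7]{3})",
              lambda m: chr(int(m.group(1), 8)),
              after_scanner).encode("latin-1")
print(data)  # b'\xff\x00\xff'
```

With only single backslashes, the string-literal scanner itself consumes the octal escapes, so the bytea input function receives the raw bytes directly, including an embedded NUL byte, which is exactly the problem under discussion.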
--On Samstag, September 08, 2007 18:56:23 -0400 Mark Mielke
[EMAIL PROTECTED] wrote:
Kenneth Marshall wrote:
Along with the hypothetical performance
wins, the hash index space efficiency would be improved by a similar
factor. Obviously, all of these ideas would need to be tested in
various
It seems possible to write your own functions to support various
possibilities with text search.
One of the more common thoughts is to have a list of words that you
would like to include, i.e. the opposite of a stop word list.
There are clear indications that indexing too many words is a
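The "opposite of a stop word list" idea can be sketched as a plain allow-list filter (hypothetical in-memory tokenizer for illustration, not PostgreSQL's text-search machinery):

```python
# Index only tokens found in an allow-list; drop everything else.
allow = {"postgres", "index", "vacuum"}

def lexize(tokens):
    # Normalize to lower case, then keep only the known words.
    return [t for t in (tok.lower() for tok in tokens) if t in allow]

print(lexize("The Postgres planner chose an index scan".split()))
# ['postgres', 'index']
```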
On Fri, 2007-09-07 at 13:52 -0700, Avery Payne wrote:
So I've been seeing/hearing all of the hoopla over vertical databases
(column stores), and how they'll not only slice bread but also make
toast, etc. I've done some quick searches for past articles on
C-Store, Vertica, Column Store,
Note the Seq Scan on pg_ts_config_map, with filter on ts_lexize(mapdict,
$1). That means that it will call ts_lexize on every dictionary, which
will try to load every dictionary. And loading danish_stem dictionary
fails in latin2 encoding, because of the problem with the stopword file.
Attached
Tom Lane wrote:
Andrew Dunstan [EMAIL PROTECTED] writes:
Tom Lane wrote:
In the short run it might be best to do it in scan.l after all.
I have not come up with a way of doing that and handling the bytea case.
AFAICS we have no realistic choice other than to
Is anyone working on providing basic regression tests for the different
dictionary types? Seems like the main stumbling block is providing
I'll do some tests for dictionaries, but it will be a synthetic dictionary.
The original ispell files are rather big, so I'll make a rather simple and small one.
How does that allow me to limit the number of words to a known list?
If all dictionaries return NULL for a token, then this token will not be indexed at
all.
There are clear indications that indexing too many words is a problem
for both GIN and GIST. If people already know what they'll be looking
GIN doesn't depend strongly on the number of words. It has log(N) behaviour
in the number of words because it uses a B-tree over words.
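The log(N) claim can be sketched with an ordered in-memory word list (a hypothetical model of an ordered entry tree, not GIN's on-disk layout): finding a word is a binary search over the distinct lexemes.

```python
import bisect

# 100,000 distinct lexemes in sorted order; lookup cost grows as log(N).
lexemes = sorted(f"word{i:06d}" for i in range(100_000))

def lookup(word):
    # Binary search: O(log N) comparisons regardless of list size.
    i = bisect.bisect_left(lexemes, word)
    return i < len(lexemes) and lexemes[i] == word

print(lookup("word012345"))  # True
print(lookup("zzz"))         # False
```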
Tom Lane wrote:
I don't know enough about ispell to
understand what its config files look like. (There's a problem of
missing documentation here, too...)
Yeah :(. The file format that ispell accepts is kind of ad hoc. It
accepts hunspell and ispell and myspell variants, but only a subset of
On Mon, 2007-09-10 at 16:48 +0400, Teodor Sigaev wrote:
There are clear indications that indexing too many words is a problem
for both GIN and GIST. If people already know what they'll be looking
GIN is great, sorry if that sounded negative.
GIN doesn't depend strongly on number of words.
Albe Laurenz wrote:
I'd like to repeat my suggestion for chr() and ascii().
Instead of the code point, I'd prefer the actual encoding of
the character as argument to chr() and return value of ascii().
[snip]
Of course, if it is generally perceived that the code point
is more useful
Andrew Dunstan [EMAIL PROTECTED] writes:
Perhaps we're talking at cross purposes.
The problem with doing encoding validation in scan.l is that it lacks
context. Null bytes are only the tip of the bytea iceberg, since any
arbitrary sequence of bytes can be valid for a bytea.
If you think
Andrew Dunstan [EMAIL PROTECTED] writes:
The reason we are prepared to make an exception for Unicode is precisely
because the code point maps to an encoding pattern independently of
architecture, ISTM.
Right --- there is a well-defined standard for the numerical value of
each character in
Andrew Dunstan [EMAIL PROTECTED] writes:
Tom Lane wrote:
Those should be checked already --- if not, the right fix is still to
fix it there, not in per-datatype code. I think we are OK though,
eg see need_transcoding logic in copy.c.
Well, a little experimentation shows that we currently
Oleg Bartunov [EMAIL PROTECTED] writes:
On Mon, 10 Sep 2007, Simon Riggs wrote:
Can we include that functionality now?
This could be realized very easily using dict_strict, which returns
only known words, with a mapping that contains only this dictionary. So,
feel free to write it and submit.
...
More random thoughts:
- Hash indices are best for unique keys, but every table needs a new hash
key, which means one more random page access. Is there any way to build
multi-_table_ indices? A join might then fetch all table rows with a given
unique key after one page fetch for the combined
Kenneth Marshall wrote:
On Sun, Sep 02, 2007 at 10:41:22PM -0400, Tom Lane wrote:
Kenneth Marshall [EMAIL PROTECTED] writes:
... This is the rough plan. Does anyone see anything critical that
is missing at this point?
Sounds pretty good. Let me brain-dump one item on you: one
Simon Riggs wrote:
ISTM we would be able to do this fairly well if we implemented
Index-only columns. i.e. columns that don't exist in the heap, only in
an index.
Taken to the extreme, all columns could be removed from the heap and
placed in an index(es). Only the visibility information would
On Sat, Sep 08, 2007 at 06:56:23PM -0400, Mark Mielke wrote:
I think that if the case of 1 entry per hash becomes common enough to
be significant, and the key is stored in the hash, that a btree will
perform equal or better, and there is no point in pursuing such a hash
index model. This
BTW, I'm sure this was discussed but I forgot the conclusion: should
chr(0) throw an error? If we're trying to get rid of embedded-null
problems, seems it must.
regards, tom lane
On Tue, Sep 11, 2007 at 12:30:51AM +0900, Tatsuo Ishii wrote:
Why do you think that employing the Unicode code point as the chr()
argument could avoid endianness issues? Are you going to represent
Unicode code point as UCS-4? Then you have to specify the endianness
anyway. (see the UCS-4
Tatsuo Ishii wrote:
I don't understand the whole discussion.
Why do you think that employing the Unicode code point as the chr()
argument could avoid endianness issues? Are you going to represent
Unicode code point as UCS-4? Then you have to specify the endianness
anyway. (see the UCS-4
Tom Lane wrote:
BTW, I'm sure this was discussed but I forgot the conclusion: should
chr(0) throw an error? If we're trying to get rid of embedded-null
problems, seems it must.
I think it should, yes.
cheers
andrew
Heikki Linnakangas [EMAIL PROTECTED] writes:
Tom Lane wrote:
Uh, how will that help? AFAICS it still has to call ts_lexize with
every dictionary.
No, ts_lexize is no longer in the seq scan filter, but in the sort key
that's calculated only for those rows that match the filter 'mapcfg=?
AND
I think Teodor's solution is wrong as it stands, because if the subquery
finds matches for mapcfg and maptokentype, but none of those rows
produce a non-null ts_lexize result, it will instead emit one row with a
null result, which is not what should happen.
But concatenation with NULL will have
Andrew Dunstan [EMAIL PROTECTED] writes:
Tom Lane wrote:
BTW, I'm sure this was discussed but I forgot the conclusion: should
chr(0) throw an error?
I think it should, yes.
OK. Looking back, there was also some mention of changing chr's
argument to bigint, but I'd counsel against doing
Hi,
I tried to understand how ts_rank works, but I failed. What does Cover
function do? How does it work? What is the DocRepresentation data
structure like? I can see the definition of the struct, and the
get_docrep function to convert to that format, but by reading those I
can't figure out what
Tom Lane wrote:
OK. Looking back, there was also some mention of changing chr's
argument to bigint, but I'd counsel against doing that. We should not
need it since we only support 4-byte UTF8, hence code points only up to
21 bits (and indeed even 6-byte UTF8 can only have 31-bit code points,
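The bit-width argument above is easy to check (illustrative Python; plain arithmetic on the Unicode ranges mentioned, not PostgreSQL code):

```python
# 4-byte UTF-8 tops out at U+10FFFF, which needs only 21 bits, and even the
# obsolete 6-byte UTF-8 form stopped at 31 bits -- both fit in a signed
# 32-bit int, so chr() does not need a bigint argument.
max_codepoint = 0x10FFFF
print(max_codepoint.bit_length())               # 21
print(len(chr(max_codepoint).encode("utf-8")))  # 4 bytes
print((2**31 - 1).bit_length())                 # 31
```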
Florian G. Pflug [EMAIL PROTECTED] writes:
Currently, we do not assume that either the childXids array or
the xid cache in the proc array is sorted by ascending xid order.
I believe that we could simplify the code, further reduce the locking
requirements, and enable a transaction to
Tom Lane wrote:
Alvaro Herrera [EMAIL PROTECTED] writes:
Maybe we should lower the autovac naptime too, just to make it do some
more stuff (and to see if it breaks something else just because of being
running).
Well, Andrew has committed the pg_regress extension to allow buildfarm
On Mon, Sep 10, 2007 at 11:48:29AM -0400, Tom Lane wrote:
BTW, I'm sure this was discussed but I forgot the conclusion: should
chr(0) throw an error? If we're trying to get rid of embedded-null
problems, seems it must.
It is pointed out on Wikipedia that Java sometimes uses the byte pair C0
80
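The point about Java's representation can be sketched as follows (illustrative Python; Java's "modified UTF-8" writes NUL as the overlong pair C0 80 so encoded strings never contain a zero byte, and strict UTF-8 rejects that pair):

```python
def decodes(b):
    # Return True if the bytes are valid strict UTF-8.
    try:
        b.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(decodes(b"\xc0\x80"))  # False: overlong NUL is invalid strict UTF-8
print(decodes(b"\x00"))      # True: strict UTF-8 NUL is a literal zero byte
```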
FYI, this has been committed by Tom.
---
ITAGAKI Takahiro wrote:
The GucContext of log_autovacuum is PGC_BACKEND in the CVS HEAD,
but should it be PGC_SIGHUP? We cannot modify the variable on-the-fly
because the
Is anyone working on providing basic regression tests for the different
dictionary types? Seems like the main stumbling block is providing
I made some small tests (http://www.sigaev.ru/misc/ispell_samples.tgz). So,
what is the best practice to build it in? Make it installable with the regular
Tom Lane wrote:
Alvaro Herrera [EMAIL PROTECTED] writes:
I'm not sure I follow. Are you suggesting adding a new function,
similar to pg_class_ownercheck, which additionally checks for temp-ness?
No, I was just suggesting adding the check for temp-ness in cluster()
and cluster_rel() where
Avery Payne wrote:
> I thought maybe we can call it COAST, Column-oriented attribute storage
> technique, :-)
I like it. :-) I just wish I would have read this before applying for a project name
at pgfoundry; the current proposal is given
Alvaro Herrera [EMAIL PROTECTED] writes:
I am unsure if I should backpatch to 8.1: the code in cluster.c has
changed, and while it is relatively easy to modify the patch, this is a
rare bug and nobody has reported it in CLUSTER (not many people cluster
temp tables, it seems). Should I patch
I have just noticed that a column txn_start has appeared in
pg_stat_activity since 8.2. It's a good idea, but who chose the name?
We do not use that abbreviation for transaction anywhere else in
Postgres, certainly not in any user-exposed places.
I'm inclined to rename it to xact_start, which is
On Mon, 2007-09-10 at 21:04 -0400, Tom Lane wrote:
I have just noticed that a column txn_start has appeared in
pg_stat_activity since 8.2. It's a good idea, but who chose the name?
Me.
I'm inclined to rename it to xact_start, which is an abbreviation
that we *do* use in the code, and in
Tatsuo Ishii [EMAIL PROTECTED] writes:
If you regard the unicode code point as simply a number, why not
regard the multibyte characters as a number too?
Because there's a standard specifying the Unicode code points *as
numbers*. The mapping from those numbers to UTF8 strings (and other
Tatsuo Ishii wrote:
If you regard the unicode code point as simply a number, why not
regard the multibyte characters as a number too? I mean, since 0xC2A9
= 49833, select chr(49833) should work fine no?
No. The number corresponding to a given byte pattern depends on the
endianness of
Neil Conway [EMAIL PROTECTED] writes:
I personally find xact to be a less intuitive abbreviation of
transaction than txn, but for the sake of consistency, I agree it is
probably better to use xact_start.
Barring other objections, I'll go make this happen.
regards, tom
On Tue, 2007-09-11 at 11:27 +0900, Tatsuo Ishii wrote:
BTW, it strikes me that there is another hole that we need to plug in
this area, and that's the convert() function. Being able to create
a value of type text that is not in the database encoding is simply
broken. Perhaps we could
Tatsuo Ishii wrote:
BTW, it strikes me that there is another hole that we need to plug in
this area, and that's the convert() function. Being able to create
a value of type text that is not in the database encoding is simply
broken. Perhaps we could make it work on bytea instead (providing
a
Andrew Dunstan [EMAIL PROTECTED] writes:
I'm not sure we are going to be able to catch every path by which
invalid data can get into the database in one release. I suspect we
might need two or three goes at this. (I'm just wondering if the
routines that return cstrings are a possible
dugong has been failing contribcheck repeatably for the last day or so,
with a very interesting symptom: CREATE DATABASE is failing with
ERROR: could not fsync segment 0 of relation 1663/40960/41403: No such file or
directory
ERROR: checkpoint request failed
HINT: Consult recent messages in
On Tue, 2007-09-11 at 11:53 +0900, Tatsuo Ishii wrote:
Isn't the collation a locale issue, not an encoding issue? Is there a
ja_JP.UTF-8 that defines the proper order?
I don't think it helps. The point is, he needs a different language's
collation, while PostgreSQL allows only one
Renaming the old thread to more appropriately address the topic:
On Wed, 5 Sep 2007, Kevin Grittner wrote:
Then I would test the new background writer with synchronous commits under
the 8.3 beta, using various settings. The 0.5, 0.7 and 0.9 settings you
recommended for a test are how far from
On Tue, 2007-09-11 at 12:29 +0900, Tatsuo Ishii wrote:
Please show me concrete examples how I could introduce a vulnerability
using this kind of convert() usage.
Try the sequence below. Then, try to dump and then reload the database.
When you try to reload it, you will get an error:
ERROR: