Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread db
>> Try the sequence below. Then, try to dump and then reload the database.
>> When you try to reload it, you will get an error:
>>
>> ERROR:  invalid byte sequence for encoding "UTF8": 0xbd
>
> I know this could be a problem (like chr() with invalid byte pattern).

And that's enough of a problem already. We don't need more problems.

> What I really want to know is, read query something like this:
>
> SELECT * FROM japanese_table ORDER BY convert(japanese_text using
> utf8_to_euc_jp);
>
> could be a problem (I assume we use C locale).

If convert() produces a sequence of bytes that can't be interpreted as a
string in the server encoding, then it's broken. IMHO convert() should
return a bytea value. If we had good encoding/charset support we could do
better, but we can't today.

The above example would work fine if convert() returned a bytea. In the C
locale the string would be compared byte for byte and that's what you get
with bytea values as well.
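
For illustration, a sketch of what the example could look like with a
bytea-returning convert() (the bytea signature and the text-to-bytea cast
here are hypothetical; they describe the proposal, not today's convert()):

SELECT * FROM japanese_table
ORDER BY convert(japanese_text::bytea, 'UTF8', 'EUC_JP');

The sort key would then be a bytea compared bytewise, which is effectively
what the C locale gives you for text anyway.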

Strings are not sequences of bytes that can be interpreted in different
ways; that's what bytea values are. Strings are always in a specific
encoding, and in pg that encoding is fixed to a single one for the whole
cluster at initdb time. We should not confuse text with bytea.

/Dennis


---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Jeff Davis
On Tue, 2007-09-11 at 14:50 +0900, Tatsuo Ishii wrote:
> 
> > On Tue, 2007-09-11 at 12:29 +0900, Tatsuo Ishii wrote:
> > > Please show me concrete examples how I could introduce a
> vulnerability
> > > using this kind of convert() usage.
> > 
> > Try the sequence below. Then, try to dump and then reload the
> database.
> > When you try to reload it, you will get an error:
> > 
> > ERROR:  invalid byte sequence for encoding "UTF8": 0xbd
> 
> I know this could be a problem (like chr() with invalid byte pattern).
> What I really want to know is, read query something like this:
> 
> SELECT * FROM japanese_table ORDER BY convert(japanese_text using
> utf8_to_euc_jp);

I guess I don't quite understand the question.

I agree that ORDER BY convert() must be safe in the C locale, because it
just passes the strings to strcmp(). 

Are you saying that we should not remove convert() until we can support
multiple locales in one database?

If we make convert() operate on bytea and return bytea, as Tom
suggested, would that solve your use case?

Regards,
Jeff Davis



---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Martijn van Oosterhout
On Tue, Sep 11, 2007 at 11:27:50AM +0900, Tatsuo Ishii wrote:
> SELECT * FROM japanese_table ORDER BY convert(japanese_text using 
> utf8_to_euc_jp);
> 
> Without using convert(), he will get random order of data. This is
> because Kanji characters are in random order in UTF-8, while Kanji
> characters are reasonably ordered in EUC_JP.

The usual way to approach this is to make convert return bytea instead
of text. Then your problems vanish. Bytea can still be sorted, but it
won't be treated as a text string and thus does not need to conform to
the requirements of a text string.

Languages like Perl distinguish between "encode", which is text->bytea,
and "decode", which is bytea->text. We've got "convert" for both and that
causes problems.
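
A sketch of what a split pair could look like at the SQL level (hypothetical
signatures, shown only to illustrate the encode/decode distinction):

convert_to(string text, dest_encoding name) returns bytea     -- "encode": text -> bytea
convert_from(bytes bytea, src_encoding name) returns text     -- "decode": bytea -> text

Only the decode direction has to verify that the bytes form a valid string
in the database encoding.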

Have a nice day,
-- 
Martijn van Oosterhout   <[EMAIL PROTECTED]>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to 
> litigate.


signature.asc
Description: Digital signature


Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Tatsuo Ishii


> On Tue, 2007-09-11 at 12:29 +0900, Tatsuo Ishii wrote:
> > Please show me concrete examples how I could introduce a vulnerability
> > using this kind of convert() usage.
> 
> Try the sequence below. Then, try to dump and then reload the database.
> When you try to reload it, you will get an error:
> 
> ERROR:  invalid byte sequence for encoding "UTF8": 0xbd

I know this could be a problem (like chr() with an invalid byte pattern).
What I really want to know is whether a read query something like this:

SELECT * FROM japanese_table ORDER BY convert(japanese_text using 
utf8_to_euc_jp);

could be a problem (I assume we use the C locale).
--
Tatsuo Ishii
SRA OSS, Inc. Japan

> Regards,
>   Jeff Davis
> 
> test=> select version();
> 
> version  
> --
>  PostgreSQL 8.3devel on x86_64-unknown-linux-gnu, compiled by GCC gcc
> (GCC) 4.1.3 20070601 (prerelease) (Debian 4.1.2-12)
> (1 row)
> 
> test=> show lc_collate;
>  lc_collate  
> -
>  en_US.UTF-8
> (1 row)
> 
> test=> create table encoding_test(t text);
> CREATE TABLE
> test=> insert into encoding_test values('初');
> INSERT 0 1
> test=> insert into encoding_test values(convert('初' using
> utf8_to_euc_jp));
> INSERT 0 1
> 
> 

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Ts_rank internals

2007-09-10 Thread Teodor Sigaev

I tried to understand how ts_rank works, but I failed. What does Cover
function do? How does it work? What is the DocRepresentation data
structure like? I can see the definition of the struct, and the
get_docrep function to convert to that format, but by reading those I
can't figure out what the resulting DocRepresentation looks like.
I wonder if we could get rid of the istrue flag in QueryOperand, and use
a local BitmapSet variable instead? It seems wrong to have a temporary
flag that's only used in one function, in a struct that's used everywhere.

It's a variation on the CDR (Cover Density Ranking) algorithms.

Based on the paper Clarke et al., “Relevance Ranking for One to Three Term Queries” 
(http://citeseer.ist.psu.edu/clarke00relevance.html). Sorry, I lost the 
article itself, but maybe Oleg has it. A simple and short description is 
at http://www2002.org/CDROM/refereed/643/node7.html.


We changed the original algorithm to support lexeme weights; details are on Oleg's 
site: http://www.sai.msu.su/~megera/wiki/NewExtentsBasedRanking


The array of DocRepresentation is a representation of the document: it contains only 
the lexemes that appear in both the tsvector and the tsquery, and the lexemes are 
ordered by position, as in the original document. Each DocRepresentation has links to 
the corresponding QueryOperand to optimize query execution during extent search. When 
we enlarge the current extent by one word, we set the istrue flag on the corresponding 
QueryOperand, and executing the tsquery over the cover becomes a very simple task.


It's possible to eliminate the istrue flag, but that requires implementing an algorithm 
that executes the tsquery over a contiguous part of the document, not over the whole document.




--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Jeff Davis
On Tue, 2007-09-11 at 12:29 +0900, Tatsuo Ishii wrote:
> Please show me concrete examples how I could introduce a vulnerability
> using this kind of convert() usage.

Try the sequence below. Then, try to dump and then reload the database.
When you try to reload it, you will get an error:

ERROR:  invalid byte sequence for encoding "UTF8": 0xbd

Regards,
Jeff Davis

test=> select version();

version  
--
 PostgreSQL 8.3devel on x86_64-unknown-linux-gnu, compiled by GCC gcc
(GCC) 4.1.3 20070601 (prerelease) (Debian 4.1.2-12)
(1 row)

test=> show lc_collate;
 lc_collate  
-
 en_US.UTF-8
(1 row)

test=> create table encoding_test(t text);
CREATE TABLE
test=> insert into encoding_test values('初');
INSERT 0 1
test=> insert into encoding_test values(convert('初' using
utf8_to_euc_jp));
INSERT 0 1



---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


[HACKERS] Testing 8.3 LDC vs. 8.2.4 with aggressive BGW

2007-09-10 Thread Greg Smith

Renaming the old thread to more appropriately address the topic:

On Wed, 5 Sep 2007, Kevin Grittner wrote:


Then I would test the new background writer with synchronous commits under
the 8.3 beta, using various settings.  The 0.5, 0.7 and 0.9 settings you
recommended for a test are how far from the LRU end of the cache to look
for dirty pages to write, correct?


This is alluding to the suggestions I gave at 
http://archives.postgresql.org/pgsql-hackers/2007-08/msg00755.php


checkpoint_completion_target has nothing to do with the LRU, so let's step 
back to fundamentals and talk about what it actually does.  The official 
documentation is at 
http://developer.postgresql.org/pgdocs/postgres/wal-configuration.html


As you generate transactions, Postgres puts data into the WAL.  The WAL is 
organized into segments that are typically 16MB each.  Periodically, the 
system hits a checkpoint where the WAL data up to a certain point is 
guaranteed to have been applied to the database, at which point the old 
WAL files aren't needed anymore and can be reused.  These checkpoints are 
generally caused by one of two things happening:


1) checkpoint_segments worth of WAL files have been written
2) more than checkpoint_timeout seconds have passed since the last 
checkpoint


The system doesn't stop working while the checkpoint is happening; it just 
keeps creating new WAL files.  As long as the checkpoint finishes in 
advance of when the next one is required, performance should be 
fine.


In the 8.2 model, processing the checkpoint occurs as fast as data can be 
written to disk.  In 8.3, the writes can be spread out instead.  What 
checkpoint_completion_target does is suggest how far along the system 
should aim to have finished the current checkpoint relative to when the 
next one is expected.


For example, your current system has checkpoint_segments=10.  Assume that 
you have checkpoint_timeout set to a large number such that the 
checkpoints are typically being driven by the number of segments being 
filled (so you get a checkpoint every 10 WAL segments, period).  If 
checkpoint_completion_target was set to 0.5, the expectation is that the 
writes for the currently executing checkpoint would be finished about the 
time that 0.5*10=5 segments of new WAL data had been written.  If you set 
it to 0.9 instead, you'd expect the checkpoint is finishing just about 
when the 9th WAL segment is being written out, which is cutting things a 
bit tight; somewhere around there is the safe upper limit for that 
parameter.


Now, checkpoint_segments=10 is a pretty low setting, but I'm guessing that 
on your current system that's forcing very regular checkpoints, which 
makes each individual checkpoint have less work to do and therefore 
reduces the impact of the spikes you're trying to avoid.  With LDC and 
checkpoint_completion_target, you can make that number much bigger (I 
suggested 50), which means you'll only have 1/5 as many checkpoints 
causing I/O spikes, and each of those checkpoints will have 5X as long to 
potentially spread the writes over.  The main cost is that it will take 
longer to recover if your database crashes, which hopefully is a rare 
event.
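
For reference, the settings being juggled here can be checked from psql; the 
values in the comments are just the ones discussed in this thread, not general 
recommendations:

test=> show checkpoint_segments;            -- e.g. 50 for the 8.3/LDC test
test=> show checkpoint_completion_target;   -- e.g. 0.5, 0.7 or 0.9
test=> show checkpoint_timeout;             -- large enough that segments drive the checkpoints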


Having far fewer checkpoints is obviously a win for your situation, but the 
open question is whether this fashion of spreading them out will reduce 
the I/O spike as effectively as the all-scan background writer in 8.2 has 
been working for you.  This is one aspect that makes your comparison a 
bit tricky.  It's possible that by increasing the segments enough, you'll 
get into a situation where you don't see (m)any of them during your 
testing run of 8.3.  You should try and collect some data on how regularly 
checkpoints are happening during early testing to get an idea if this is a 
possibility.  The usual approach is to set checkpoint_warning to a really 
high number (like the maximum of 3600) and then you'll get a harmless note 
in the logs every time one happens, and that will show you how frequently 
they're happening.  It's kind of important to have an idea how many 
checkpoints you can expect during each test run to put together a fair 
comparison; as you increase checkpoint_segments, you need to adopt a 
mindset that is considering "how many sluggish transactions am I seeing 
per checkpoint?", not how many total per test run.


I have a backport of some of the pg_stat_bgwriter features added to 8.3 
that can be applied to 8.2, which might be helpful for monitoring your test 
benchmarking server (this is most certainly *not* suitable to go onto the 
real one).  You might want to take a look at 
http://www.westnet.com/~gsmith/content/postgresql/perfmon82.htm, which I put 
together specifically to allow easier comparisons of 8.2 and 8.3 in this area.
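
As a concrete example of the kind of checkpoint monitoring involved, the 8.3 
view lets you see how many checkpoints have happened and what triggered them:

select checkpoints_timed, checkpoints_req from pg_stat_bgwriter;

(checkpoints_timed are the ones driven by checkpoint_timeout, checkpoints_req 
the ones requested otherwise, e.g. by filling checkpoint_segments.)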


Are the current shared memory and the 1 GB you suggested enough of a 
spread for these tests?  (At several hours per test in order to get 
meaningful results, I don't want to get into too many p

Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Jeff Davis
On Tue, 2007-09-11 at 11:53 +0900, Tatsuo Ishii wrote:
> > Isn't the collation a locale issue, not an encoding issue? Is there a
> > ja_JP.UTF-8 that defines the proper order?
> 
> I don't think it helps. The point is, he needs different language's
> collation, while PostgreSQL allows only one collation(locale) per
> database cluster.

My thought was: if we had some function that treated a string as a
different locale, it might solve the problem without violating the
database encoding.

Converting to a different encoding, and producing a text result, is the
source of the problem.

Regards,
Jeff Davis


---(end of broadcast)---
TIP 6: explain analyze is your friend


[HACKERS] What is happening on buildfarm member dugong?

2007-09-10 Thread Tom Lane
dugong has been failing contribcheck repeatably for the last day or so,
with a very interesting symptom: CREATE DATABASE is failing with

ERROR:  could not fsync segment 0 of relation 1663/40960/41403: No such file or 
directory
ERROR:  checkpoint request failed
HINT:  Consult recent messages in the server log for details.

I'd think we broke something in fsync cancellation signalling, except
that AFAICS there were no CVS commits at all between the last working
build at 2007-09-09 23:05:01 UTC and the first failure at 2007-09-10
00:45:27 UTC (and even if I got the timezone conversion wrong, there are
no nearby commits in the backend/storage area).  Has this machine had
any system-software updates around then?  Can anyone suggest another
theory?

regards, tom lane

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Tatsuo Ishii
> Tatsuo Ishii <[EMAIL PROTECTED]> writes:
> >> BTW, it strikes me that there is another hole that we need to plug in
> >> this area, and that's the convert() function.  Being able to create
> >> a value of type text that is not in the database encoding is simply
> >> broken.  Perhaps we could make it work on bytea instead (providing
> >> a cast from text to bytea but not vice versa), or maybe we should just
> >> forbid the whole thing if the database encoding isn't SQL_ASCII.
> 
> > Please don't do that. It will break a useful use case of convert().
> 
> The reason we have a problem here is that we've been choosing
> convenience over safety in encoding-related issues.  I wonder if we must
> stoop to having a "strict_encoding_checks" GUC variable to satisfy
> everyone.

Please show me concrete examples of how I could introduce a vulnerability
using this kind of convert() usage.

> > A user has a database encoded in UTF-8. He has English, French,
> > Chinese  and Japanese data in tables. To sort the tables in the
> > language order, he will do like this:
> 
> > SELECT * FROM japanese_table ORDER BY convert(japanese_text using 
> > utf8_to_euc_jp);
> 
> > Without using convert(), he will get random order of data.
> 
> I'd say that *with* convert() he will get a random order of data.  This
> is making a boatload of unsupportable assumptions about the locale and
> encoding of the surrounding database.  There are a lot of bad-encoding
> situations for which strcoll() simply breaks down completely and can't
> even deliver self-consistent answers.
> 
> It might work the way you are expecting if the database uses SQL_ASCII
> encoding and C locale --- and I'd be fine with allowing convert() only
> when the database encoding is SQL_ASCII.

I don't believe that. With the C locale, convert() works fine as I
described.
--
Tatsuo Ishii
SRA OSS, Inc. Japan

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Tom Lane
Tatsuo Ishii <[EMAIL PROTECTED]> writes:
>> BTW, it strikes me that there is another hole that we need to plug in
>> this area, and that's the convert() function.  Being able to create
>> a value of type text that is not in the database encoding is simply
>> broken.  Perhaps we could make it work on bytea instead (providing
>> a cast from text to bytea but not vice versa), or maybe we should just
>> forbid the whole thing if the database encoding isn't SQL_ASCII.

> Please don't do that. It will break a useful use case of convert().

The reason we have a problem here is that we've been choosing
convenience over safety in encoding-related issues.  I wonder if we must
stoop to having a "strict_encoding_checks" GUC variable to satisfy
everyone.

> A user has a database encoded in UTF-8. He has English, French,
> Chinese  and Japanese data in tables. To sort the tables in the
> language order, he will do like this:

> SELECT * FROM japanese_table ORDER BY convert(japanese_text using 
> utf8_to_euc_jp);

> Without using convert(), he will get random order of data.

I'd say that *with* convert() he will get a random order of data.  This
is making a boatload of unsupportable assumptions about the locale and
encoding of the surrounding database.  There are a lot of bad-encoding
situations for which strcoll() simply breaks down completely and can't
even deliver self-consistent answers.

It might work the way you are expecting if the database uses SQL_ASCII
encoding and C locale --- and I'd be fine with allowing convert() only
when the database encoding is SQL_ASCII.

regards, tom lane

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Tom Lane
Andrew Dunstan <[EMAIL PROTECTED]> writes:
> I'm not sure we are going to be able to catch every path by which 
> invalid data can get into the database in one release. I suspect we 
> might need two or three goes at this. (I'm just wondering if the 
> routines that return cstrings are a possible vector).

We need to have a policy that cstrings are in the database encoding.

regards, tom lane

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Andrew Dunstan



Jeff Davis wrote:

On Tue, 2007-09-11 at 11:27 +0900, Tatsuo Ishii wrote:
  

BTW, it strikes me that there is another hole that we need to plug in
this area, and that's the convert() function.  Being able to create
a value of type text that is not in the database encoding is simply
broken.  Perhaps we could make it work on bytea instead (providing
a cast from text to bytea but not vice versa), or maybe we should just
forbid the whole thing if the database encoding isn't SQL_ASCII.
  

Please don't do that. It will break a useful use case of convert().

A user has a database encoded in UTF-8. He has English, French,
Chinese  and Japanese data in tables. To sort the tables in the
language order, he will do like this:

SELECT * FROM japanese_table ORDER BY convert(japanese_text using 
utf8_to_euc_jp);

Without using convert(), he will get random order of data. This is
because Kanji characters are in random order in UTF-8, while Kanji
characters are reasonably ordered in EUC_JP.



Isn't the collation a locale issue, not an encoding issue? Is there a
ja_JP.UTF-8 that defines the proper order?


  


That won't help you much if you have the whole collection of languages mentioned above.

cheers

andrew

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Andrew Dunstan



Tatsuo Ishii wrote:

BTW, it strikes me that there is another hole that we need to plug in
this area, and that's the convert() function.  Being able to create
a value of type text that is not in the database encoding is simply
broken.  Perhaps we could make it work on bytea instead (providing
a cast from text to bytea but not vice versa), or maybe we should just
forbid the whole thing if the database encoding isn't SQL_ASCII.



Please don't do that. It will break a useful use case of convert().

A user has a database encoded in UTF-8. He has English, French,
Chinese  and Japanese data in tables. To sort the tables in the
language order, he will do like this:

SELECT * FROM japanese_table ORDER BY convert(japanese_text using 
utf8_to_euc_jp);

Without using convert(), he will get random order of data. This is
because Kanji characters are in random order in UTF-8, while Kanji
characters are reasonably ordered in EUC_JP.

  


Tatsuo-san,

would not this case be at least as well met by an operator supplying the 
required ordering? The operator of course would not have the danger of 
supplying values that are invalid in the database encoding. Admittedly, 
the user might need several operators for the case you describe.


I'm not sure we are going to be able to catch every path by which 
invalid data can get into the database in one release. I suspect we 
might need two or three goes at this. (I'm just wondering if the 
routines that return cstrings are a possible vector).


cheers

andrew

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Tatsuo Ishii
> On Tue, 2007-09-11 at 11:27 +0900, Tatsuo Ishii wrote:
> > > BTW, it strikes me that there is another hole that we need to plug in
> > > this area, and that's the convert() function.  Being able to create
> > > a value of type text that is not in the database encoding is simply
> > > broken.  Perhaps we could make it work on bytea instead (providing
> > > a cast from text to bytea but not vice versa), or maybe we should just
> > > forbid the whole thing if the database encoding isn't SQL_ASCII.
> > 
> > Please don't do that. It will break a useful use case of convert().
> > 
> > A user has a database encoded in UTF-8. He has English, French,
> > Chinese  and Japanese data in tables. To sort the tables in the
> > language order, he will do like this:
> > 
> > SELECT * FROM japanese_table ORDER BY convert(japanese_text using 
> > utf8_to_euc_jp);
> > 
> > Without using convert(), he will get random order of data. This is
> > because Kanji characters are in random order in UTF-8, while Kanji
> > characters are reasonably ordered in EUC_JP.
> 
> Isn't the collation a locale issue, not an encoding issue? Is there a
> ja_JP.UTF-8 that defines the proper order?

I don't think it helps. The point is, he needs different languages'
collations, while PostgreSQL allows only one collation (locale) per
database cluster.
--
Tatsuo Ishii
SRA OSS, Inc. Japan

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Jeff Davis
On Tue, 2007-09-11 at 11:27 +0900, Tatsuo Ishii wrote:
> > BTW, it strikes me that there is another hole that we need to plug in
> > this area, and that's the convert() function.  Being able to create
> > a value of type text that is not in the database encoding is simply
> > broken.  Perhaps we could make it work on bytea instead (providing
> > a cast from text to bytea but not vice versa), or maybe we should just
> > forbid the whole thing if the database encoding isn't SQL_ASCII.
> 
> Please don't do that. It will break a useful use case of convert().
> 
> A user has a database encoded in UTF-8. He has English, French,
> Chinese  and Japanese data in tables. To sort the tables in the
> language order, he will do like this:
> 
> SELECT * FROM japanese_table ORDER BY convert(japanese_text using 
> utf8_to_euc_jp);
> 
> Without using convert(), he will get random order of data. This is
> because Kanji characters are in random order in UTF-8, while Kanji
> characters are reasonably ordered in EUC_JP.

Isn't the collation a locale issue, not an encoding issue? Is there a
ja_JP.UTF-8 that defines the proper order?

Regards,
Jeff Davis


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] "txn" in pg_stat_activity

2007-09-10 Thread Tom Lane
Neil Conway <[EMAIL PROTECTED]> writes:
> I personally find "xact" to be a less intuitive abbreviation of
> "transaction" than "txn", but for the sake of consistency, I agree it is
> probably better to use "xact_start".

Barring other objections, I'll go make this happen.

regards, tom lane

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Tatsuo Ishii
> Tatsuo Ishii <[EMAIL PROTECTED]> writes:
> > If you regard the unicode code point as simply a number, why not
> > regard the multibyte characters as a number too?
> 
> Because there's a standard specifying the Unicode code points *as
> numbers*.  The mapping from those numbers to UTF8 strings (and other
> representations) is well-defined by the standard.
> 
> > Also I'm wondering what we should do with different
> > backend/frontend encoding combos.
> 
> Nothing.  chr() has always worked with reference to the database
> encoding, and we should keep it that way.

Where is it documented?

> BTW, it strikes me that there is another hole that we need to plug in
> this area, and that's the convert() function.  Being able to create
> a value of type text that is not in the database encoding is simply
> broken.  Perhaps we could make it work on bytea instead (providing
> a cast from text to bytea but not vice versa), or maybe we should just
> forbid the whole thing if the database encoding isn't SQL_ASCII.

Please don't do that. It will break a useful use case of convert().

A user has a database encoded in UTF-8. He has English, French,
Chinese and Japanese data in tables. To sort the tables in each
language's order, he would do something like this:

SELECT * FROM japanese_table ORDER BY convert(japanese_text using 
utf8_to_euc_jp);

Without using convert(), he will get the data in random order. This is
because Kanji characters are in random order in UTF-8, while Kanji
characters are reasonably ordered in EUC_JP.
--
Tatsuo Ishii
SRA OSS, Inc. Japan

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Andrew Dunstan



Tatsuo Ishii wrote:


If you regard the unicode code point as simply a number, why not
regard the multibyte characters as a number too? I mean, since 0xC2A9
= 49833, "select chr(49833)" should work fine no?

  


No. The number corresponding to a given byte pattern depends on the 
endianness of the architecture. That's exactly why we can't sanely use 
the byte pattern of the encoded characters as numbers.


cheers

andrew

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Tom Lane
Tatsuo Ishii <[EMAIL PROTECTED]> writes:
> If you regard the unicode code point as simply a number, why not
> regard the multibyte characters as a number too?

Because there's a standard specifying the Unicode code points *as
numbers*.  The mapping from those numbers to UTF8 strings (and other
representations) is well-defined by the standard.

> Also I'm wondering what we should do with different
> backend/frontend encoding combos.

Nothing.  chr() has always worked with reference to the database
encoding, and we should keep it that way.

BTW, it strikes me that there is another hole that we need to plug in
this area, and that's the convert() function.  Being able to create
a value of type text that is not in the database encoding is simply
broken.  Perhaps we could make it work on bytea instead (providing
a cast from text to bytea but not vice versa), or maybe we should just
forbid the whole thing if the database encoding isn't SQL_ASCII.

regards, tom lane

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Tatsuo Ishii
> Tatsuo Ishii wrote:
> >
> > I don't understand whole discussion.
> >
> > Why do you think that employing the Unicode code point as the chr()
> > argument could avoid endianness issues? Are you going to represent
> > Unicode code point as UCS-4? Then you have to specify the endianness
> > anyway.  (see the UCS-4 standard for more details)
> >   
> 
> The code point is simply a number. The result of chr() will be a text 
> value one char (not one byte) wide, in the relevant database encoding.
>
> A Unicode code point maps to the same Unicode char and hence the same UTF8 encoding 
> pattern regardless of endianness. e.g. U+00a9 is the copyright symbol on 
> all machines. So to get this char in a UTF8 database you could call 
> "select chr(169)" and get back the byte pattern \xC2A9.

If you regard the Unicode code point as simply a number, why not
regard the multibyte characters as a number too? I mean, since 0xC2A9
= 49833, "select chr(49833)" should work fine, no?

Also, I'm wondering what we should do with different
backend/frontend encoding combos. For example, if your database is in
UTF-8 and your client encoding is LATIN2, what integer value should
be passed to chr()? A LATIN2 value or a Unicode code point?

> > Or are you going to represent Unicode point as a character string such
> > as 'U+0259'? Then representing any encoding as a string could avoid
> > endianness issues anyway, and I don't see Unicode code point is any
> > better than others.
> >   
> 
> The argument will be a number, as now.
> 
> > Also I'd like to point out all encodings has its own code point
> > systems as far as I know. For example, EUC-JP has its corresponding
> > code point systems, ASCII, JIS X 0208 and JIS X 0212. So I don't see
> > we can't use "code point" as chr()'s argument for othe encodings(of
> > course we need optional parameter specifying which character set is
> > supposed).
> >   
> 
> Where can I find the tables that map code points (as opposed to 
> encodings) to characters for these others?

You mean a code point table for a character set? The actual standard is not
on the web since it is copyrighted by the Japanese government (you need
to buy it as a book or a PDF file). However, you can find many code
point tables on the web. For example, JIS X 0208 code points can be
found at:

http://www.infonet.co.jp/ueyama/ip/binary/x0208txt.html

(you need to have a Japanese font and set the page encoding to Shift JIS)

BTW, if you want to pass a "code point of a character set" rather than an
encoding value, you need to tell chr() which character set you are
referring to. So we would need two arguments: one for the code point,
the other for the character set specification. What do you think?
--
Tatsuo Ishii
SRA OSS, Inc. Japan

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] "txn" in pg_stat_activity

2007-09-10 Thread Neil Conway
On Mon, 2007-09-10 at 21:04 -0400, Tom Lane wrote:
> I have just noticed that a column "txn_start" has appeared in
> pg_stat_activity since 8.2.  It's a good idea, but who chose the name?

Me.

> I'm inclined to rename it to "xact_start", which is an abbreviation
> that we *do* use in the code, and in some user-visible places, eg,
> pg_stat_database has "xact_commit" and "xact_rollback" columns.

I personally find "xact" to be a less intuitive abbreviation of
"transaction" than "txn", but for the sake of consistency, I agree it is
probably better to use "xact_start".

-Neil



---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


[HACKERS] "txn" in pg_stat_activity

2007-09-10 Thread Tom Lane
I have just noticed that a column "txn_start" has appeared in
pg_stat_activity since 8.2.  It's a good idea, but who chose the name?
We do not use that abbreviation for "transaction" anywhere else in
Postgres, certainly not in any user-exposed places.

I'm inclined to rename it to "xact_start", which is an abbreviation
that we *do* use in the code, and in some user-visible places, eg,
pg_stat_database has "xact_commit" and "xact_rollback" columns.
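
For illustration, under that rename the column would be read as, e.g.
(assuming the rest of the view stays as it is now):

select procpid, xact_start from pg_stat_activity;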

Other suggestions?  If we don't change it before 8.3 beta starts,
we'll be stuck with it ...

regards, tom lane

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] [ADMIN] reindexdb hangs

2007-09-10 Thread Tom Lane
Alvaro Herrera <[EMAIL PROTECTED]> writes:
> I am unsure if I should backpatch to 8.1: the code in cluster.c has
> changed, and while it is relatively easy to modify the patch, this is a
> rare bug and nobody has reported it in CLUSTER (not many people clusters
> temp tables, it seems).  Should I patch only REINDEX?  How far back?

I'd say go as far back as you can conveniently modify the patch for.
This is a potential data-loss bug (even if only for temporary data)
so we ought to take it seriously.

regards, tom lane

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] A Silly Idea for Vertically-Oriented Databases

2007-09-10 Thread Alvaro Herrera
Avery Payne wrote:

> >I thought maybe we can call it COAST, Column-oriented attribute storage 
> technique, :-)
> 
> I like it. :-)  I
> just wish I would have read this before applying for a project name
> at pgfoundry, the current proposal is given as "pg-cstore".

You can email the pgfoundry admins at [EMAIL PROTECTED] and
ask it to be cancelled instead of approved, and resubmit with the other
name.

-- 
Alvaro Herrera   http://www.PlanetPostgreSQL.org/
"In fact, the basic problem with Perl 5's subroutines is that they're not
crufty enough, so the cruft leaks out into user-defined code instead, by
the Conservation of Cruft Principle."  (Larry Wall, Apocalypse 6)

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] [ADMIN] reindexdb hangs

2007-09-10 Thread Alvaro Herrera
Tom Lane wrote:
> Alvaro Herrera <[EMAIL PROTECTED]> writes:
> > I'm not sure I follow.  Are you suggesting adding a new function,
> > similar to pg_class_ownercheck, which additionally checks for temp-ness?
> 
> No, I was just suggesting adding the check for temp-ness in cluster()
> and cluster_rel() where we do pg_class_ownercheck.  We already have the
> rel open there and so it's cheap to do the temp-ness check.

I applied a patch along these lines to HEAD and 8.2.

I am unsure if I should backpatch to 8.1: the code in cluster.c has
changed, and while it is relatively easy to modify the patch, this is a
rare bug and nobody has reported it in CLUSTER (not many people cluster
temp tables, it seems).  Should I patch only REINDEX?  How far back?

-- 
Alvaro Herrera  http://www.amazon.com/gp/registry/5ZYLFMCVHXC
"The eagle never lost so much time, as
when he submitted to learn of the crow." (William Blake)

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] A Silly Idea for Vertically-Oriented Databases

2007-09-10 Thread Avery Payne






>ISTM we would be able to do this fairly well if we implemented
>Index-only columns. i.e. columns that don't exist in the heap, only in
>an index. 

>Taken to the extreme, all columns could be removed from the heap and
>placed in an index(es). Only the visibility information would remain on
>the heap.

So, let me understand this correctly - you want to index the columns and
use the index to reconstruct the data?  Some kind of "implicit" reconstruction?

>Doing this per column would be a big win over vertical databases
>since AFAIK they *have* to do this to *every* column, even if it is not
>beneficial to do so.



I was thinking about something a little more crude - each column being a free-standing
table, but being "viewed" by the client as a single entity, a kind of "data federation".
The idea was that the existing storage mechanism wouldn't be altered, and
with a little sleight of hand, we could extend the mechanism without hampering things
like clustering, future extensions, optimizations, etc.  No changes to MVCC would be
needed, because it would work above and through it.

The idea was that for a "wide" table you target only a few columns, so you could
do a sequential read without the penalty of having to seek to a new offset for each record.
Instead of processing all the columns, you process a single column, which means less
data to read for the same number of records.  That gain starts to slope off
as you specify more and more columns from the table.

Throw in selective indexing (similar to what you were talking about) and suddenly we can reclaim
some of that lost speed.  We'll make a kind of "compressed index", where the key turns into
a hash that points to (is attached to?) a bucket, and the bucket contains the offsets of all the
records that relate to that key.  Other tricks can be employed, such as run-length encoding
entire ranges of offsets, etc., to keep this really, really small.  Really small = faster
to read, using more CPU than I/O.  And I/O is still more of an issue than CPU at this point.

Then again, if you ditch the column altogether and use a "compressed index" to reconstruct
data implicitly, now we're close to what you were talking about (assuming I understand you
correctly and also assuming that PostgreSQL doesn't already do this with indexes).  So, if the
column is indexed, then maybe split it off into a "compressed index", and if not, keep it in the main
table outright?

I guess I really need to think about this a bit more before I start delving into code.



>I thought maybe we can call it COAST, Column-oriented attribute storage technique, :-)

I like it. :-)  I just wish I would have read this before applying for a project name
at pgfoundry; the current proposal is given as "pg-cstore".







Re: [HACKERS] ispell dictionary broken in CVS HEAD ?

2007-09-10 Thread Tom Lane
Teodor Sigaev <[EMAIL PROTECTED]> writes:
>> Is anyone working on providing basic regression tests for the different
>> dictionary types?  Seems like the main stumbling block is providing

> I make some small tests  (http://www.sigaev.ru/misc/ispell_samples.tgz). So, 
> what is better practice to builtin it?  Make it installable with regular 
> procedure into share/tsearch_data or install they only for regression db?

You have to install them as part of the regular installation; pg_regress
can't do it in the "make installcheck" case because it may not have
write permissions on $SHAREDIR.

regards, tom lane

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] ispell dictionary broken in CVS HEAD ?

2007-09-10 Thread Teodor Sigaev

Is anyone working on providing basic regression tests for the different
dictionary types?  Seems like the main stumbling block is providing


I made some small tests (http://www.sigaev.ru/misc/ispell_samples.tgz). So, 
what is the better practice for building them in?  Make them installable with the 
regular procedure into share/tsearch_data, or install them only for the regression db?

--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [HACKERS] GucContext of log_autovacuum

2007-09-10 Thread Bruce Momjian

FYI, this has been committed by Tom.

---

ITAGAKI Takahiro wrote:
> The GucContext of log_autovacuum is PGC_BACKEND in the CVS HEAD,
> but should it be PGC_SIGHUP? We cannot modify the variable on-the-fly
> because the parameter is used only by autovacuum worker processes.
> The similar variables, like autovacuum_vacuum_scale_factor, are
> defined as PGC_SIGHUP.
> 
> 
> Index: src/backend/utils/misc/guc.c
> ===
> --- src/backend/utils/misc/guc.c  (head)
> +++ src/backend/utils/misc/guc.c  (working copy)
> @@ -1552,7 +1552,7 @@
>   },
>  
>   {
> - {"log_autovacuum", PGC_BACKEND, LOGGING_WHAT,
> + {"log_autovacuum", PGC_SIGHUP, LOGGING_WHAT,
>   gettext_noop("Sets the minimum execution time above 
> which autovacuum actions "
>"will be logged."),
>   gettext_noop("Zero prints all actions.  The default is 
> -1 (turning this feature off)."),
> 
> Regards,
> ---
> ITAGAKI Takahiro
> NTT Open Source Software Center
> 
> 
> ---(end of broadcast)---
> TIP 4: Have you searched our list archives?
> 
>http://archives.postgresql.org

-- 
  Bruce Momjian  <[EMAIL PROTECTED]>  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Martijn van Oosterhout
On Mon, Sep 10, 2007 at 11:48:29AM -0400, Tom Lane wrote:
> BTW, I'm sure this was discussed but I forgot the conclusion: should
> chr(0) throw an error?  If we're trying to get rid of embedded-null
> problems, seems it must.

It is pointed out on Wikipedia that Java sometimes uses the byte pair C0
80 to represent the NUL character, to allow it to be embedded in C
strings without tripping anything up. It is, however, technically an
illegal representation (it violates the minimal representation rule) and
is thus rejected by Postgres. I'm not suggesting we copy this, but it does
show there are other ways to deal with this.

Have a nice day,
-- 
Martijn van Oosterhout   <[EMAIL PROTECTED]>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to 
> litigate.


signature.asc
Description: Digital signature


Re: [HACKERS] Maybe some more low-hanging fruit in the latestCompletedXid patch.

2007-09-10 Thread Florian G. Pflug

Tom Lane wrote:

"Florian G. Pflug" <[EMAIL PROTECTED]> writes:

Currently, we do not assume that either the childXids array, nor the xid
cache in the proc array are sorted by ascending xid order. I believe that
we could simplify the code, further reduce the locking requirements, and
enabled a transaction to de-overflow it's xid cache if we assume that those
arrays are in ascending xid order.


"de-overflowing" the cache sounds completely unsafe, as other backends need
that state to determine whether they need to look into pg_subtrans.


We'd only de-overflow if we abort *all* xids that are missing from the
xid cache. And only after marking them as aborted in the clog. If someone
concurrently checks for an overflow, and already sees the new (non-overflowed)
state, then he'll assume the xid is not running if he hasn't found it in
the array. Which is correct - we just aborted it.

Plus, removing the exclusive lock doesn't depend on de-overflowing. It's
just something that seems rather easy to do once the subxid handling is
in a state that allows concurrent removal of entries. If it turns out that
it's not that easy, then I'll just drop the idea again.

I still don't believe you can avoid taking exclusive lock, either; your 
argument here did not address latestCompletedXid.


Sorry, not addressing latestCompletedXid was an oversight :-(.
My point is that we only *need* to advance latestCompletedXid on COMMITs. We do
so for aborts only to avoid running with unnecessarily low xmins after
a transaction ABORT. That corner case can only happen after a toplevel
ABORT, though - aborting subxacts cannot change the xmin, because the
toplevel xact will have a lower xid than any of its subtransactions anyway.

We can therefore just remember the largest assigned xid for a given transaction,
and update latestCompletedXid to that on toplevel commit or abort. That
prevents that corner case too, without updating latestCompletedXid during
subxact abort.


But the main point remains this: there is no evidence whatsoever that these
code paths are sufficiently performance-critical to be worth speeding up by
making the code more fragile.


The gain will be less than that of the locking improvements done so far.
It will be a win for heavy users of plpgsql BEGIN/END/EXCEPTION blocks,
though, I think.

We'll also save some cycles in TransactionIdIsInProgress, because we can
use a binary search, but that's just an added bonus.

I'm currently trying to code up a patch, since it's easier to judge the
correctness of actual code than that of a mere proposal. I'll do some
benchmarking when the patch is done to see if it brings measurable benefits.

greetings, Florian Pflug


---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Are we done with sync-commit-defaults-to-off patch?

2007-09-10 Thread Andrew Dunstan



Tom Lane wrote:

Alvaro Herrera <[EMAIL PROTECTED]> writes:
  

Maybe we should lower the autovac naptime too, just to make it do some
more stuff (and to see if it breaks something else just because of being
running).



Well, Andrew has committed the pg_regress extension to allow buildfarm
animals to set nondefault GUC values.  So I think it'd be sufficient if
a buildfarm owner or two would volunteer to run with small autovac
naptime.

  


What settings do you have in mind for a testing regime?

If anyone wants help on setting this up for a buildfarm member, please 
let me know.


cheers

andrew

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Maybe some more low-hanging fruit in the latestCompletedXid patch.

2007-09-10 Thread Tom Lane
"Florian G. Pflug" <[EMAIL PROTECTED]> writes:
> Currently, we do not assume that either the childXids array, nor
> the xid cache in the proc array are sorted by ascending xid order.
> I believe that we could simplify the code, further reduce the locking
> requirements, and enabled a transaction to de-overflow it's xid cache
> if we assume that those arrays are in ascending xid order.

"de-overflowing" the cache sounds completely unsafe, as other backends
need that state to determine whether they need to look into pg_subtrans.
I still don't believe you can avoid taking exclusive lock, either; your
argument here did not address latestCompletedXid.

But the main point remains this: there is no evidence whatsoever that
these code paths are sufficiently performance-critical to be worth
speeding up by making the code more fragile.

regards, tom lane

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Andrew Dunstan



Tom Lane wrote:

OK.  Looking back, there was also some mention of changing chr's
argument to bigint, but I'd counsel against doing that.  We should not
need it since we only support 4-byte UTF8, hence code points only up to
21 bits (and indeed even 6-byte UTF8 can only have 31-bit code points,
no?).


  


Yes, that was a thinko on my part. I realised that yesterday.

cheers

andrew

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


[HACKERS] Ts_rank internals

2007-09-10 Thread Heikki Linnakangas
Hi,

I tried to understand how ts_rank works, but I failed. What does Cover
function do? How does it work? What is the DocRepresentation data
structure like? I can see the definition of the struct, and the
get_docrep function to convert to that format, but by reading those I
can't figure out what the resulting DocRepresentation looks like.

I wonder if we could get rid of the istrue flag in QueryOperand, and use
a local BitmapSet variable instead? It seems wrong to have a temporary
flag that's only used in one function, in a struct that's used everywhere.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] integrated tsearch doesn't work with non utf8 database

2007-09-10 Thread Tom Lane
Teodor Sigaev <[EMAIL PROTECTED]> writes:
>> I think Teodor's solution is wrong as it stands, because if the subquery
>> finds matches for mapcfg and maptokentype, but none of those rows
>> produce a non-null ts_lexize result, it will instead emit one row with a
>> null result, which is not what should happen.

> But concatenation with NULL will have result NULL, so "Lexized token" column 
> will have NULL as supposed.

Ah, you're right --- objection withdrawn.

regards, tom lane

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Tom Lane
Andrew Dunstan <[EMAIL PROTECTED]> writes:
> Tom Lane wrote:
>> BTW, I'm sure this was discussed but I forgot the conclusion: should
>> chr(0) throw an error?

> I think it should, yes.

OK.  Looking back, there was also some mention of changing chr's
argument to bigint, but I'd counsel against doing that.  We should not
need it since we only support 4-byte UTF8, hence code points only up to
21 bits (and indeed even 6-byte UTF8 can only have 31-bit code points,
no?).

If Tatsuo can find official code-point mappings for any other MB
encodings, +1 on supporting those too.

regards, tom lane

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] integrated tsearch doesn't work with non utf8 database

2007-09-10 Thread Teodor Sigaev

I think Teodor's solution is wrong as it stands, because if the subquery
finds matches for mapcfg and maptokentype, but none of those rows
produce a non-null ts_lexize result, it will instead emit one row with a
null result, which is not what should happen.
But concatenation with NULL yields a NULL result, so the "Lexized token" column 
will have NULL, as it is supposed to.

--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [HACKERS] integrated tsearch doesn't work with non utf8 database

2007-09-10 Thread Tom Lane
"Heikki Linnakangas" <[EMAIL PROTECTED]> writes:
> Tom Lane wrote:
>> Uh, how will that help?  AFAICS it still has to call ts_lexize with
>> every dictionary.

> No, ts_lexize is no longer in the seq scan filter, but in the sort key
> that's calculated only for those rows that match the filter 'mapcfg=?
> AND maptokentype=?'. It is pretty kludgey, though.

Oh, I see: in the original formulation the planner can push the "WHERE
dl.lex IS NOT NULL" clause down into the sorted subquery, and what with
one thing and another that clause ends up getting evaluated first in the
scan filter condition.  We could prevent that by increasing ts_lexize's
procost from 1 to 2 (or so), which might be a good thing anyway since
I suppose it's not especially cheap.  It's still a kludgy solution,
though.
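
For experimentation, the cost adjustment could be tried with something like
the statement below (assuming ts_lexize keeps its (regdictionary, text)
signature; the real change would be to the built-in default cost in pg_proc):

ALTER FUNCTION ts_lexize(regdictionary, text) COST 2;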

I think Teodor's solution is wrong as it stands, because if the subquery
finds matches for mapcfg and maptokentype, but none of those rows
produce a non-null ts_lexize result, it will instead emit one row with a
null result, which is not what should happen.

regards, tom lane

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Andrew Dunstan



Tom Lane wrote:

BTW, I'm sure this was discussed but I forgot the conclusion: should
chr(0) throw an error?  If we're trying to get rid of embedded-null
problems, seems it must.


  


I think it should, yes.

cheers

andrew

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Andrew Dunstan



Tatsuo Ishii wrote:


I don't understand whole discussion.

Why do you think that employing the Unicode code point as the chr()
argument could avoid endianness issues? Are you going to represent
Unicode code point as UCS-4? Then you have to specify the endianness
anyway.  (see the UCS-4 standard for more details)
  


The code point is simply a number. The result of chr() will be a text 
value one char (not one byte) wide, in the relevant database encoding.


A given code point maps to the same Unicode char and hence the same UTF8 
encoding pattern regardless of endianness, e.g. U+00a9 is the copyright symbol 
on all machines. So to get this char in a UTF8 database you could call 
"select chr(169)" and get back the byte pattern \xC2A9.



Or are you going to represent Unicode point as a character string such
as 'U+0259'? Then representing any encoding as a string could avoid
endianness issues anyway, and I don't see Unicode code point is any
better than others.
  


The argument will be a number, as now.


Also I'd like to point out all encodings has its own code point
systems as far as I know. For example, EUC-JP has its corresponding
code point systems, ASCII, JIS X 0208 and JIS X 0212. So I don't see
we can't use "code point" as chr()'s argument for othe encodings(of
course we need optional parameter specifying which character set is
supposed).
  


Where can I find the tables that map code points (as opposed to 
encodings) to characters for these others?


cheers

andrew



---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Martijn van Oosterhout
On Tue, Sep 11, 2007 at 12:30:51AM +0900, Tatsuo Ishii wrote:
> Why do you think that employing the Unicode code point as the chr()
> argument could avoid endianness issues? Are you going to represent
> Unicode code point as UCS-4? Then you have to specify the endianness
> anyway.  (see the UCS-4 standard for more details)

Because the argument to chr() is an integer, which has no endian-ness.
You only get into endian-ness if you look at how you store the
resulting string.

> Also I'd like to point out all encodings has its own code point
> systems as far as I know. For example, EUC-JP has its corresponding
> code point systems, ASCII, JIS X 0208 and JIS X 0212. So I don't see
> we can't use "code point" as chr()'s argument for othe encodings(of
> course we need optional parameter specifying which character set is
> supposed).

Oh, the last discussion on this didn't answer this question. Is there a
standard somewhere that maps integers to characters in EUC-JP? If so,
how can I find out what character 512 is?

Have a nice day,
-- 
Martijn van Oosterhout   <[EMAIL PROTECTED]>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to 
> litigate.




Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Tom Lane
BTW, I'm sure this was discussed but I forgot the conclusion: should
chr(0) throw an error?  If we're trying to get rid of embedded-null
problems, seems it must.

regards, tom lane

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Include Lists for Text Search

2007-09-10 Thread Simon Riggs
On Mon, 2007-09-10 at 10:21 -0400, Tom Lane wrote:
> Oleg Bartunov <[EMAIL PROTECTED]> writes:
> > On Mon, 10 Sep 2007, Simon Riggs wrote:
> >> Can we include that functionality now?
> 
> > This could be realized very easyly using dict_strict, which returns
> > only known words, and mapping contains only this dictionary. So, 
> > feel free to write it and submit.
> 
> ... for 8.4.

I've coded a small patch to allow CaseSensitive synonyms.

  CREATE TEXT SEARCH DICTIONARY my_diction (
 TEMPLATE = biglist,
 DictFile = words,
 CaseSensitive = true
  );

-- 
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com
Index: src/backend/tsearch/dict_synonym.c
===
RCS file: /projects/cvsroot/pgsql/src/backend/tsearch/dict_synonym.c,v
retrieving revision 1.4
diff -c -r1.4 dict_synonym.c
*** src/backend/tsearch/dict_synonym.c	25 Aug 2007 02:29:45 -	1.4
--- src/backend/tsearch/dict_synonym.c	10 Sep 2007 15:14:21 -
***
*** 29,34 
--- 29,35 
  typedef struct
  {
  	int			len;	/* length of syn array */
+ 	bool		case_sensitive;
  	Syn		   *syn;
  } DictSyn;
  
***
*** 83,88 
--- 84,90 
  			   *end = NULL;
  	int			cur = 0;
  	char	   *line = NULL;
+ 	bool		case_sensitive = false;
  
  	foreach(l, dictoptions)
  	{
***
*** 95,100 
--- 97,107 
  	(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
  	 errmsg("unrecognized synonym parameter: \"%s\"",
  			defel->defname)));
+ 
+ 		if (pg_strcasecmp("CaseSensitive", defel->defname) == 0 &&
+ 			pg_strcasecmp("True", defGetString(defel)) == 0)
+ 			case_sensitive = true;
+ 
  	}
  
  	if (!filename)
***
*** 168,173 
--- 175,182 
  	d->len = cur;
  	qsort(d->syn, d->len, sizeof(Syn), compareSyn);
  
+ 	d->case_sensitive = case_sensitive;
+ 
  	PG_RETURN_POINTER(d);
  }
  
***
*** 180,195 
  	Syn			key,
  			   *found;
  	TSLexeme   *res;
  
  	/* note: d->len test protects against Solaris bsearch-of-no-items bug */
  	if (len <= 0 || d->len <= 0)
  		PG_RETURN_POINTER(NULL);
  
! 	key.in = lowerstr_with_len(in, len);
  	key.out = NULL;
  
  	found = (Syn *) bsearch(&key, d->syn, d->len, sizeof(Syn), compareSyn);
! 	pfree(key.in);
  
  	if (!found)
  		PG_RETURN_POINTER(NULL);
--- 189,214 
  	Syn			key,
  			   *found;
  	TSLexeme   *res;
+ 	bool		need_pfree = false;
  
  	/* note: d->len test protects against Solaris bsearch-of-no-items bug */
  	if (len <= 0 || d->len <= 0)
  		PG_RETURN_POINTER(NULL);
  
! 	if (d->case_sensitive)
! 		key.in = in;
! 	else
! 	{
! 		key.in = lowerstr_with_len(in, len);
! 		need_pfree = true;
! 	}
! 
  	key.out = NULL;
  
  	found = (Syn *) bsearch(&key, d->syn, d->len, sizeof(Syn), compareSyn);
! 
! 	if (need_pfree)
! 		pfree(key.in);
  
  	if (!found)
  		PG_RETURN_POINTER(NULL);

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Tatsuo Ishii
> Andrew Dunstan <[EMAIL PROTECTED]> writes:
> > The reason we are prepared to make an exception for Unicode is precisely 
> > because the code point maps to an encoding pattern independently of 
> > architecture, ISTM.
> 
> Right --- there is a well-defined standard for the numerical value of
> each character in Unicode.  And it's also clear what to do in
> single-byte encodings.  It's not at all clear what the representation
> ought to be for other multibyte encodings.  A direct transliteration
> of the byte sequence not only has endianness issues, but will have
> a weird non-dense set of valid values because of the restrictions on
> valid multibyte characters.
> 
> Given that chr() has never before behaved sanely for multibyte values at
> all, extending it to Unicode code points is a reasonable extension,
> and throwing error for other encodings is reasonable too.  If we ever do
> come across code-point standards for other encodings we can adopt 'em at
> that time.

I don't understand this whole discussion.

Why do you think that employing the Unicode code point as the chr()
argument could avoid endianness issues? Are you going to represent the
Unicode code point as UCS-4? Then you have to specify the endianness
anyway.  (see the UCS-4 standard for more details)

Or are you going to represent the Unicode code point as a character string
such as 'U+0259'? Then representing any encoding as a string could avoid
endianness issues anyway, and I don't see how a Unicode code point is any
better than the others.

Also, I'd like to point out that all encodings have their own code point
systems as far as I know. For example, EUC-JP has its corresponding
code point systems: ASCII, JIS X 0208 and JIS X 0212. So I don't see why
we can't use a "code point" as chr()'s argument for other encodings (of
course we'd need an optional parameter specifying which character set is
intended).
--
Tatsuo Ishii
SRA OSS, Inc. Japan

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Hash index todo list item

2007-09-10 Thread Martijn van Oosterhout
On Sat, Sep 08, 2007 at 06:56:23PM -0400, Mark Mielke wrote:
> I think that if the case of >1 entry per hash becomes common enough to 
> be significant, and the key is stored in the hash, that a btree will 
> perform equal or better, and there is no point in pursuing  such a hash 
> index model. This is where we are today.

There is the point that if a user does an UPDATE of a row without
changing the key your index will have to store entries with the same
hash. If your goal is mostly write-once tables, then that's cool, but
otherwise you're going to have to find a way of dealing with that. It's
not just going to happen a lot, it is going to be common, even for
unique indexes.

Presumably HOT will help with this, but that relies on all index columns
not to change.

The major benefits will mostly come from not storing the key at all, I
think.

Have a nice day,
-- 
Martijn van Oosterhout   <[EMAIL PROTECTED]>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to 
> litigate.




Re: [HACKERS] A Silly Idea for Vertically-Oriented Databases

2007-09-10 Thread Alvaro Herrera
Mark Mielke wrote:
> Simon Riggs wrote:
>> ISTM we would be able to do this fairly well if we implemented
>> Index-only columns. i.e. columns that don't exist in the heap, only in
>> an index. 
>> Taken to the extreme, all columns could be removed from the heap and
>> placed in an index(es). Only the visibility information would remain on
>> the heap.
>>   
> Wouldn't the extreme be - even the visibility information is in the index? 

Maybe we could extract the visibility information and put it in a
storage separate from the heap.  A seqscan would be a bit slower (have
to check the heap and the visibility storage) but some index scans could
be faster (can check the index and visibility storage, skipping the
heap), and updates would be faster too (append into the heap, update
visibility storage).  This would allow something closer to index-only
scans.  Of course, the key is to allow efficient lookup of visibility
info given a TID.

Has this been explored before?

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] integrated tsearch doesn't work with non utf8 database

2007-09-10 Thread Heikki Linnakangas
Tom Lane wrote:
> Teodor Sigaev <[EMAIL PROTECTED]> writes:
>>> Note the Seq Scan on pg_ts_config_map, with filter on ts_lexize(mapdict,
>>> $1). That means that it will call ts_lexize on every dictionary, which
>>> will try to load every dictionary. And loading danish_stem dictionary
>>> fails in latin2 encoding, because of the problem with the stopword file.
> 
>> Attached patch should fix it, I hope.
> 
> Uh, how will that help?  AFAICS it still has to call ts_lexize with
> every dictionary.

No, ts_lexize is no longer in the seq scan filter, but in the sort key
that's calculated only for those rows that match the filter 'mapcfg=?
AND maptokentype=?'. It is pretty kludgey, though. The planner could
choose another plan, that fails, if the statistics were different.
Rewriting the function in C would be a more robust fix.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] A Silly Idea for Vertically-Oriented Databases

2007-09-10 Thread Mark Mielke

Simon Riggs wrote:

ISTM we would be able to do this fairly well if we implemented
Index-only columns. i.e. columns that don't exist in the heap, only in
an index. 


Taken to the extreme, all columns could be removed from the heap and
placed in an index(es). Only the visibility information would remain on
the heap.
  
Wouldn't the extreme be - even the visibility information is in the 
index? :-)


Cheers,
mark

--
Mark Mielke <[EMAIL PROTECTED]>

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] integrated tsearch doesn't work with non utf8 database

2007-09-10 Thread Tom Lane
Teodor Sigaev <[EMAIL PROTECTED]> writes:
>> Note the Seq Scan on pg_ts_config_map, with filter on ts_lexize(mapdict,
>> $1). That means that it will call ts_lexize on every dictionary, which
>> will try to load every dictionary. And loading danish_stem dictionary
>> fails in latin2 encoding, because of the problem with the stopword file.

> Attached patch should fix it, I hope.

Uh, how will that help?  AFAICS it still has to call ts_lexize with
every dictionary.

regards, tom lane

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Hash index todo list item

2007-09-10 Thread Mark Mielke

Kenneth Marshall wrote:

On Sun, Sep 02, 2007 at 10:41:22PM -0400, Tom Lane wrote:
  

Kenneth Marshall <[EMAIL PROTECTED]> writes:


... This is the rough plan. Does anyone see anything critical that
is missing at this point?
  

Sounds pretty good.  Let me brain-dump one item on you: one thing that
hash currently has over btree is the ability to handle index items up
to a full page.  Now, if you go with a scheme that only stores hash
codes and not the underlying data, you can not only handle that but
improve on it; but if you reduce the bucket size and don't remove the
data, it'd be a step backward.  The idea I had about dealing with that
was to only reduce the size of primary buckets --- if it's necessary to
add overflow space to a bucket, the overflow units are still full pages.
So an index tuple up to a page in size can always be accommodated by
adding an overflow page to the bucket.

Just a thought, but AFAIR it's not in the archives anywhere.

regards, tom lane



I was thinking about this some more, and it strikes me that we can
keep the page size = bucket size = overflow size in the new scheme
of storing the hash value in the index and still reduce the effective
bucket size. Let's make the new size the (page size / 2^n) where n
is chosen appropriately. Then when we store the value in the page we
simply use n bits of the hash to determine which sub-piece of the
page to use to actually store the value. We may need to add a multiplier
to adjust the decision to split the page based on the mini-page. This
should allow us to much more densely pack the pages and approach the
1 item per bucket. This could easily shrink the size of the index by
a factor of 2^n. Let me know what you think.
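
As a purely illustrative sketch of that arithmetic (assuming the bucket count
is a power of two, as in a linear hash table; none of these names exist in the
code): the low-order bits keep choosing the bucket page as before, and the
next n bits choose the sub-piece of that page. With n = 3, for instance, an 8K
page is treated as eight 1K mini-buckets.

#include <stdint.h>

typedef struct
{
	uint32_t	bucket;			/* which primary bucket page */
	uint32_t	mini_bucket;	/* which 1/2^n slice of that page */
} hash_slot;

static hash_slot
locate_slot(uint32_t hash, uint32_t nbuckets, unsigned n)
{
	hash_slot	slot;

	/* low-order bits pick the bucket, exactly as today */
	slot.bucket = hash & (nbuckets - 1);

	/* the next n bits pick the mini-bucket within the page */
	slot.mini_bucket = (hash / nbuckets) & ((1u << n) - 1);

	return slot;
}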
  
My personal opinion is that something like this is required to take best 
advantage of hashes. I didn't respond immediately because I don't have 
advice on what the best implementation would look like.


I have also been wondering about the effect of a hash index that 
includes multiple values to the same key (either a non-unique key, or 
different tuples from the same key due to table updates). I started by 
thinking that the maximum number of hash entries per hash bucket should 
be kept to 2 or 3 to prevent reduction in performance to that of btree, 
but I think this doesn't work if multiple tuples can have the same key. 
Unless the map is: hash value =1:1=> index page =1:1=> hash bucket =1:N=> 
hash value =1:M=> tuples. Guarantee that N is small (either <= 2 or <= 4 
depending on performance evaluation) by re-hashing if N ever becomes > 2 
or > 4. Choose a sparse hash. Let 1:M be indefinite? Also, optimize the 
1:M = 1:1 case, such that the value can be inline?


For most cases, I would think the above model would make it cheap to 
determine if the key does not exist, as well as the 1:1 (hash value:key) 
case requiring a single page lookup. Update performance would suffer, 
but lookup should be faster than btree in these cases, as btree often 
requires > 1 index page scan.


Cheers,
mark

--
Mark Mielke <[EMAIL PROTECTED]>



Re: [HACKERS] Hash index todo list item

2007-09-10 Thread Jens-Wolfhard Schicke

More random thoughts:

- Hash-Indices are best for unique keys, but every table needs a new hash 
key, which means one more random page access. Is there any way to build 
multi-_table_ indices? A join might then fetch all table rows with a given 
unique key after one page fetch for the combined index.


- Hashes with trivial hash-functions (like identity) can also return rows 
in a desired order.


- Is there a case where sequentially scanning a hash-index is useful? I 
can't find any, but maybe somebody else has a use-case.


- What about HashJoins when the referenced tables have hash-indices?

- What about hash-indices where entries are inserted for multiple columns. 
Assume a table like this:


CREATE TABLE link (obj_id1 INT4, obj_id2 INT4);

and a query like

SELECT * FROM link WHERE ? IN (obj_id1, obj_id2);

or some join using a similar condition. It might be a nice thing to insert 
entries at both HASH(obj_id1) and HASH(obj_id2), which would eliminate the 
need to check in two indices and do a bitmap OR. OTOH it might not be 
faster in any significant use cases because who'd need a link table with 
nearly unique linked objects?


- In cases where the distribution of the hash-function is good, but a small 
and relatively even number of rows exist for each key (like it might be the 
case in the above example), it might be nice to reserve a given amount of 
same-key row entries in each bucket, and hold a fill-count at the front of 
it. That would avoid costly page fetches after each collision. You'd create 
a hash-index with n-buckets, each m-elements large. When the bucket is 
full, the usual collision handling continues.


- About hash enlargement: What about always using only the first k bits of 
each hash value? When you find that the hash is "quite-full" (however that 
is defined and detected), k is increased by one, effectively doubling the 
hash size. New entries are then written as usual, while retrieving old 
entries requires testing at the k-bit position first and, on a miss, also 
at the (k-1)-bit position and so forth. To limit this search, some 
background process could, after analyzing the index, move old entries to 
their now-correct k-bit positions and increment some "min-k" value once all 
old entries have been moved. After the hash has been enlarged, the index 
would run at roughly half speed for some time. Additionally, an entry that 
is found only at its old position during a lookup could also be inserted at 
the new position. A special "miss" entry at the new position doesn't help 
when nothing is found, because the old positions will usually still hold 
some data even once k bits are in use.
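
A rough sketch of that probe order, under the stated assumptions (k < 32,
buckets numbered by the low-order hash bits; the probe callback is a stand-in
for the real bucket search, not an existing function):

#include <stdbool.h>
#include <stdint.h>

typedef bool (*bucket_probe_fn) (uint32_t bucket, uint32_t hash);

/*
 * Look for "hash" when the table currently uses k bits, falling back to
 * older, shorter masks down to min_k, below which a background process is
 * assumed to have migrated every entry already.
 */
static bool
probe_with_fallback(uint32_t hash, unsigned k, unsigned min_k,
					bucket_probe_fn contains)
{
	unsigned	bits;

	for (bits = k;; bits--)
	{
		uint32_t	bucket = hash & ((1u << bits) - 1);

		if (contains(bucket, hash))
			return true;
		if (bits == min_k || bits == 0)
			return false;
	}
}

The "half speed for some time" cost is visible here: right after k is bumped,
most lookups miss at the k-bit position and pay for the extra probe at k-1.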



---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Include Lists for Text Search

2007-09-10 Thread Oleg Bartunov

On Mon, 10 Sep 2007, Tom Lane wrote:


Oleg Bartunov <[EMAIL PROTECTED]> writes:

On Mon, 10 Sep 2007, Simon Riggs wrote:

Can we include that functionality now?



This could be realized very easily using dict_strict, which returns
only known words, and mapping contains only this dictionary. So,
feel free to write it and submit.


... for 8.4.


It can be just a contrib module. There are several useful dictionaries
we need to port.

Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Andrew Dunstan



Tom Lane wrote:

Andrew Dunstan <[EMAIL PROTECTED]> writes:
  

Perhaps we're talking at cross purposes.



  
The problem with doing encoding validation in scan.l is that it lacks 
context. Null bytes are only the tip of the bytea iceberg, since any 
arbitrary sequence of bytes can be valid for a bytea.



If you think that, then we're definitely talking at cross purposes.
I assert we should require the post-scanning value of a string literal
to be valid in the database encoding.  If you want to produce an
arbitrary byte sequence within a bytea value, the way to get there is
for the bytea input function to do the de-escaping, not for the string
literal parser to do it.
  



[looks again ... thinks ...]

Ok, I'm sold.


The only reason I was considering not doing it in scan.l is that
scan.l's behavior ideally shouldn't depend on any changeable variables.
But until there's some prospect of database_encoding actually being
mutable at run time, there's not much point in investing a lot of sweat
on that either.
  


Agreed. This would just be one item of many to change if/when we ever 
come to that.

Instead, we have to mess with an unknown number of UDTs ...


  


We're going to have that danger anyway, aren't we, unless we check the 
encoding validity of the result of every UDF that returns some text 
type? I'm not going to lose sleep over something I can't cure but the 
user can - what concerns me is that our own code ensures data 
integrity, including for encoding.



Anyway, it looks to me like we have the following items to do:

. add validity checking to the scanner
. fix COPY code
. fix chr()
. improve efficiency of validity checks, at least for UTF8


cheers

andrew


---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Include Lists for Text Search

2007-09-10 Thread Tom Lane
Oleg Bartunov <[EMAIL PROTECTED]> writes:
> On Mon, 10 Sep 2007, Simon Riggs wrote:
>> Can we include that functionality now?

> This could be realized very easyly using dict_strict, which returns
> only known words, and mapping contains only this dictionary. So, 
> feel free to write it and submit.

... for 8.4.

regards, tom lane

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Tom Lane
Andrew Dunstan <[EMAIL PROTECTED]> writes:
> Tom Lane wrote:
>> Those should be checked already --- if not, the right fix is still to
>> fix it there, not in per-datatype code.  I think we are OK though,
>> eg see "need_transcoding" logic in copy.c.

> Well, a little experimentation shows that we currently are not OK:
> in foo.data:
> \366\66

I looked at this and found that the problem is that
CopyReadAttributesText() does backslash conversions and doesn't validate
the result.

My proposal at this point is that both scan.l and CopyReadAttributesText
need to guarantee that the result of de-escaping is valid in the
database encoding: they should reject \0 and check validity of multibyte
characters.  You could optimize this fairly well (important in the COPY
case) by not redoing verifymbstr unless you'd seen at least one
numerical backslash-escape that set the high bit.
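
In outline it could look like the following (a hedged sketch only; verify_mb
is a stand-in for the backend's verifymbstr machinery, not the real copy.c or
scan.l code):

#include <stdbool.h>
#include <stddef.h>

/*
 * Check a fully de-escaped literal or COPY attribute: embedded nulls are
 * rejected outright, and the (relatively expensive) multibyte check runs
 * only when some numeric backslash escape produced a high-bit byte, which
 * is the only way de-escaping can manufacture an invalid sequence out of
 * input that was already validated.
 */
static bool
deescaped_is_valid(const char *buf, size_t len, bool saw_high_bit_escape,
				   bool (*verify_mb) (const char *, size_t))
{
	size_t		i;

	/* embedded nulls are never allowed in text */
	for (i = 0; i < len; i++)
		if (buf[i] == '\0')
			return false;

	/* redo the multibyte check only when an escape could have broken it */
	if (saw_high_bit_escape && !verify_mb(buf, len))
		return false;

	return true;
}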

regards, tom lane

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Tom Lane
Andrew Dunstan <[EMAIL PROTECTED]> writes:
> The reason we are prepared to make an exception for Unicode is precisely 
> because the code point maps to an encoding pattern independently of 
> architecture, ISTM.

Right --- there is a well-defined standard for the numerical value of
each character in Unicode.  And it's also clear what to do in
single-byte encodings.  It's not at all clear what the representation
ought to be for other multibyte encodings.  A direct transliteration
of the byte sequence not only has endianness issues, but will have
a weird non-dense set of valid values because of the restrictions on
valid multibyte characters.
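
A small standalone C illustration of the endianness half of that point (the
character chosen is arbitrary, just some two-byte EUC-JP sequence):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
	unsigned char mb[2] = {0xA4, 0xA2};	/* a two-byte EUC-JP character */
	uint16_t	as_int;

	memcpy(&as_int, mb, sizeof(as_int));

	/* prints 0xA2A4 on little-endian machines, 0xA4A2 on big-endian ones */
	printf("byte pattern read as an integer: 0x%04X\n", (unsigned) as_int);
	return 0;
}

And on top of that the numeric domain is full of holes: in EUC-JP both bytes
of a JIS X 0208 character must lie in 0xA1-0xFE, so most 16-bit integers do
not correspond to any character at all.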

Given that chr() has never before behaved sanely for multibyte values at
all, extending it to Unicode code points is a reasonable extension,
and throwing error for other encodings is reasonable too.  If we ever do
come across code-point standards for other encodings we can adopt 'em at
that time.

regards, tom lane

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Include Lists for Text Search

2007-09-10 Thread Oleg Bartunov

On Mon, 10 Sep 2007, Simon Riggs wrote:


On Mon, 2007-09-10 at 16:35 +0400, Oleg Bartunov wrote:

On Mon, 10 Sep 2007, Simon Riggs wrote:


On Mon, 2007-09-10 at 16:10 +0400, Oleg Bartunov wrote:

On Mon, 10 Sep 2007, Simon Riggs wrote:


It seems possible to write your own functions to support various
possibilities with text search.

One of the more common thoughts is to have a list of words that you
would like to include, i.e. the opposite of a stop word list.

There are clear indications that indexing too many words is a problem
for both GIN and GIST. If people already know what they'll be looking
for and what they will never be looking for, it seems easier to supply
that list up front, rather than hide it behind lots of hand-crafted
code.

Can we include that functionality now?


This could be realized very easily using dict_strict, which returns
only known words, and mapping contains only this dictionary. So,
feel free to write it and submit.


So there isn't one yet, but you think it will be easy to write and that
we should call it dict_strict?


we have dict_synonym already and if your list is not big you'll be happy.


So I need to do something like

CREATE TEXT SEARCH DICTIONARY my_diction (
   template = snowball,
   synonym = include_only_these_words
);

which will then look for a file called include_only_these_words.syn?

I would prefer to be able to do something like this

CREATE TEXT SEARCH DICTIONARY my_diction (
   template = snowball,
   include = justthese
);
...which makes more sense to anyone reading it
and I also want to make the comparison case insensitive.

Would it be better to
1. include a new dictionary file (dict_strict, as you suggest)
2. a) allow case sensitivity as another option in dictionaries
  b) allow "include" as another word for "stoplist", but with the
meaning reversed?

e.g.

CREATE TEXT SEARCH DICTIONARY my_diction (
   template = snowball,
   include = justthese,
   case_sensitive = true
);


No, you need to write a new template, which works efficiently with
big lists and supports case-insensitive comparison.

 CREATE TEXT SEARCH TEMPLATE biglist (
  .
 );

 CREATE TEXT SEARCH DICTIONARY my_diction (
TEMPLATE = biglist,
DictFile = words,
case_sensitive =  true
 );



Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Include Lists for Text Search

2007-09-10 Thread Oleg Bartunov

On Mon, 10 Sep 2007, Simon Riggs wrote:


On Mon, 2007-09-10 at 16:48 +0400, Teodor Sigaev wrote:

There are clear indications that indexing too many words is a problem
for both GIN and GIST. If people already know what they'll be looking


GIN is great, sorry if that sounded negative.


GIN doesn't depend strongly on number of words. It has log(N) behaviour for
numbers of words because of using B-Tree over words.


log(N) in the number of distinct words, but every word you index results
in an index insert, so if we index more words than we need then the
insert rate will go down.


yes, there is room to improve support of very long posting lists


Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Tom Lane
Andrew Dunstan <[EMAIL PROTECTED]> writes:
> Perhaps we're talking at cross purposes.

> The problem with doing encoding validation in scan.l is that it lacks 
> context. Null bytes are only the tip of the bytea iceberg, since any 
> arbitrary sequence of bytes can be valid for a bytea.

If you think that, then we're definitely talking at cross purposes.
I assert we should require the post-scanning value of a string literal
to be valid in the database encoding.  If you want to produce an
arbitrary byte sequence within a bytea value, the way to get there is
for the bytea input function to do the de-escaping, not for the string
literal parser to do it.

The current situation where there is overlapping functionality is a bit
unfortunate, but once standard_conforming_strings is on by default,
it'll get a lot easier to work with.  I'm not eager to contort APIs
throughout the backend in order to produce a more usable solution for
the standard_conforming_strings = off case, given the expected limited
lifespan of that usage.

The only reason I was considering not doing it in scan.l is that
scan.l's behavior ideally shouldn't depend on any changeable variables.
But until there's some prospect of database_encoding actually being
mutable at run time, there's not much point in investing a lot of sweat
on that either.

> I still don't see why it's OK for us to do validation from the foo_recv 
> functions but not the corresponding foo_in functions.

Basically, a CSTRING handed to an input function should already have
been encoding-validated by somebody.  The most obvious reason why this
must be so is the embedded-null problem, but in general it will already
have been validated (typically as part of a larger string such as the
whole SQL query or whole COPY data line), and doing that over is
pointless and expensive.  On the other hand, the entire point of a recv
function is that it gets raw data that no one else in the backend knows
the format of; so if the data is to be considered textual, the recv
function has to be the one that considers it so and invokes appropriate
conversion or validation.
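
For a user-defined type the division of labour would look roughly like this
(a sketch under that convention; the type name and the trivial verifier are
stand-ins, not backend APIs -- in the backend the real check would go through
the verifymbstr code mentioned elsewhere in this thread):

#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* stand-in for the server's encoding verifier; here it only rejects nulls */
static bool
verify_encoding(const char *s, size_t len)
{
	size_t		i;

	for (i = 0; i < len; i++)
		if (s[i] == '\0')
			return false;
	return true;
}

/* "in" side: the C string was already validated upstream, as part of the
 * query string or COPY line it was cut from, so no re-check is needed */
static char *
mytype_in(const char *cstring)
{
	return strdup(cstring);
}

/* "recv" side: raw bytes off the wire that nobody has inspected yet, so
 * textual content must be verified (or converted) right here */
static char *
mytype_recv(const char *raw, size_t len)
{
	char	   *result;

	if (!verify_encoding(raw, len))
		return NULL;			/* the real code would raise an error */

	result = malloc(len + 1);
	if (result == NULL)
		return NULL;
	memcpy(result, raw, len);
	result[len] = '\0';
	return result;
}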

The reason backslash escapes in string literals are a problem is that
they can produce incorrect-encoding results from what had been a
validated string.

>  At least in the 
> short term that would provide us with fairly complete protection against 
> accepting invalidly encoded data into the database, once we fix up 
> chr(), without having to mess with the scanner, parser, COPY code etc. 

Instead, we have to mess with an unknown number of UDTs ...

regards, tom lane

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Andrew Dunstan



Albe Laurenz wrote:

I'd like to repeat my suggestion for chr() and ascii().

Instead of the code point, I'd prefer the actual encoding of
the character as argument to chr() and return value of ascii().


  

[snip]

Of course, if it is generally perceived that the code point
is more useful than the encoding, then Oracle compliance
is probably secondary.


  


Last time this was discussed, you were the only person arguing for that 
behaviour, IIRC.


And frankly, I don't know how to do it sanely anyway. A character 
encoding has a fixed byte pattern, but a given byte pattern doesn't have 
a single universal number value. I really don't think we want to have 
the value of chr(n) depend on the endianness of the machine, do we?


The reason we are prepared to make an exception for Unicode is precisely 
because the code point maps to an encoding pattern independently of 
architecture, ISTM.


cheers

andrew

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Include Lists for Text Search

2007-09-10 Thread Simon Riggs
On Mon, 2007-09-10 at 16:48 +0400, Teodor Sigaev wrote:
> > There are clear indications that indexing too many words is a problem
> > for both GIN and GIST. If people already know what they'll be looking

GIN is great, sorry if that sounded negative.

> GIN doesn't depend strongly on number of words. It has log(N) behaviour for 
> numbers of words because of using B-Tree over words.

log(N) in the number of distinct words, but every word you index results
in an index insert, so if we index more words than we need then the
insert rate will go down.

-- 
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Include Lists for Text Search

2007-09-10 Thread Simon Riggs
On Mon, 2007-09-10 at 16:35 +0400, Oleg Bartunov wrote:
> On Mon, 10 Sep 2007, Simon Riggs wrote:
> 
> > On Mon, 2007-09-10 at 16:10 +0400, Oleg Bartunov wrote:
> >> On Mon, 10 Sep 2007, Simon Riggs wrote:
> >>
> >>> It seems possible to write your own functions to support various
> >>> possibilities with text search.
> >>>
> >>> One of the more common thoughts is to have a list of words that you
> >>> would like to include, i.e. the opposite of a stop word list.
> >>>
> >>> There are clear indications that indexing too many words is a problem
> >>> for both GIN and GIST. If people already know what they'll be looking
> >>> for and what they will never be looking for, it seems easier to supply
> >>> that list up front, rather than hide it behind lots of hand-crafted
> >>> code.
> >>>
> >>> Can we include that functionality now?
> >>
> >> This could be realized very easily using dict_strict, which returns
> >> only known words, and mapping contains only this dictionary. So,
> >> feel free to write it and submit.
> >
> > So there isn't one yet, but you think it will be easy to write and that
> > we should call it dict_strict?
> 
> we have dict_synonym already and if your list is not big you'll be happy.

So I need to do something like

CREATE TEXT SEARCH DICTIONARY my_diction (
template = snowball,
synonym = include_only_these_words
);

which will then look for a file called include_only_these_words.syn?

I would prefer to be able to do something like this

CREATE TEXT SEARCH DICTIONARY my_diction (
template = snowball,
include = justthese
);
...which makes more sense to anyone reading it
and I also want to make the comparison case insensitive.

Would it be better to
1. include a new dictionary file (dict_strict, as you suggest)
2. a) allow case sensitivity as another option in dictionaries
   b) allow "include" as another word for "stoplist", but with the
meaning reversed?

e.g.

CREATE TEXT SEARCH DICTIONARY my_diction (
template = snowball,
include = justthese,
case_sensitive = true
);

-- 
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] ispell dictionary broken in CVS HEAD ?

2007-09-10 Thread Heikki Linnakangas
Tom Lane wrote:
> I don't know enough about ispell to
> understand what its config files look like.  (There's a problem of
> missing documentation here, too...)

Yeah :(. The file format that ispell accepts is kind of ad hoc. It
accepts hunspell and ispell and myspell variants, but only a subset of
the full grammar (some stuff is not relevant for tsearch). A description
of what exactly it's supposed to accept would be nice.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Include Lists for Text Search

2007-09-10 Thread Teodor Sigaev

There are clear indications that indexing too many words is a problem
for both GIN and GIST. If people already know what they'll be looking
GIN doesn't depend strongly on the number of words. It has log(N) behaviour in 
the number of words because it uses a B-Tree over the words.

--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

   http://www.postgresql.org/about/donate


Re: [HACKERS] Include Lists for Text Search

2007-09-10 Thread Simon Riggs
On Mon, 2007-09-10 at 16:10 +0400, Oleg Bartunov wrote:
> On Mon, 10 Sep 2007, Simon Riggs wrote:
> 
> > It seems possible to write your own functions to support various
> > possibilities with text search.
> >
> > One of the more common thoughts is to have a list of words that you
> > would like to include, i.e. the opposite of a stop word list.
> >
> > There are clear indications that indexing too many words is a problem
> > for both GIN and GIST. If people already know what they'll be looking
> > for and what they will never be looking for, it seems easier to supply
> > that list up front, rather than hide it behind lots of hand-crafted
> > code.
> >
> > Can we include that functionality now?
> 
> This could be realized very easily using dict_strict, which returns
> only known words, and mapping contains only this dictionary. So, 
> feel free to write it and submit.

So there isn't one yet, but you think it will be easy to write and that
we should call it dict_strict?

-- 
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Include Lists for Text Search

2007-09-10 Thread Teodor Sigaev

How does that allow me to limit the number of words to a known list?


If all dictionaries return NULL for a token, then that token will not be indexed at 
all.


--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [HACKERS] ispell dictionary broken in CVS HEAD ?

2007-09-10 Thread Teodor Sigaev



Is anyone working on providing basic regression tests for the different
dictionary types?  Seems like the main stumbling block is providing
I'll do some tests for dictionaries, but it will be a synthetic dictionary. 
The original ispell files are rather big, so I'll make a rather simple and small one.



--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Include Lists for Text Search

2007-09-10 Thread Oleg Bartunov

On Mon, 10 Sep 2007, Simon Riggs wrote:


On Mon, 2007-09-10 at 16:10 +0400, Oleg Bartunov wrote:

On Mon, 10 Sep 2007, Simon Riggs wrote:


It seems possible to write your own functions to support various
possibilities with text search.

One of the more common thoughts is to have a list of words that you
would like to include, i.e. the opposite of a stop word list.

There are clear indications that indexing too many words is a problem
for both GIN and GIST. If people already know what they'll be looking
for and what they will never be looking for, it seems easier to supply
that list up front, rather than hide it behind lots of hand-crafted
code.

Can we include that functionality now?


This could be realized very easily using dict_strict, which returns
only known words, and mapping contains only this dictionary. So,
feel free to write it and submit.


So there isn't one yet, but you think it will be easy to write and that
we should call it dict_strict?


we have dict_synonym already and if your list is not big you'll be happy.

Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

   http://www.postgresql.org/about/donate


Re: [HACKERS] Include Lists for Text Search

2007-09-10 Thread Oleg Bartunov

On Mon, 10 Sep 2007, Simon Riggs wrote:


On Mon, 2007-09-10 at 12:58 +0100, Heikki Linnakangas wrote:

Simon Riggs wrote:

It seems possible to write your own functions to support various
possibilities with text search.

One of the more common thoughts is to have a list of words that you
would like to include, i.e. the opposite of a stop word list.

There are clear indications that indexing too many words is a problem
for both GIN and GIST. If people already know what they'll be looking
for and what they will never be looking for, it seems easier to supply
that list up front, rather than hide it behind lots of hand-crafted
code.


I don't understand what you're proposing. We already have dict_synonym
that you can use to accept a simple list of words.


How does that allow me to limit the number of words to a known list?


text search doesn't index unknown words, so if your mapping contains
only one dictionary, this dictionary will control what words to index.
While dict_synonym is fine for a small list, I'd write a separate dictionary
with fast lookup.

Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Include Lists for Text Search

2007-09-10 Thread Simon Riggs
On Mon, 2007-09-10 at 12:58 +0100, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > It seems possible to write your own functions to support various
> > possibilities with text search.
> > 
> > One of the more common thoughts is to have a list of words that you
> > would like to include, i.e. the opposite of a stop word list. 
> > 
> > There are clear indications that indexing too many words is a problem
> > for both GIN and GIST. If people already know what they'll be looking
> > for and what they will never be looking for, it seems easier to supply
> > that list up front, rather than hide it behind lots of hand-crafted
> > code.
> 
> I don't understand what you're proposing. We already have dict_synonym
> that you can use to accept a simple list of words. 

How does that allow me to limit the number of words to a known list?

-- 
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Andrew Dunstan



Tom Lane wrote:

Andrew Dunstan <[EMAIL PROTECTED]> writes:
  

Tom Lane wrote:

In the short run it might be best to do it in scan.l after all.  
  


  
I have not come up with a way of doing that and handling the bytea case. 



AFAICS we have no realistic choice other than to reject \0 in SQL
literals; to do otherwise requires API changes throughout that stack of
modules.  And once you admit \0 is no good, it's not clear that
\somethingelse is any better for bytea-using clients.  Moreover, given
that we are moving away from backslash escapes as fast as we can sanely
travel, expending large amounts of energy to make them work better
doesn't seem like a good use of development manpower.

  
  


Perhaps we're talking at cross purposes.

The problem with doing encoding validation in scan.l is that it lacks 
context. Null bytes are only the tip of the bytea iceberg, since any 
arbitrary sequence of bytes can be valid for a bytea. So we can only do 
validation of encoding on a literal when we know it isn't destined for a 
bytea. That's what I haven't come up with a way of doing in the scanner 
(and as you noted upthread it's getting pretty darn late in the cycle 
for us to be looking for ways to do things).


I still don't see why it's OK for us to do validation from the foo_recv 
functions but not the corresponding foo_in functions. At least in the 
short term that would provide us with fairly complete protection against 
accepting invalidly encoded data into the database, once we fix up 
chr(), without having to mess with the scanner, parser, COPY code etc. 
We could still get corruption from UDFs and UDTs - it's hard to see how 
we can avoid that danger. Yes, we would need to make sure that any 
additions to the type system acted properly, and yes we should fix up 
any validation inefficiencies (like possibly inlining calls in the UTF8 
validation code). Personally, I don't see those as killer objections.


cheers

andrew



---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [HACKERS] integrated tsearch doesn't work with non utf8 database

2007-09-10 Thread Teodor Sigaev

Note the Seq Scan on pg_ts_config_map, with filter on ts_lexize(mapdict,
$1). That means that it will call ts_lexize on every dictionary, which
will try to load every dictionary. And loading danish_stem dictionary
fails in latin2 encoding, because of the problem with the stopword file.


Attached patch should fix it, I hope.

New plan:
 Hash Join  (cost=2.80..1073.85 rows=80 width=100)
   Hash Cond: (parse.tokid = tt.tokid)
   InitPlan
 ->  Seq Scan on pg_ts_config  (cost=0.00..1.20 rows=1 width=4)
   Filter: (oid = 11308::oid)
 ->  Seq Scan on pg_ts_config  (cost=0.00..1.20 rows=1 width=4)
   Filter: (oid = 11308::oid)
   ->  Function Scan on ts_parse parse  (cost=0.00..12.50 rows=1000 width=36)
   ->  Hash  (cost=0.20..0.20 rows=16 width=68)
 ->  Function Scan on ts_token_type tt  (cost=0.00..0.20 rows=16 
width=68)
   SubPlan
 ->  Limit  (cost=6.57..6.60 rows=1 width=36)
   ->  Subquery Scan dl  (cost=6.57..6.60 rows=1 width=36)
 ->  Sort  (cost=6.57..6.58 rows=1 width=8)
   Sort Key: ((ts_lexize(m.mapdict, $1) IS NULL)), 
m.mapseqno
   ->  Seq Scan on pg_ts_config_map m  (cost=0.00..6.56 
rows=1 width=8)
 Filter: ((mapcfg = 11308::oid) AND (maptokentype = 
$0))

 ->  Sort  (cost=6.57..6.57 rows=1 width=8)
   Sort Key: m.mapseqno
   ->  Seq Scan on pg_ts_config_map m  (cost=0.00..6.56 rows=1 width=8)
 Filter: ((mapcfg = 11308::oid) AND (maptokentype = $0))


At least, it checks only needed dictionaries.

--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/
*** ./src/backend/catalog/system_views.sql.orig Mon Sep 10 15:51:27 2007
--- ./src/backend/catalog/system_views.sql  Mon Sep 10 16:09:52 2007
***
*** 415,422 
  ( SELECT mapdict, pg_catalog.ts_lexize(mapdict, parse.token) AS 
lex
FROM pg_catalog.pg_ts_config_map AS m
WHERE m.mapcfg = $1 AND m.maptokentype = parse.tokid
!   ORDER BY m.mapseqno ) dl
! WHERE dl.lex IS NOT NULL
  LIMIT 1
  ) AS "Lexized token"
  FROM pg_catalog.ts_parse(
--- 415,421 
  ( SELECT mapdict, pg_catalog.ts_lexize(mapdict, parse.token) AS 
lex
FROM pg_catalog.pg_ts_config_map AS m
WHERE m.mapcfg = $1 AND m.maptokentype = parse.tokid
!   ORDER BY pg_catalog.ts_lexize(mapdict, parse.token) IS NULL, 
m.mapseqno ) dl
  LIMIT 1
  ) AS "Lexized token"
  FROM pg_catalog.ts_parse(

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Include Lists for Text Search

2007-09-10 Thread Oleg Bartunov

On Mon, 10 Sep 2007, Simon Riggs wrote:


It seems possible to write your own functions to support various
possibilities with text search.

One of the more common thoughts is to have a list of words that you
would like to include, i.e. the opposite of a stop word list.

There are clear indications that indexing too many words is a problem
for both GIN and GIST. If people already know what they'll be looking
for and what they will never be looking for, it seems easier to supply
that list up front, rather than hide it behind lots of hand-crafted
code.

Can we include that functionality now?


This could be realized very easily using dict_strict, which returns
only known words, with a mapping that contains only this dictionary. So, 
feel free to write it and submit it.



Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [HACKERS] Include Lists for Text Search

2007-09-10 Thread Heikki Linnakangas
Simon Riggs wrote:
> It seems possible to write your own functions to support various
> possibilities with text search.
> 
> One of the more common thoughts is to have a list of words that you
> would like to include, i.e. the opposite of a stop word list. 
> 
> There are clear indications that indexing too many words is a problem
> for both GIN and GIST. If people already know what they'll be looking
> for and what they will never be looking for, it seems easier to supply
> that list up front, rather than hide it behind lots of hand-crafted
> code.

I don't understand what you're proposing. We already have dict_synonym
that you can use to accept a simple list of words. But that doesn't
change the way GIN and GiST works.

?

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] A Silly Idea for Vertically-Oriented Databases

2007-09-10 Thread Simon Riggs
On Fri, 2007-09-07 at 13:52 -0700, Avery Payne wrote:

> So I've been seeing/hearing all of the hoopla over vertical databases 
> (column stores), and how they'll not only slice bread but also make 
> toast, etc.  I've done some quick searches for past articles on 
> "C-Store", "Vertica", "Column Store", and "Vertical Database", and have 
> seen little discussion on this. 

I looked at doing this a while back, for similar reasons.

ISTM we would be able to do this fairly well if we implemented
Index-only columns. i.e. columns that don't exist in the heap, only in
an index. 

Taken to the extreme, all columns could be removed from the heap and
placed in an index(es). Only the visibility information would remain on
the heap.

Syntax for this would be an ALTER TABLE SET STORAGE command, with a new
type of storage definition that will only be accepted if an index
already has been defined on the table which includes the specified
column. Doing this per column would be a big win over vertical databases
since AFAIK they *have* to do this to *every* column, even if it is not
beneficial to do so.

Every existing index plan works immediately. The main annoyance is
retrieving a column value that doesn't exist on the heap. That would
require a new kind of plan that involves preparing the index(es) by
sorting them on htid and then doing a left merge join with the main
heap. By now, most people will be screaming at their monitors "what an
idiot, thats gonna REALLY suck". True, but this is the same thing that
column-oriented databases have to do also, so it would be taking the
best that vertical databases have to offer and accepting the
consequences. There are some other plan possibilities also, apart from
this basic value retrieval, but they would require further index API
changes; I'm not certain of those as being primary use cases, however.

Vertical databases honestly do have their uses and there are many kinds
of marketing query that have complex where clauses yet only simple
select clauses. There are a number of industry-specific database
products that have utilised this technique to good effect for a number
of years.

So ISTM the main changes would be executor changes to allow retrieving
column values of index-only columns when they are required, and to
modify the insert/update tuple code somewhat. I thought maybe we can
call it COAST, Column-oriented attribute storage technique, :-)

-- 
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


[HACKERS] Include Lists for Text Search

2007-09-10 Thread Simon Riggs
It seems possible to write your own functions to support various
possibilities with text search.

One of the more common thoughts is to have a list of words that you
would like to include, i.e. the opposite of a stop word list. 

There are clear indications that indexing too many words is a problem
for both GIN and GIST. If people already know what they'll be looking
for and what they will never be looking for, it seems easier to supply
that list up front, rather than hide it behind lots of hand-crafted
code.

Can we include that functionality now?

-- 
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Hash index todo list item

2007-09-10 Thread Jens-Wolfhard Schicke
--On Samstag, September 08, 2007 18:56:23 -0400 Mark Mielke 
<[EMAIL PROTECTED]> wrote:

Kenneth Marshall wrote:
Along with the hypothetical performance
wins, the hash index space efficiency would be improved by a similar
factor. Obviously, all of these ideas would need to be tested in
various workload environments. In the large index arena, 10^6 to 10^9
keys and more, space efficiency will help keep the index manageable
in todays system memories.


Space efficiency is provided by not storing the key, nor the header data
required (length prefix?).
Space efficiency at ~1 entry per bucket: how about using closed hashing, 
saving in each page a bitmask up front which specifies which entries are 
valid, and in the rest of the page just row-pointers (is this the correct 
expression? I don't know...) without further data. That should provide a 
reasonably simple data structure and good alignment for the pointers.
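
Something like the following layout, as a sketch of that idea (sizes and
names are invented, and the pointer struct is a simplified stand-in for a
real heap item pointer):

#include <stdint.h>

#define SLOTS_PER_PAGE 512				/* fixed number of fixed-size slots */

typedef struct
{
	uint32_t	block;					/* heap block (simplified TID) */
	uint16_t	offset;					/* line pointer within the block */
} slot_pointer;

typedef struct
{
	uint64_t	valid[SLOTS_PER_PAGE / 64];	/* 1 validity bit per slot */
	slot_pointer slots[SLOTS_PER_PAGE];		/* row pointers, no keys */
} hash_leaf_page;

/* is slot i occupied? */
static inline int
slot_is_valid(const hash_leaf_page *p, unsigned i)
{
	return (int) ((p->valid[i / 64] >> (i % 64)) & 1);
}

Empty slots then cost only their fixed-size pointer entry, and the bitmap up
front says which of them to ignore, so no per-item header is needed.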



Please keep the ideas and comments coming. I am certain that a synthesis
of them will provide an implementation with the performance
characteristics
that we are seeking.


One should look into new plan nodes for "!= ANY()", "NOT EXISTS" and 
similar. A node like "look into hash and true if bucket is empty" would 
work without checking tuple visibility when the bucket is empty and could 
be a win in some situations.


Do we want special cases for short keys like INT4? In those cases the 
implementation might use hash == key and put that knowledge to use in 
plans. Even a unique constraint might then be doable. Does the 
postgresql-storage backend on linux support sparse files? Might be a win 
when holes in the sequence turn up.




---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread db
>>> I think the concern is when they use only one slash, like:
>>>   E'\377\000\377'::bytea
>>> which, as I mentioned before, is not correct anyway.
>
> Wait, why would this be wrong? How would you enter the three byte bytea of
> consisting of those three bytes described above?

Either as

E'\\377\\000\\377'

or

'\377\000\377'

/Dennis



---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Gregory Stark

"Tom Lane" <[EMAIL PROTECTED]> writes:

> Jeff Davis <[EMAIL PROTECTED]> writes:
>
>> I think the concern is when they use only one slash, like:
>>   E'\377\000\377'::bytea
>> which, as I mentioned before, is not correct anyway.

Wait, why would this be wrong? How would you enter the three byte bytea of
consisting of those three bytes described above?

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] invalidly encoded strings

2007-09-10 Thread Albe Laurenz
Tom Lane wrote:
>> . for chr() under UTF8, it seems to be generally agreed
>> that the argument should represent the codepoint and the
>> function should return the correspondingly encoded character.
>> If so, possible the argument should be a bigint to
>> accommodate the full range of possible code points.
>> It is not clear what the argument should represent for other
>> multi-byte encodings for any argument higher than 127.
>> Similarly, it is not clear what ascii() should return in
>> such cases. I would be inclined just to error out.
> 
> In SQL_ASCII I'd argue for allowing 0..255.  In actual MB
> encodings, OK with throwing error.

I'd like to repeat my suggestion for chr() and ascii().

Instead of the code point, I'd prefer the actual encoding of
the character as argument to chr() and return value of ascii().

The advantage I see is the following:

- It would make these functions from oracle_compat.c
  compatible with Oracle (Oracle's chr() and ascii() work
  the way I suggest).

I agree with Tom's earlier suggestion to throw an error for
chr(0), although this is not what Oracle does.

Of course, if it is generally perceived that the code point
is more useful than the encoding, then Oracle compliance
is probably secondary.

Yours,
Laurenz Albe

---(end of broadcast)---
TIP 6: explain analyze is your friend