Re: [HACKERS] invalidly encoded strings
>> Try the sequence below. Then, try to dump and then reload the database.
>> When you try to reload it, you will get an error:
>>
>> ERROR: invalid byte sequence for encoding "UTF8": 0xbd
>
> I know this could be a problem (like chr() with invalid byte pattern).

And that's enough of a problem already. We don't need more problems.

> What I really want to know is, whether a query something like this:
>
> SELECT * FROM japanese_table ORDER BY convert(japanese_text using
> utf8_to_euc_jp);
>
> could be a problem (I assume we use C locale).

If convert() produces a sequence of bytes that can't be interpreted as a
string in the server encoding, then it's broken. IMHO convert() should
return a bytea value. If we had good encoding/charset support we could do
better, but we can't today.

The above example would work fine if convert() returned a bytea. In the C
locale the string would be compared byte for byte, and that's what you get
with bytea values as well.

Strings are not sequences of bytes that can be interpreted in different
ways; that's what bytea values are. Strings are always in a specific
encoding, and in pg that encoding is fixed to a single one for a whole
cluster at initdb time. We should not confuse text with bytea.

/Dennis

---(end of broadcast)---
TIP 4: Have you searched our list archives?
http://archives.postgresql.org
Re: [HACKERS] invalidly encoded strings
On Tue, 2007-09-11 at 14:50 +0900, Tatsuo Ishii wrote:
> > On Tue, 2007-09-11 at 12:29 +0900, Tatsuo Ishii wrote:
> > > Please show me concrete examples of how I could introduce a
> > > vulnerability using this kind of convert() usage.
> >
> > Try the sequence below. Then, try to dump and then reload the database.
> > When you try to reload it, you will get an error:
> >
> > ERROR: invalid byte sequence for encoding "UTF8": 0xbd
>
> I know this could be a problem (like chr() with invalid byte pattern).
> What I really want to know is, whether a query something like this:
>
> SELECT * FROM japanese_table ORDER BY convert(japanese_text using
> utf8_to_euc_jp);
>
> could be a problem.

I guess I don't quite understand the question. I agree that ORDER BY
convert() must be safe in the C locale, because it just passes the strings
to strcmp().

Are you saying that we should not remove convert() until we can support
multiple locales in one database? If we make convert() operate on bytea and
return bytea, as Tom suggested, would that solve your use case?

Regards,
Jeff Davis
Re: [HACKERS] invalidly encoded strings
On Tue, Sep 11, 2007 at 11:27:50AM +0900, Tatsuo Ishii wrote:
> SELECT * FROM japanese_table ORDER BY convert(japanese_text using
> utf8_to_euc_jp);
>
> Without using convert(), he will get random order of data. This is
> because Kanji characters are in random order in UTF-8, while Kanji
> characters are reasonably ordered in EUC_JP.

The usual way to approach this is to make convert return bytea instead of
text. Then your problems vanish. Bytea can still be sorted, but it won't
be treated as a text string and thus does not need to conform to the
requirements of a text string.

Languages like perl distinguish between "encode", which is text->bytea,
and "decode", which is bytea->text. We've got "convert" for both, and that
causes problems.

Have a nice day,
--
Martijn van Oosterhout <[EMAIL PROTECTED]> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to
> litigate.
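[Editor's illustration] Martijn's encode/decode asymmetry can be sketched in Python (not PostgreSQL code; just the general principle): encoding text to bytes always succeeds, decoding bytes back to text must validate and can fail, and raw bytes sort bytewise with no validity requirement — exactly what a C-locale strcmp() or a bytea comparison gives you.

```python
# Text -> bytes ("encode") always succeeds for a representable string;
# bytes -> text ("decode") must validate the byte sequence, and can fail.
s = "©"                          # U+00A9, the copyright sign
b = s.encode("utf-8")
assert b == b"\xc2\xa9"          # the UTF-8 byte pattern for U+00A9

# Bytes compare bytewise regardless of whether they are valid text.
chunks = [b"\xc2\xa9", b"\xbd", b"abc"]
print(sorted(chunks))            # -> [b'abc', b'\xbd', b'\xc2\xa9']

# Decoding is the direction that can fail: 0xbd alone is a UTF-8
# continuation byte, never a valid first byte of a character.
try:
    b"\xbd".decode("utf-8")
except UnicodeDecodeError as e:
    print("invalid UTF-8:", e.reason)
```

This is the sense in which bytea "can still be sorted" without needing to conform to the requirements of a text string.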
Re: [HACKERS] invalidly encoded strings
> On Tue, 2007-09-11 at 12:29 +0900, Tatsuo Ishii wrote:
> > Please show me concrete examples of how I could introduce a
> > vulnerability using this kind of convert() usage.
>
> Try the sequence below. Then, try to dump and then reload the database.
> When you try to reload it, you will get an error:
>
> ERROR: invalid byte sequence for encoding "UTF8": 0xbd

I know this could be a problem (like chr() with invalid byte pattern).
What I really want to know is whether a query something like this:

SELECT * FROM japanese_table ORDER BY convert(japanese_text using
utf8_to_euc_jp);

could be a problem (I assume we use C locale).
--
Tatsuo Ishii
SRA OSS, Inc. Japan

> Regards,
> Jeff Davis
>
> test=> select version();
>                                      version
> ------------------------------------------------------------------
>  PostgreSQL 8.3devel on x86_64-unknown-linux-gnu, compiled by GCC gcc
>  (GCC) 4.1.3 20070601 (prerelease) (Debian 4.1.2-12)
> (1 row)
>
> test=> show lc_collate;
>  lc_collate
> -------------
>  en_US.UTF-8
> (1 row)
>
> test=> create table encoding_test(t text);
> CREATE TABLE
> test=> insert into encoding_test values('初');
> INSERT 0 1
> test=> insert into encoding_test values(convert('初' using
> utf8_to_euc_jp));
> INSERT 0 1
Re: [HACKERS] Ts_rank internals
> I tried to understand how ts_rank works, but I failed. What does the
> Cover function do? How does it work? What is the DocRepresentation data
> structure like? I can see the definition of the struct, and the
> get_docrep function to convert to that format, but by reading those I
> can't figure out what the resulting DocRepresentation looks like.
>
> I wonder if we could get rid of the istrue flag in QueryOperand, and use
> a local BitmapSet variable instead? It seems wrong to have a temporary
> flag that's only used in one function, in a struct that's used
> everywhere.

It's a play around CDR algorithms (Cover Density Ranking), based on the
paper Clarke et al., "Relevance Ranking for One to Three Term Queries"
(http://citeseer.ist.psu.edu/clarke00relevance.html). Sorry, I lost the
article itself, but maybe Oleg has it. A simple and short description is at
http://www2002.org/CDROM/refereed/643/node7.html. We changed the original
algorithm to support weights of lexemes; details are on Oleg's site:
http://www.sai.msu.su/~megera/wiki/NewExtentsBasedRanking

The array of DocRepresentation is a representation of the document: it
contains only the lexemes present in both the tsvector and the tsquery,
and the lexemes are ordered by position, as in the original document. Each
DocRepresentation has links to the corresponding QueryOperand to optimize
query execution during the extent search. When we enlarge the current
extent by one word, we set the istrue flag for the corresponding
QueryOperand, and executing the tsquery over the cover becomes a very
simple task.

It's possible to eliminate the istrue flag, but that requires implementing
an algorithm to execute the tsquery over a continuous part of the
document, not over the whole document.
--
Teodor Sigaev E-mail: [EMAIL PROTECTED]
WWW: http://www.sigaev.ru/
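[Editor's illustration] The extent search Teodor describes — scanning a position-ordered list of query lexemes for the smallest "cover" containing every query term — is the core of Cover Density Ranking. A toy Python sketch of that idea (not PostgreSQL's actual implementation; all names here are made up, and real tsqueries involve operators, not just term sets):

```python
def smallest_cover(doc, terms):
    """Find the shortest extent (cover) of `terms` in `doc`, where `doc`
    is a list of (position, lexeme) pairs sorted by position, containing
    only lexemes that occur in the query -- loosely like an array of
    DocRepresentation entries."""
    best = None
    needed = set(terms)
    for i, (start_pos, lex) in enumerate(doc):
        if lex not in needed:
            continue
        seen = set()
        # Enlarge the extent rightward one word at a time until every
        # query term is inside it (the istrue-flag step in the real code).
        for j in range(i, len(doc)):
            seen.add(doc[j][1])
            if seen >= needed:
                width = doc[j][0] - start_pos + 1
                if best is None or width < best[2]:
                    best = (start_pos, doc[j][0], width)
                break
    return best  # (first position, last position, width) or None

# "to be or not to be", keeping only the query lexemes with positions:
doc = [(1, "to"), (2, "be"), (4, "not"), (5, "to"), (6, "be")]
print(smallest_cover(doc, {"not", "be"}))  # -> (2, 4, 3)
```

In CDR, shorter covers contribute more to the rank (roughly proportional to 1/width), which is why finding them efficiently matters.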
Re: [HACKERS] invalidly encoded strings
On Tue, 2007-09-11 at 12:29 +0900, Tatsuo Ishii wrote:
> Please show me concrete examples of how I could introduce a vulnerability
> using this kind of convert() usage.

Try the sequence below. Then, try to dump and then reload the database.
When you try to reload it, you will get an error:

ERROR: invalid byte sequence for encoding "UTF8": 0xbd

Regards,
Jeff Davis

test=> select version();
                                     version
------------------------------------------------------------------
 PostgreSQL 8.3devel on x86_64-unknown-linux-gnu, compiled by GCC gcc
 (GCC) 4.1.3 20070601 (prerelease) (Debian 4.1.2-12)
(1 row)

test=> show lc_collate;
 lc_collate
-------------
 en_US.UTF-8
(1 row)

test=> create table encoding_test(t text);
CREATE TABLE
test=> insert into encoding_test values('初');
INSERT 0 1
test=> insert into encoding_test values(convert('初' using utf8_to_euc_jp));
INSERT 0 1
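[Editor's illustration] The failure mode Jeff demonstrates can be mimicked outside the database. The 0xbd byte in the error message is a UTF-8 continuation byte, which can never begin a character. A small Python sketch (the exact EUC-JP bytes for '初' are whatever the codec produces — the point is only that EUC-JP output is not required to be valid UTF-8):

```python
# Like convert(... using utf8_to_euc_jp), this produces EUC-JP bytes,
# which carry no guarantee of being a valid UTF-8 sequence.
euc = "初".encode("euc_jp")
print(euc)

# Storing such bytes in a UTF-8 "text" column means a later decode
# (e.g. at dump/reload time) must validate them -- and may fail.
try:
    euc.decode("utf-8")
    print("these bytes happened to be valid UTF-8")
except UnicodeDecodeError as e:
    print("invalid byte sequence:", e.reason)

# 0xbd by itself -- the byte in the server's error message -- is a UTF-8
# continuation byte and can never start a character:
try:
    b"\xbd".decode("utf-8")
except UnicodeDecodeError:
    print("0xbd alone is invalid UTF-8")
```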
[HACKERS] Testing 8.3 LDC vs. 8.2.4 with aggressive BGW
Renaming the old thread to more appropriately address the topic:

On Wed, 5 Sep 2007, Kevin Grittner wrote:
> Then I would test the new background writer with synchronous commits
> under the 8.3 beta, using various settings. The 0.5, 0.7 and 0.9
> settings you recommended for a test are how far from the LRU end of the
> cache to look for dirty pages to write, correct?

This is alluding to the suggestions I gave at
http://archives.postgresql.org/pgsql-hackers/2007-08/msg00755.php

checkpoint_completion_target has nothing to do with the LRU, so let's step
back to fundamentals and talk about what it actually does. The official
documentation is at
http://developer.postgresql.org/pgdocs/postgres/wal-configuration.html

As you generate transactions, Postgres puts data into the WAL. The WAL is
organized into segments that are typically 16MB each. Periodically, the
system hits a checkpoint where the WAL data up to a certain point is
guaranteed to have been applied to the database, at which point the old
WAL files aren't needed anymore and can be reused. These checkpoints are
generally caused by one of two things happening:

1) checkpoint_segments worth of WAL files have been written
2) more than checkpoint_timeout seconds have passed since the last
   checkpoint

The system doesn't stop working while the checkpoint is happening; it just
keeps creating new WAL files. As long as the checkpoint finishes before
the next one is required, performance should be fine.

In the 8.2 model, processing the checkpoint occurs as fast as data can be
written to disk. In 8.3, the writes can be spread out instead. What
checkpoint_completion_target does is suggest how far along the system
should aim to have finished the current checkpoint relative to when the
next one is expected. For example, your current system has
checkpoint_segments=10.
Assume that you have checkpoint_timeout set to a large number such that
checkpoints are typically being driven by the number of segments being
filled (so you get a checkpoint every 10 WAL segments, period). If
checkpoint_completion_target were set to 0.5, the expectation is that the
writes for the currently executing checkpoint would be finished about the
time that 0.5*10=5 segments of new WAL data had been written. If you set
it to 0.9 instead, you'd expect the checkpoint to finish just about when
the 9th WAL segment is being written out, which is cutting things a bit
tight; somewhere around there is the safe upper limit for that parameter.

Now, checkpoint_segments=10 is a pretty low setting, but I'm guessing that
on your current system that's forcing very regular checkpoints, which
makes each individual checkpoint have less work to do and therefore
reduces the impact of the spikes you're trying to avoid. With LDC and
checkpoint_completion_target, you can make that number much bigger (I
suggested 50), which means you'll only have 1/5 as many checkpoints
causing I/O spikes, and each of those checkpoints will have 5X as long to
potentially spread the writes over. The main cost is that it will take
longer to recover if your database crashes, which hopefully is a rare
event.

Having far fewer checkpoints is obviously a win for your situation, but
the open question is whether this fashion of spreading them out will
reduce the I/O spike as effectively as the all-scan background writer in
8.2 has been working for you. This is one aspect that makes your
comparison a bit tricky. It's possible that by increasing the segments
enough, you'll get into a situation where you don't see (m)any of them
during your testing run of 8.3. You should try to collect some data on how
regularly checkpoints are happening during early testing to get an idea if
this is a possibility.
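[Editor's illustration] The arithmetic Greg walks through can be written out directly (a toy sketch, not PostgreSQL code; the parameter names simply mirror the GUCs):

```python
def checkpoint_schedule(checkpoint_segments, completion_target,
                        segment_mb=16):
    """How much new WAL should have accumulated by the time the
    currently executing checkpoint's writes are expected to finish,
    assuming checkpoints are driven purely by segment count."""
    finish_at = checkpoint_segments * completion_target  # in segments
    return {
        "checkpoint_every_mb": checkpoint_segments * segment_mb,
        "finish_by_segment": finish_at,
        "finish_by_mb": finish_at * segment_mb,
    }

# Greg's example: a checkpoint every 10 segments, target 0.5 vs. 0.9
print(checkpoint_schedule(10, 0.5))  # writes done around segment 5
print(checkpoint_schedule(10, 0.9))  # around segment 9 -- cutting it tight
```

With checkpoint_segments=50 and target 0.9, the same formula spreads each checkpoint's writes over roughly 45 segments' worth of new WAL, which is the whole point of LDC.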
The usual approach is to set checkpoint_warning to a really high number
(like the maximum of 3600) and then you'll get a harmless note in the logs
every time one happens, and that will show you how frequently they're
occurring. It's kind of important to have an idea how many checkpoints you
can expect during each test run to put together a fair comparison; as you
increase checkpoint_segments, you need to adopt a mindset that is
considering "how many sluggish transactions am I seeing per checkpoint?",
not how many total per test run.

I have a backport of some of the pg_stat_bgwriter features added to 8.3
that can be applied to 8.2, which might be helpful for monitoring your
test benchmarking server (this is most certainly *not* suitable to go onto
the real one):
http://www.westnet.com/~gsmith/content/postgresql/perfmon82.htm
I put that together specifically to allow easier comparisons of 8.2 and
8.3 in this area.

Are the current shared memory and the 1 GB you suggested enough of a
spread for these tests? (At several hours per test in order to get
meaningful results, I don't want to get into too many p
Re: [HACKERS] invalidly encoded strings
On Tue, 2007-09-11 at 11:53 +0900, Tatsuo Ishii wrote:
> > Isn't the collation a locale issue, not an encoding issue? Is there a
> > ja_JP.UTF-8 that defines the proper order?
>
> I don't think it helps. The point is, he needs different languages'
> collations, while PostgreSQL allows only one collation (locale) per
> database cluster.

My thought was: if we had some function that treated a string as a
different locale, it might solve the problem without violating the
database encoding. Converting to a different encoding, and producing a
text result, is the source of the problem.

Regards,
Jeff Davis
[HACKERS] What is happening on buildfarm member dugong?
dugong has been failing contribcheck repeatably for the last day or so,
with a very interesting symptom: CREATE DATABASE is failing with

ERROR: could not fsync segment 0 of relation 1663/40960/41403: No such
file or directory
ERROR: checkpoint request failed
HINT: Consult recent messages in the server log for details.

I'd think we broke something in fsync cancellation signalling, except that
AFAICS there were no CVS commits at all between the last working build at
2007-09-09 23:05:01 UTC and the first failure at 2007-09-10 00:45:27 UTC
(and even if I got the timezone conversion wrong, there are no nearby
commits in the backend/storage area).

Has this machine had any system-software updates around then? Can anyone
suggest another theory?

regards, tom lane
Re: [HACKERS] invalidly encoded strings
> Tatsuo Ishii <[EMAIL PROTECTED]> writes:
> >> BTW, it strikes me that there is another hole that we need to plug in
> >> this area, and that's the convert() function. Being able to create
> >> a value of type text that is not in the database encoding is simply
> >> broken. Perhaps we could make it work on bytea instead (providing
> >> a cast from text to bytea but not vice versa), or maybe we should just
> >> forbid the whole thing if the database encoding isn't SQL_ASCII.
>
> > Please don't do that. It will break a useful use case of convert().
>
> The reason we have a problem here is that we've been choosing
> convenience over safety in encoding-related issues. I wonder if we must
> stoop to having a "strict_encoding_checks" GUC variable to satisfy
> everyone.

Please show me concrete examples of how I could introduce a vulnerability
using this kind of convert() usage.

> > A user has a database encoded in UTF-8. He has English, French,
> > Chinese and Japanese data in tables. To sort the tables in the
> > language order, he will do like this:
>
> > SELECT * FROM japanese_table ORDER BY convert(japanese_text using
> > utf8_to_euc_jp);
>
> > Without using convert(), he will get random order of data.
>
> I'd say that *with* convert() he will get a random order of data. This
> is making a boatload of unsupportable assumptions about the locale and
> encoding of the surrounding database. There are a lot of bad-encoding
> situations for which strcoll() simply breaks down completely and can't
> even deliver self-consistent answers.
>
> It might work the way you are expecting if the database uses SQL_ASCII
> encoding and C locale --- and I'd be fine with allowing convert() only
> when the database encoding is SQL_ASCII.

I don't believe that. With C locale, the convert() works fine as I
described.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
Re: [HACKERS] invalidly encoded strings
Tatsuo Ishii <[EMAIL PROTECTED]> writes:
>> BTW, it strikes me that there is another hole that we need to plug in
>> this area, and that's the convert() function. Being able to create
>> a value of type text that is not in the database encoding is simply
>> broken. Perhaps we could make it work on bytea instead (providing
>> a cast from text to bytea but not vice versa), or maybe we should just
>> forbid the whole thing if the database encoding isn't SQL_ASCII.

> Please don't do that. It will break a useful use case of convert().

The reason we have a problem here is that we've been choosing convenience
over safety in encoding-related issues. I wonder if we must stoop to
having a "strict_encoding_checks" GUC variable to satisfy everyone.

> A user has a database encoded in UTF-8. He has English, French,
> Chinese and Japanese data in tables. To sort the tables in the
> language order, he will do like this:

> SELECT * FROM japanese_table ORDER BY convert(japanese_text using
> utf8_to_euc_jp);

> Without using convert(), he will get random order of data.

I'd say that *with* convert() he will get a random order of data. This is
making a boatload of unsupportable assumptions about the locale and
encoding of the surrounding database. There are a lot of bad-encoding
situations for which strcoll() simply breaks down completely and can't
even deliver self-consistent answers.

It might work the way you are expecting if the database uses SQL_ASCII
encoding and C locale --- and I'd be fine with allowing convert() only
when the database encoding is SQL_ASCII.

regards, tom lane
Re: [HACKERS] invalidly encoded strings
Andrew Dunstan <[EMAIL PROTECTED]> writes:
> I'm not sure we are going to be able to catch every path by which
> invalid data can get into the database in one release. I suspect we
> might need two or three goes at this. (I'm just wondering if the
> routines that return cstrings are a possible vector.)

We need to have a policy that cstrings are in the database encoding.

regards, tom lane
Re: [HACKERS] invalidly encoded strings
Jeff Davis wrote:
> On Tue, 2007-09-11 at 11:27 +0900, Tatsuo Ishii wrote:
> > > BTW, it strikes me that there is another hole that we need to plug
> > > in this area, and that's the convert() function. Being able to
> > > create a value of type text that is not in the database encoding is
> > > simply broken. Perhaps we could make it work on bytea instead
> > > (providing a cast from text to bytea but not vice versa), or maybe
> > > we should just forbid the whole thing if the database encoding isn't
> > > SQL_ASCII.
> >
> > Please don't do that. It will break a useful use case of convert().
> >
> > A user has a database encoded in UTF-8. He has English, French,
> > Chinese and Japanese data in tables. To sort the tables in the
> > language order, he will do like this:
> >
> > SELECT * FROM japanese_table ORDER BY convert(japanese_text using
> > utf8_to_euc_jp);
> >
> > Without using convert(), he will get random order of data. This is
> > because Kanji characters are in random order in UTF-8, while Kanji
> > characters are reasonably ordered in EUC_JP.
>
> Isn't the collation a locale issue, not an encoding issue? Is there a
> ja_JP.UTF-8 that defines the proper order?

That won't help you much if you have the whole collection of languages
mentioned above.

cheers

andrew
Re: [HACKERS] invalidly encoded strings
Tatsuo Ishii wrote:
> > BTW, it strikes me that there is another hole that we need to plug in
> > this area, and that's the convert() function. Being able to create
> > a value of type text that is not in the database encoding is simply
> > broken. Perhaps we could make it work on bytea instead (providing
> > a cast from text to bytea but not vice versa), or maybe we should just
> > forbid the whole thing if the database encoding isn't SQL_ASCII.
>
> Please don't do that. It will break a useful use case of convert().
>
> A user has a database encoded in UTF-8. He has English, French,
> Chinese and Japanese data in tables. To sort the tables in the
> language order, he will do like this:
>
> SELECT * FROM japanese_table ORDER BY convert(japanese_text using
> utf8_to_euc_jp);
>
> Without using convert(), he will get random order of data. This is
> because Kanji characters are in random order in UTF-8, while Kanji
> characters are reasonably ordered in EUC_JP.

Tatsuo-san, would not this case be at least as well met by an operator
supplying the required ordering? The operator of course would not have the
danger of supplying values that are invalid in the database encoding.
Admittedly, the user might need several operators for the case you
describe.

I'm not sure we are going to be able to catch every path by which invalid
data can get into the database in one release. I suspect we might need two
or three goes at this. (I'm just wondering if the routines that return
cstrings are a possible vector.)

cheers

andrew
Re: [HACKERS] invalidly encoded strings
> On Tue, 2007-09-11 at 11:27 +0900, Tatsuo Ishii wrote:
> > > BTW, it strikes me that there is another hole that we need to plug
> > > in this area, and that's the convert() function. Being able to
> > > create a value of type text that is not in the database encoding is
> > > simply broken. Perhaps we could make it work on bytea instead
> > > (providing a cast from text to bytea but not vice versa), or maybe
> > > we should just forbid the whole thing if the database encoding isn't
> > > SQL_ASCII.
> >
> > Please don't do that. It will break a useful use case of convert().
> >
> > A user has a database encoded in UTF-8. He has English, French,
> > Chinese and Japanese data in tables. To sort the tables in the
> > language order, he will do like this:
> >
> > SELECT * FROM japanese_table ORDER BY convert(japanese_text using
> > utf8_to_euc_jp);
> >
> > Without using convert(), he will get random order of data. This is
> > because Kanji characters are in random order in UTF-8, while Kanji
> > characters are reasonably ordered in EUC_JP.
>
> Isn't the collation a locale issue, not an encoding issue? Is there a
> ja_JP.UTF-8 that defines the proper order?

I don't think it helps. The point is, he needs different languages'
collations, while PostgreSQL allows only one collation (locale) per
database cluster.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
Re: [HACKERS] invalidly encoded strings
On Tue, 2007-09-11 at 11:27 +0900, Tatsuo Ishii wrote:
> > BTW, it strikes me that there is another hole that we need to plug in
> > this area, and that's the convert() function. Being able to create
> > a value of type text that is not in the database encoding is simply
> > broken. Perhaps we could make it work on bytea instead (providing
> > a cast from text to bytea but not vice versa), or maybe we should just
> > forbid the whole thing if the database encoding isn't SQL_ASCII.
>
> Please don't do that. It will break a useful use case of convert().
>
> A user has a database encoded in UTF-8. He has English, French,
> Chinese and Japanese data in tables. To sort the tables in the
> language order, he will do like this:
>
> SELECT * FROM japanese_table ORDER BY convert(japanese_text using
> utf8_to_euc_jp);
>
> Without using convert(), he will get random order of data. This is
> because Kanji characters are in random order in UTF-8, while Kanji
> characters are reasonably ordered in EUC_JP.

Isn't the collation a locale issue, not an encoding issue? Is there a
ja_JP.UTF-8 that defines the proper order?

Regards,
Jeff Davis
Re: [HACKERS] "txn" in pg_stat_activity
Neil Conway <[EMAIL PROTECTED]> writes:
> I personally find "xact" to be a less intuitive abbreviation of
> "transaction" than "txn", but for the sake of consistency, I agree it is
> probably better to use "xact_start".

Barring other objections, I'll go make this happen.

regards, tom lane
Re: [HACKERS] invalidly encoded strings
> Tatsuo Ishii <[EMAIL PROTECTED]> writes:
> > If you regard the unicode code point as simply a number, why not
> > regard the multibyte characters as a number too?
>
> Because there's a standard specifying the Unicode code points *as
> numbers*. The mapping from those numbers to UTF8 strings (and other
> representations) is well-defined by the standard.
>
> > Also I'm wondering what we should do with different
> > backend/frontend encoding combos.
>
> Nothing. chr() has always worked with reference to the database
> encoding, and we should keep it that way.

Where is it documented?

> BTW, it strikes me that there is another hole that we need to plug in
> this area, and that's the convert() function. Being able to create
> a value of type text that is not in the database encoding is simply
> broken. Perhaps we could make it work on bytea instead (providing
> a cast from text to bytea but not vice versa), or maybe we should just
> forbid the whole thing if the database encoding isn't SQL_ASCII.

Please don't do that. It will break a useful use case of convert().

A user has a database encoded in UTF-8. He has English, French, Chinese
and Japanese data in tables. To sort the tables in the language order, he
will do like this:

SELECT * FROM japanese_table ORDER BY convert(japanese_text using
utf8_to_euc_jp);

Without using convert(), he will get random order of data. This is because
Kanji characters are in random order in UTF-8, while Kanji characters are
reasonably ordered in EUC_JP.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
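[Editor's illustration] Tatsuo's ORDER BY convert(...) trick amounts to sorting rows by their byte representation in another encoding. A rough Python analogue (not PostgreSQL code; whether EUC-JP byte order actually matches a natural kanji ordering is the premise of his argument, not something this sketch verifies):

```python
# Under the C locale, ORDER BY convert(col using utf8_to_euc_jp) is a
# plain bytewise comparison of the converted form. The Python analogue
# is sorting by the encoded bytes:
def order_by_converted(rows, encoding="euc_jp"):
    return sorted(rows, key=lambda s: s.encode(encoding))

rows = ["日本語", "テスト", "漢字"]
by_utf8 = sorted(rows, key=lambda s: s.encode("utf-8"))
by_eucjp = order_by_converted(rows)

# Both are total orders over the same rows; they just compare different
# byte sequences, so the resulting orderings can differ.
assert sorted(by_utf8) == sorted(by_eucjp)
print(by_eucjp)
```

Tom's objection is that once the converted bytes are treated as *text* in the database encoding, this same trick can smuggle in invalid byte sequences — which is why returning bytea would preserve the ordering use case without the hazard.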
Re: [HACKERS] invalidly encoded strings
Tatsuo Ishii wrote:
> If you regard the unicode code point as simply a number, why not regard
> the multibyte characters as a number too? I mean, since 0xC2A9 = 49833,
> "select chr(49833)" should work fine no?

No. The number corresponding to a given byte pattern depends on the
endianness of the architecture. That's exactly why we can't sanely use the
byte pattern of the encoded characters as numbers.

cheers

andrew
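[Editor's illustration] Andrew's endianness point is easy to demonstrate. In Python, `struct` stands in for how a C cast would reinterpret the same two bytes on machines of different byte order:

```python
import struct

# The two bytes of the UTF-8 encoding of U+00A9 (the copyright sign):
pattern = b"\xc2\xa9"

# The same byte pattern read as a 16-bit integer gives different numbers
# depending on byte order -- so a byte pattern is not a portable
# "number" for a character, unlike a Unicode code point:
(big,) = struct.unpack(">H", pattern)     # big-endian read
(little,) = struct.unpack("<H", pattern)  # little-endian read
print(big)     # 49833 == 0xC2A9
print(little)  # 43458 == 0xA9C2
```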
Re: [HACKERS] invalidly encoded strings
Tatsuo Ishii <[EMAIL PROTECTED]> writes:
> If you regard the unicode code point as simply a number, why not
> regard the multibyte characters as a number too?

Because there's a standard specifying the Unicode code points *as
numbers*. The mapping from those numbers to UTF8 strings (and other
representations) is well-defined by the standard.

> Also I'm wondering what we should do with different
> backend/frontend encoding combos.

Nothing. chr() has always worked with reference to the database encoding,
and we should keep it that way.

BTW, it strikes me that there is another hole that we need to plug in this
area, and that's the convert() function. Being able to create a value of
type text that is not in the database encoding is simply broken. Perhaps
we could make it work on bytea instead (providing a cast from text to
bytea but not vice versa), or maybe we should just forbid the whole thing
if the database encoding isn't SQL_ASCII.

regards, tom lane
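[Editor's illustration] The code-point-as-number model Tom describes can be shown with Python's chr(), which likewise takes a Unicode code point (this mirrors the U+00A9 example used elsewhere in the thread; PostgreSQL's own chr() semantics were exactly what was under discussion):

```python
# A Unicode code point is a number defined by the standard, independent
# of any byte-level representation:
c = chr(169)                       # U+00A9, the copyright sign
assert c == "©"

# The mapping from that number to concrete bytes is per-encoding and
# well-defined -- different encodings, different byte patterns:
assert c.encode("utf-8") == b"\xc2\xa9"
assert c.encode("latin-1") == b"\xa9"
print(hex(ord(c)))  # 0xa9
```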
Re: [HACKERS] invalidly encoded strings
> Tatsuo Ishii wrote:
> >
> > I don't understand the whole discussion.
> >
> > Why do you think that employing the Unicode code point as the chr()
> > argument could avoid endianness issues? Are you going to represent
> > Unicode code points as UCS-4? Then you have to specify the endianness
> > anyway. (See the UCS-4 standard for more details.)
>
> The code point is simply a number. The result of chr() will be a text
> value one char (not one byte) wide, in the relevant database encoding.
>
> U+ maps to the same Unicode char and hence the same UTF8 encoding
> pattern regardless of endianness. e.g. U+00A9 is the copyright symbol
> on all machines. So to get this char in a UTF8 database you could call
> "select chr(169)" and get back the byte pattern \xC2A9.

If you regard the unicode code point as simply a number, why not regard
the multibyte characters as a number too? I mean, since 0xC2A9 = 49833,
"select chr(49833)" should work fine, no?

Also I'm wondering what we should do with different backend/frontend
encoding combos. For example, if your database is in UTF-8, and your
client encoding is LATIN2, what integer value should be passed to chr()?
A LATIN2 value or a Unicode code point?

> > Or are you going to represent a Unicode point as a character string
> > such as 'U+0259'? Then representing any encoding as a string could
> > avoid endianness issues anyway, and I don't see that a Unicode code
> > point is any better than the others.
>
> The argument will be a number, as now.
>
> > Also I'd like to point out that all encodings have their own code
> > point systems as far as I know. For example, EUC-JP has its
> > corresponding code point systems: ASCII, JIS X 0208 and JIS X 0212.
> > So I don't see why we can't use "code point" as chr()'s argument for
> > other encodings (of course we need an optional parameter specifying
> > which character set is supposed).
>
> Where can I find the tables that map code points (as opposed to
> encodings) to characters for these others?

You mean a code point table of a character set? The actual standard is not
on the web since it is copyrighted by the Japanese government (you need to
buy it as a book or a pdf file). However, you can find many code point
tables on the web. For example, JIS X 0208 code points can be found at:
http://www.infonet.co.jp/ueyama/ip/binary/x0208txt.html
(you need to have a Japanese font and set the page encoding to Shift JIS)

BTW, if you want to pass a "code point of a character set" rather than an
encoded value, you need to tell chr() which character set you are
referring to. So we need to have two arguments: one for the code point,
the other for the character set specification. What do you think?
--
Tatsuo Ishii
SRA OSS, Inc. Japan
Re: [HACKERS] invalidly encoded strings
> Tatsuo Ishii wrote: > > > > I don't understand whole discussion. > > > > Why do you think that employing the Unicode code point as the chr() > > argument could avoid endianness issues? Are you going to represent > > Unicode code point as UCS-4? Then you have to specify the endianness > > anyway. (see the UCS-4 standard for more details) > > > > The code point is simply a number. The result of chr() will be a text > value one char (not one byte) wide, in the relevant database encoding. > > U+ maps to the same Unicode char and hence the same UTF8 encoding > pattern regardless of endianness. e.g. U+00a9 is the copyright symbol on > all machines. So to get this char in a UTF8 database you could call > "select chr(169)" and get back the byte pattern \xC2A9. If you regard the unicode code point as simply a number, why not regard the multibyte characters as a number too? I mean, since 0xC2A9 = 49833, "select chr(49833)" should work fine no? Also I'm wondering you what we should do with different backend/frontend encoding combo. For example, if your database is in UTF-8, and your client encoding is LATIN2, what integer value should be passed to chr()? LATIN2 or Unicode code point? > > Or are you going to represent Unicode point as a character string such > > as 'U+0259'? Then representing any encoding as a string could avoid > > endianness issues anyway, and I don't see Unicode code point is any > > better than others. > > > > The argument will be a number, as now. > > > Also I'd like to point out all encodings has its own code point > > systems as far as I know. For example, EUC-JP has its corresponding > > code point systems, ASCII, JIS X 0208 and JIS X 0212. So I don't see > > we can't use "code point" as chr()'s argument for othe encodings(of > > course we need optional parameter specifying which character set is > > supposed). > > > > Where can I find the tables that map code points (as opposed to > encodings) to characters for these others? 
You mean the code point table of a character set? The actual standard is not on the web since it is copyrighted by the Japanese government (you need to buy it as a book or a PDF file). However you could find many code point tables on the web. For example, JIS X 0208 code points can be found on: http://www.infonet.co.jp/ueyama/ip/binary/x0208txt.html (you need to have a Japanese font and set page encoding to Shift JIS) BTW, if you want to pass the "code point of a character set" rather than an encoding value, you need to tell chr() what character set you are referring to. So we need to have two arguments: one for the code point, the other for the character set specification. What do you think? -- Tatsuo Ishii SRA OSS, Inc. Japan ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] "txn" in pg_stat_activity
On Mon, 2007-09-10 at 21:04 -0400, Tom Lane wrote: > I have just noticed that a column "txn_start" has appeared in > pg_stat_activity since 8.2. It's a good idea, but who chose the name? Me. > I'm inclined to rename it to "xact_start", which is an abbreviation > that we *do* use in the code, and in some user-visible places, eg, > pg_stat_database has "xact_commit" and "xact_rollback" columns. I personally find "xact" to be a less intuitive abbreviation of "transaction" than "txn", but for the sake of consistency, I agree it is probably better to use "xact_start". -Neil ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
[HACKERS] "txn" in pg_stat_activity
I have just noticed that a column "txn_start" has appeared in pg_stat_activity since 8.2. It's a good idea, but who chose the name? We do not use that abbreviation for "transaction" anywhere else in Postgres, certainly not in any user-exposed places. I'm inclined to rename it to "xact_start", which is an abbreviation that we *do* use in the code, and in some user-visible places, eg, pg_stat_database has "xact_commit" and "xact_rollback" columns. Other suggestions? If we don't change it before 8.3 beta starts, we'll be stuck with it ... regards, tom lane ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] [ADMIN] reindexdb hangs
Alvaro Herrera <[EMAIL PROTECTED]> writes: > I am unsure if I should backpatch to 8.1: the code in cluster.c has > changed, and while it is relatively easy to modify the patch, this is a > rare bug and nobody has reported it in CLUSTER (not many people clusters > temp tables, it seems). Should I patch only REINDEX? How far back? I'd say go as far back as you can conveniently modify the patch for. This is a potential data-loss bug (even if only for temporary data) so we ought to take it seriously. regards, tom lane ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] A Silly Idea for Vertically-Oriented Databases
Avery Payne wrote: > >I thought maybe we can call it COAST, Column-oriented attribute storage > technique, :-) > > I like it. :-) I > just wish I would have read this before applying for a project name > at pgfoundry, the current proposal is given as "pg-cstore". You can email the pgfoundry admins at [EMAIL PROTECTED] and ask for it to be cancelled instead of approved, and resubmit with the other name. -- Alvaro Herrera http://www.PlanetPostgreSQL.org/ "In fact, the basic problem with Perl 5's subroutines is that they're not crufty enough, so the cruft leaks out into user-defined code instead, by the Conservation of Cruft Principle." (Larry Wall, Apocalypse 6) ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] [ADMIN] reindexdb hangs
Tom Lane wrote: > Alvaro Herrera <[EMAIL PROTECTED]> writes: > > I'm not sure I follow. Are you suggesting adding a new function, > > similar to pg_class_ownercheck, which additionally checks for temp-ness? > > No, I was just suggesting adding the check for temp-ness in cluster() > and cluster_rel() where we do pg_class_ownercheck. We already have the > rel open there and so it's cheap to do the temp-ness check. I applied a patch along these lines to HEAD and 8.2. I am unsure if I should backpatch to 8.1: the code in cluster.c has changed, and while it is relatively easy to modify the patch, this is a rare bug and nobody has reported it in CLUSTER (not many people cluster temp tables, it seems). Should I patch only REINDEX? How far back? -- Alvaro Herrera http://www.amazon.com/gp/registry/5ZYLFMCVHXC "The eagle never lost so much time, as when he submitted to learn of the crow." (William Blake) ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] A Silly Idea for Vertically-Oriented Databases
>ISTM we would be able to do this fairly well if we implemented >Index-only columns. i.e. columns that don't exist in the heap, only in >an index. >Taken to the extreme, all columns could be removed from the heap and >placed in an index(es). Only the visibility information would remain on >the heap. So, let me understand this correctly - you want to index the columns and use the index to reconstruct the data? Some kind of "implicit" reconstruction? >Doing this per column would be a big win over vertical databases >since AFAIK they *have* to do this to *every* column, even if it is not >beneficial to do so. I was thinking about something a little more crude - each column being a free-standing table, but being "viewed" by the client as a single entity, a kind of "data federation". The idea was that the existing storage mechanism wouldn't be altered, and with a little sleight of hand, we could extend the mechanism without hampering things like clustering, future extensions, optimizations, etc. No changes to MVCC would be needed, because it would work above and through it. The idea was that for a "wide" table you target only a few columns, so you could read sequentially without the penalty of having to seek to a new offset for each record. Instead of processing all the columns, you process a single column, which means less data to read for the same number of records. That gain starts to slope off when you specify more and more columns from the table. Throw in selective indexing (similar to what you were talking about) and suddenly we can reclaim some of that lost speed. We'll make a kind of "compressed index", where the key turns into a hash that points to (is attached to?) a bucket, and the bucket contains the offsets of all the records that relate to that key. Other tricks can be employed, such as run-length encoding entire ranges of offsets, etc. to keep this really really small. Really small = faster and faster to read, using more CPU than I/O. 
And I/O is still more of an issue than CPU at this point. Then again, if you ditch the column altogether and use a "compressed index" to reconstruct data implicitly, now we're close to what you were talking about (assuming I understand you correctly and also assuming that PostgreSQL doesn't already do this with indexes). So, if the column is indexed, then maybe split it off into a "compressed index", and if not, keep it in the main table outright? I guess I really need to think about this a bit more before I start delving into code. >I thought maybe we can call it COAST, Column-oriented attribute storage technique, :-) I like it. :-) I just wish I would have read this before applying for a project name at pgfoundry, the current proposal is given as "pg-cstore".
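The run-length-encoding trick Avery mentions for keeping the offset buckets "really really small" can be sketched like this (an illustrative Python toy, not a proposed on-disk format): consecutive record offsets for a key collapse into (start, run_length) pairs.

```python
def rle_encode(offsets):
    """Collapse a sorted list of record offsets into (start, run_length) pairs."""
    runs = []
    for off in offsets:
        if runs and off == runs[-1][0] + runs[-1][1]:
            # Offset extends the current run by one.
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)
        else:
            runs.append((off, 1))
    return runs

def rle_decode(runs):
    """Expand (start, run_length) pairs back into the full offset list."""
    return [start + i for start, n in runs for i in range(n)]

offsets = [10, 11, 12, 13, 40, 41, 90]
runs = rle_encode(offsets)
assert runs == [(10, 4), (40, 2), (90, 1)]
assert rle_decode(runs) == offsets
```

Dense ranges of offsets, common when a key clusters physically, compress to a single pair, which is where the CPU-for-I/O trade-off Avery describes comes from.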
Re: [HACKERS] ispell dictionary broken in CVS HEAD ?
Teodor Sigaev <[EMAIL PROTECTED]> writes: >> Is anyone working on providing basic regression tests for the different >> dictionary types? Seems like the main stumbling block is providing > I make some small tests (http://www.sigaev.ru/misc/ispell_samples.tgz). So, > what is better practice to builtin it? Make it installable with regular > procedure into share/tsearch_data or install they only for regression db? You have to install them as part of the regular installation; pg_regress can't do it in the "make installcheck" case because it may not have write permissions on $SHAREDIR. regards, tom lane ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] ispell dictionary broken in CVS HEAD ?
Is anyone working on providing basic regression tests for the different dictionary types? Seems like the main stumbling block is providing I make some small tests (http://www.sigaev.ru/misc/ispell_samples.tgz). So, what is the better practice to build it in? Make them installable with the regular procedure into share/tsearch_data, or install them only for the regression db? -- Teodor Sigaev E-mail: [EMAIL PROTECTED] WWW: http://www.sigaev.ru/ ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] GucContext of log_autovacuum
FYI, this has been committed by Tom. --- ITAGAKI Takahiro wrote: > The GucContext of log_autovacuum is PGC_BACKEND in the CVS HEAD, > but should it be PGC_SIGHUP? We cannot modify the variable on-the-fly > because the parameter is used only by autovacuum worker processes. > The similar variables, like autovacuum_vacuum_scale_factor, are > defined as PGC_SIGHUP.
>
> Index: src/backend/utils/misc/guc.c
> ===================================================================
> --- src/backend/utils/misc/guc.c	(head)
> +++ src/backend/utils/misc/guc.c	(working copy)
> @@ -1552,7 +1552,7 @@
>  	},
> 
>  	{
> -		{"log_autovacuum", PGC_BACKEND, LOGGING_WHAT,
> +		{"log_autovacuum", PGC_SIGHUP, LOGGING_WHAT,
>  			gettext_noop("Sets the minimum execution time above which autovacuum actions "
>  						 "will be logged."),
>  			gettext_noop("Zero prints all actions. The default is -1 (turning this feature off)."),
>
> Regards, > --- > ITAGAKI Takahiro > NTT Open Source Software Center > > > ---(end of broadcast)--- > TIP 4: Have you searched our list archives? > >http://archives.postgresql.org -- Bruce Momjian <[EMAIL PROTECTED]> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] invalidly encoded strings
On Mon, Sep 10, 2007 at 11:48:29AM -0400, Tom Lane wrote: > BTW, I'm sure this was discussed but I forgot the conclusion: should > chr(0) throw an error? If we're trying to get rid of embedded-null > problems, seems it must. It is pointed out on Wikipedia that Java sometimes uses the byte pair C0 80 to represent the NUL character, to allow it to be embedded in C strings without tripping anything up. It is however technically an illegal representation (violates the minimal representation rule) and thus rejected by postgres. I'm not suggesting we copy this, but it does show there are other ways to deal with this. Have a nice day, -- Martijn van Oosterhout <[EMAIL PROTECTED]> http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to > litigate.
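The Java trick Martijn describes can be demonstrated with any strict decoder (Python here, purely as an illustration): C0 80 denotes U+0000 only under Java's "modified UTF-8"; a decoder that enforces the minimal-representation rule rejects it.

```python
# The minimal (and only legal) UTF-8 encoding of NUL is a single zero byte.
assert "\x00".encode("utf-8") == b"\x00"

# C0 80 is an overlong two-byte form of the same code point; a strict
# decoder refuses it, just as postgres's encoding check does.
overlong = b"\xc0\x80"
try:
    overlong.decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True
assert rejected
```

The appeal of the Java hack is exactly that C strings then never contain a raw zero byte, but at the cost of emitting byte sequences that standard-conforming UTF-8 tools must refuse.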
Re: [HACKERS] Maybe some more low-hanging fruit in the latestCompletedXid patch.
Tom Lane wrote: "Florian G. Pflug" <[EMAIL PROTECTED]> writes: Currently, we do not assume that either the childXids array, nor the xid cache in the proc array are sorted by ascending xid order. I believe that we could simplify the code, further reduce the locking requirements, and enable a transaction to de-overflow its xid cache if we assume that those arrays are in ascending xid order. "de-overflowing" the cache sounds completely unsafe, as other backends need that state to determine whether they need to look into pg_subtrans. We'd only de-overflow if we abort *all* xids that are missing from the xid cache. And only after marking them as aborted in the clog. If someone concurrently checks for an overflow, and already sees the new (non-overflowed) state, then he'll assume the xid is not running if he hasn't found it in the array. Which is correct - we just aborted it. Plus, removing the exclusive lock doesn't depend on de-overflowing. It's just something that seems rather easy to do once the subxid handling is in a state that allows concurrent removal of entries. If it turns out that it's not that easy, then I'll just drop the idea again. I still don't believe you can avoid taking exclusive lock, either; your argument here did not address latestCompletedXid. Sorry, not addressing latestCompletedXid was an oversight :-(. My point is that we only *need* to advance latestCompletedXid on COMMITS. We do so for aborts only to avoid running with unnecessarily low xmins after a transaction ABORT. That corner case can only happen after a toplevel ABORT, though - aborting subxacts cannot change the xmin, because the toplevel xact will have a lower xid than any of its subtransactions anyway. We can therefore just remember the largest assigned xid for a given transaction, and update latestCompletedXid to that on toplevel commit or abort. That prevents that corner-case too, without updating latestCompletedXid during subxact abort. 
But the main point remains this: there is no evidence whatsoever that these code paths are sufficiently performance-critical to be worth speeding up by making the code more fragile. The gain will be less than that of the locking improvements done so far. It will be a win for heavy users of plpgsql BEGIN/END/EXCEPTION blocks, though, I think. We'll also save some cycles in TransactionIdIsInProgress, because we can use a binary search, but that's just an added bonus. I'm currently trying to code up a patch, since it's easier to judge the correctness of actual code than that of a mere proposal. I'll do some benchmarking when the patch is done to see if it brings measurable benefits. greetings, Florian Pflug ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
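The binary-search bonus Florian mentions is easy to sketch (illustrative Python, not the proc-array code): because xids are assigned in ascending order within a transaction, the subxid cache is naturally sorted, and a membership test drops from a linear scan to O(log n).

```python
from bisect import bisect_left

def subxid_in_cache(cache, xid):
    """Membership test over an ascending-ordered xid cache via binary search."""
    i = bisect_left(cache, xid)
    return i < len(cache) and cache[i] == xid

# Hypothetical cache contents: subxids in the order they were assigned,
# which is already ascending.
cache = [1001, 1004, 1005, 1042]
assert subxid_in_cache(cache, 1005)
assert not subxid_in_cache(cache, 1006)
```

The scan this replaces touches every entry; with PGPROC_MAX_CACHED_SUBXIDS entries per backend the constant factor is small either way, which is Tom's point about the speedup not justifying extra fragility.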
Re: [HACKERS] Are we done with sync-commit-defaults-to-off patch?
Tom Lane wrote: Alvaro Herrera <[EMAIL PROTECTED]> writes: Maybe we should lower the autovac naptime too, just to make it do some more stuff (and to see if it breaks something else just because of being running). Well, Andrew has committed the pg_regress extension to allow buildfarm animals to set nondefault GUC values. So I think it'd be sufficient if a buildfarm owner or two would volunteer to run with small autovac naptime. What settings do you have in mind for a testing regime? If anyone wants help on setting this up for a buildfarm member, please let me know. cheers andrew ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Maybe some more low-hanging fruit in the latestCompletedXid patch.
"Florian G. Pflug" <[EMAIL PROTECTED]> writes: > Currently, we do not assume that either the childXids array, nor > the xid cache in the proc array are sorted by ascending xid order. > I believe that we could simplify the code, further reduce the locking > requirements, and enable a transaction to de-overflow its xid cache > if we assume that those arrays are in ascending xid order. "de-overflowing" the cache sounds completely unsafe, as other backends need that state to determine whether they need to look into pg_subtrans. I still don't believe you can avoid taking exclusive lock, either; your argument here did not address latestCompletedXid. But the main point remains this: there is no evidence whatsoever that these code paths are sufficiently performance-critical to be worth speeding up by making the code more fragile. regards, tom lane ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] invalidly encoded strings
Tom Lane wrote: OK. Looking back, there was also some mention of changing chr's argument to bigint, but I'd counsel against doing that. We should not need it since we only support 4-byte UTF8, hence code points only up to 21 bits (and indeed even 6-byte UTF8 can only have 31-bit code points, no?). Yes, that was a thinko on my part. I realised that yesterday. cheers andrew ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
[HACKERS] Ts_rank internals
Hi, I tried to understand how ts_rank works, but I failed. What does Cover function do? How does it work? What is the DocRepresentation data structure like? I can see the definition of the struct, and the get_docrep function to convert to that format, but by reading those I can't figure out what the resulting DocRepresentation looks like. I wonder if we could get rid of the istrue flag in QueryOperand, and use a local BitmapSet variable instead? It seems wrong to have a temporary flag that's only used in one function, in a struct that's used everywhere. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] integrated tsearch doesn't work with non utf8 database
Teodor Sigaev <[EMAIL PROTECTED]> writes: >> I think Teodor's solution is wrong as it stands, because if the subquery >> finds matches for mapcfg and maptokentype, but none of those rows >> produce a non-null ts_lexize result, it will instead emit one row with a >> null result, which is not what should happen. > But concatenation with NULL will have result NULL, so "Lexized token" column > will have NULL as supposed. Ah, you're right --- objection withdrawn. regards, tom lane ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] invalidly encoded strings
Andrew Dunstan <[EMAIL PROTECTED]> writes: > Tom Lane wrote: >> BTW, I'm sure this was discussed but I forgot the conclusion: should >> chr(0) throw an error? > I think it should, yes. OK. Looking back, there was also some mention of changing chr's argument to bigint, but I'd counsel against doing that. We should not need it since we only support 4-byte UTF8, hence code points only up to 21 bits (and indeed even 6-byte UTF8 can only have 31-bit code points, no?). If Tatsuo can find official code-point mappings for any other MB encodings, +1 on supporting those too. regards, tom lane ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
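Tom's bit counts can be checked mechanically (a small illustrative calculation, not PostgreSQL code): an n-byte UTF-8 sequence carries 7 - n payload bits in the lead byte (for n >= 2) plus 6 bits per continuation byte.

```python
def utf8_payload_bits(n):
    """Payload bits carried by an n-byte UTF-8 sequence."""
    if n == 1:
        return 7                    # 0xxxxxxx
    return (7 - n) + 6 * (n - 1)    # lead byte bits + 6 per continuation byte

assert utf8_payload_bits(1) == 7
assert utf8_payload_bits(4) == 21   # 4-byte UTF-8: code points need <= 21 bits
assert utf8_payload_bits(6) == 31   # even 6-byte UTF-8 tops out at 31 bits
```

So every code point chr() could ever accept fits in a plain 32-bit integer, which is the argument against widening the parameter to bigint.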
Re: [HACKERS] integrated tsearch doesn't work with non utf8 database
I think Teodor's solution is wrong as it stands, because if the subquery finds matches for mapcfg and maptokentype, but none of those rows produce a non-null ts_lexize result, it will instead emit one row with a null result, which is not what should happen. But concatenation with NULL will have a NULL result, so the "Lexized token" column will have NULL, as intended. -- Teodor Sigaev E-mail: [EMAIL PROTECTED] WWW: http://www.sigaev.ru/ ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] integrated tsearch doesn't work with non utf8 database
"Heikki Linnakangas" <[EMAIL PROTECTED]> writes: > Tom Lane wrote: >> Uh, how will that help? AFAICS it still has to call ts_lexize with >> every dictionary. > No, ts_lexize is no longer in the seq scan filter, but in the sort key > that's calculated only for those rows that match the filter 'mapcfg=? > AND maptokentype=?'. It is pretty kludgey, though. Oh, I see: in the original formulation the planner can push the "WHERE dl.lex IS NOT NULL" clause down into the sorted subquery, and what with one thing and another that clause ends up getting evaluated first in the scan filter condition. We could prevent that by increasing ts_lexize's procost from 1 to 2 (or so), which might be a good thing anyway since I suppose it's not especially cheap. It's still a klugy solution though. I think Teodor's solution is wrong as it stands, because if the subquery finds matches for mapcfg and maptokentype, but none of those rows produce a non-null ts_lexize result, it will instead emit one row with a null result, which is not what should happen. regards, tom lane ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] invalidly encoded strings
Tom Lane wrote: BTW, I'm sure this was discussed but I forgot the conclusion: should chr(0) throw an error? If we're trying to get rid of embedded-null problems, seems it must. I think it should, yes. cheers andrew ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] invalidly encoded strings
Tatsuo Ishii wrote: I don't understand whole discussion. Why do you think that employing the Unicode code point as the chr() argument could avoid endianness issues? Are you going to represent Unicode code point as UCS-4? Then you have to specify the endianness anyway. (see the UCS-4 standard for more details) The code point is simply a number. The result of chr() will be a text value one char (not one byte) wide, in the relevant database encoding. U+ maps to the same Unicode char and hence the same UTF8 encoding pattern regardless of endianness. e.g. U+00a9 is the copyright symbol on all machines. So to get this char in a UTF8 database you could call "select chr(169)" and get back the byte pattern \xC2A9. Or are you going to represent Unicode point as a character string such as 'U+0259'? Then representing any encoding as a string could avoid endianness issues anyway, and I don't see Unicode code point is any better than others. The argument will be a number, as now. Also I'd like to point out all encodings has its own code point systems as far as I know. For example, EUC-JP has its corresponding code point systems, ASCII, JIS X 0208 and JIS X 0212. So I don't see we can't use "code point" as chr()'s argument for othe encodings(of course we need optional parameter specifying which character set is supposed). Where can I find the tables that map code points (as opposed to encodings) to characters for these others? cheers andrew ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] invalidly encoded strings
On Tue, Sep 11, 2007 at 12:30:51AM +0900, Tatsuo Ishii wrote: > Why do you think that employing the Unicode code point as the chr() > argument could avoid endianness issues? Are you going to represent > Unicode code point as UCS-4? Then you have to specify the endianness > anyway. (see the UCS-4 standard for more details) Because the argument to chr() is an integer, which has no endian-ness. You only get into endian-ness if you look at how you store the resulting string. > Also I'd like to point out all encodings has its own code point > systems as far as I know. For example, EUC-JP has its corresponding > code point systems, ASCII, JIS X 0208 and JIS X 0212. So I don't see > we can't use "code point" as chr()'s argument for othe encodings(of > course we need optional parameter specifying which character set is > supposed). Oh, the last discussion on this didn't answer this question. Is there a standard somewhere that maps integers to characters in EUC-JP? If so, how can I find out what character 512 is? Have a nice day, -- Martijn van Oosterhout <[EMAIL PROTECTED]> http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to > litigate.
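Martijn's point — an integer argument carries no byte order; endianness only appears when a value is serialized to bytes — can be illustrated like this (Python, for illustration only):

```python
import struct

value = 169                          # the chr() argument: just a number
big = struct.pack(">I", value)       # byte order appears only on serialization
little = struct.pack("<I", value)
assert big == b"\x00\x00\x00\xa9"
assert little == b"\xa9\x00\x00\x00"

# Both byte strings denote the same integer once deserialized with the
# matching byte order, so the abstract value is endianness-free.
assert struct.unpack(">I", big)[0] == struct.unpack("<I", little)[0] == 169
```

This is why passing a code point as an integer sidesteps the UCS-4 endianness question: the wire and storage representations are a separate concern from the value itself.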
Re: [HACKERS] invalidly encoded strings
BTW, I'm sure this was discussed but I forgot the conclusion: should chr(0) throw an error? If we're trying to get rid of embedded-null problems, seems it must. regards, tom lane ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Include Lists for Text Search
On Mon, 2007-09-10 at 10:21 -0400, Tom Lane wrote: > Oleg Bartunov <[EMAIL PROTECTED]> writes: > > On Mon, 10 Sep 2007, Simon Riggs wrote: > >> Can we include that functionality now? > > > This could be realized very easyly using dict_strict, which returns > > only known words, and mapping contains only this dictionary. So, > > feel free to write it and submit. > > ... for 8.4. I've coded a small patch to allow CaseSensitive synonyms.

CREATE TEXT SEARCH DICTIONARY my_diction (
    TEMPLATE = biglist,
    DictFile = words,
    CaseSensitive = true
);

-- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com

Index: src/backend/tsearch/dict_synonym.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/tsearch/dict_synonym.c,v
retrieving revision 1.4
diff -c -r1.4 dict_synonym.c
*** src/backend/tsearch/dict_synonym.c	25 Aug 2007 02:29:45 -	1.4
--- src/backend/tsearch/dict_synonym.c	10 Sep 2007 15:14:21 -
***************
*** 29,34 ****
--- 29,35 ----
  typedef struct
  {
  	int			len;			/* length of syn array */
+ 	bool		case_sensitive;
  	Syn		   *syn;
  } DictSyn;
***************
*** 83,88 ****
--- 84,90 ----
  			   *end = NULL;
  	int			cur = 0;
  	char	   *line = NULL;
+ 	bool		case_sensitive = false;
  
  	foreach(l, dictoptions)
  	{
***************
*** 95,100 ****
--- 97,107 ----
  					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
  					 errmsg("unrecognized synonym parameter: \"%s\"",
  							defel->defname)));
+ 
+ 		if (pg_strcasecmp("CaseSensitive", defel->defname) == 0 &&
+ 			pg_strcasecmp("True", defGetString(defel)) == 0)
+ 			case_sensitive = true;
+ 
  	}
  
  	if (!filename)
***************
*** 168,173 ****
--- 175,182 ----
  	d->len = cur;
  	qsort(d->syn, d->len, sizeof(Syn), compareSyn);
  
+ 	d->case_sensitive = case_sensitive;
+ 
  	PG_RETURN_POINTER(d);
  }
***************
*** 180,195 ****
  	Syn			key,
  			   *found;
  	TSLexeme   *res;
  
  	/* note: d->len test protects against Solaris bsearch-of-no-items bug */
  	if (len <= 0 || d->len <= 0)
  		PG_RETURN_POINTER(NULL);
  
! 	key.in = lowerstr_with_len(in, len);
  	key.out = NULL;
  
  	found = (Syn *) bsearch(&key, d->syn, d->len, sizeof(Syn), compareSyn);
! 	pfree(key.in);
  
  	if (!found)
  		PG_RETURN_POINTER(NULL);
--- 189,214 ----
  	Syn			key,
  			   *found;
  	TSLexeme   *res;
+ 	bool		need_pfree = false;
  
  	/* note: d->len test protects against Solaris bsearch-of-no-items bug */
  	if (len <= 0 || d->len <= 0)
  		PG_RETURN_POINTER(NULL);
  
! 	if (d->case_sensitive)
! 		key.in = in;
! 	else
! 	{
! 		key.in = lowerstr_with_len(in, len);
! 		need_pfree = true;
! 	}
  	key.out = NULL;
  
  	found = (Syn *) bsearch(&key, d->syn, d->len, sizeof(Syn), compareSyn);
! 
! 	if (need_pfree)
! 		pfree(key.in);
  
  	if (!found)
  		PG_RETURN_POINTER(NULL);

---(end of broadcast)--- TIP 6: explain analyze is your friend
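The behavior the patch adds can be sketched in a few lines (illustrative Python, not the C patch itself): fold the lookup key to lower case only when the dictionary was created without CaseSensitive = true, then binary-search the sorted synonym list.

```python
from bisect import bisect_left

def make_synonym_dict(pairs, case_sensitive=False):
    """Build a sorted (word, synonym) list, folding case unless case-sensitive."""
    fold = (lambda s: s) if case_sensitive else str.lower
    syn = sorted((fold(word), synonym) for word, synonym in pairs)
    return syn, fold

def lookup(dictionary, word):
    syn, fold = dictionary
    key = fold(word)
    i = bisect_left(syn, (key,))        # mirrors the bsearch over d->syn
    if i < len(syn) and syn[i][0] == key:
        return syn[i][1]
    return None                          # token not in the synonym list

d_ci = make_synonym_dict([("Paris", "paris_city")])                      # default
d_cs = make_synonym_dict([("Paris", "paris_city")], case_sensitive=True)

assert lookup(d_ci, "PARIS") == "paris_city"   # case-insensitive match
assert lookup(d_cs, "PARIS") is None           # exact case required
assert lookup(d_cs, "Paris") == "paris_city"
```

Note that in the case-sensitive mode the dictionary file entries must already be stored in their exact case, which is why the C patch skips lowerstr_with_len() entirely rather than folding both sides.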
Re: [HACKERS] invalidly encoded strings
> Andrew Dunstan <[EMAIL PROTECTED]> writes: > > The reason we are prepared to make an exception for Unicode is precisely > > because the code point maps to an encoding pattern independently of > > architecture, ISTM. > > Right --- there is a well-defined standard for the numerical value of > each character in Unicode. And it's also clear what to do in > single-byte encodings. It's not at all clear what the representation > ought to be for other multibyte encodings. A direct transliteration > of the byte sequence not only has endianness issues, but will have > a weird non-dense set of valid values because of the restrictions on > valid multibyte characters. > > Given that chr() has never before behaved sanely for multibyte values at > all, extending it to Unicode code points is a reasonable extension, > and throwing error for other encodings is reasonable too. If we ever do > come across code-point standards for other encodings we can adopt 'em at > that time. I don't understand the whole discussion. Why do you think that employing the Unicode code point as the chr() argument could avoid endianness issues? Are you going to represent the Unicode code point as UCS-4? Then you have to specify the endianness anyway. (see the UCS-4 standard for more details) Or are you going to represent the Unicode point as a character string such as 'U+0259'? Then representing any encoding as a string could avoid endianness issues anyway, and I don't see that the Unicode code point is any better than the others. Also I'd like to point out that all encodings have their own code point systems as far as I know. For example, EUC-JP has its corresponding code point systems: ASCII, JIS X 0208 and JIS X 0212. So I don't see why we can't use "code point" as chr()'s argument for other encodings (of course we need an optional parameter specifying which character set is intended). -- Tatsuo Ishii SRA OSS, Inc. Japan ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Hash index todo list item
On Sat, Sep 08, 2007 at 06:56:23PM -0400, Mark Mielke wrote: > I think that if the case of >1 entry per hash becomes common enough to > be significant, and the key is stored in the hash, that a btree will > perform equal or better, and there is no point in pursuing such a hash > index model. This is where we are today. There is the point that if a user does an UPDATE of a row without changing the key, your index will have to store entries with the same hash. If your goal is mostly write-once tables, then that's cool, but otherwise you're going to have to find a way of dealing with that. It's not just going to happen a lot, it is going to be common, even for unique indexes. Presumably HOT will help with this, but that relies on all index columns not to change. The major benefits will mostly come from not storing the key at all, I think. Have a nice day, -- Martijn van Oosterhout <[EMAIL PROTECTED]> http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to > litigate.
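The trade-off in this subthread — storing only the hash code makes the index lossy, so every lookup must recheck the heap tuple, and same-key updates accumulate duplicate hash entries — can be sketched as follows (an illustrative Python toy, not the proposed on-disk layout):

```python
def hash_code(key):
    """Stand-in for the index's fixed-width hash of the key."""
    return hash(key) & 0xFFFFFFFF

class LossyHashIndex:
    """Index stores only (hash code -> heap tids); the key lives in the heap."""
    def __init__(self):
        self.entries = {}

    def insert(self, heap, tid):
        # An UPDATE that keeps the key inserts another tid under the same hash.
        self.entries.setdefault(hash_code(heap[tid]), []).append(tid)

    def lookup(self, heap, key):
        # Lossy: matching hash codes still require a recheck against the heap,
        # both for hash collisions and for duplicate-key entries.
        return [tid for tid in self.entries.get(hash_code(key), [])
                if heap[tid] == key]

heap = {1: "apple", 2: "banana"}
idx = LossyHashIndex()
for tid in heap:
    idx.insert(heap, tid)
assert idx.lookup(heap, "apple") == [1]
assert idx.lookup(heap, "cherry") == []
```

Dropping the key from the index is what shrinks it and keeps entries fixed-width, but it is also exactly why the duplicate-entry problem Martijn raises cannot be resolved inside the index alone.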
Re: [HACKERS] A Silly Idea for Vertically-Oriented Databases
Mark Mielke wrote: > Simon Riggs wrote: >> ISTM we would be able to do this fairly well if we implemented >> Index-only columns. i.e. columns that don't exist in the heap, only in >> an index. >> Taken to the extreme, all columns could be removed from the heap and >> placed in an index(es). Only the visibility information would remain on >> the heap. >> > Wouldn't the extreme be - even the visibility information is in the index? Maybe we could extract the visibility information and put it in a storage separate from the heap. A seqscan would be a bit slower (have to check the heap and the visibility storage) but some index scans could be faster (can check the index and visibility storage, skipping the heap), and updates would be faster too (append into the heap, update visibility storage). This would allow something closer to index-only scans. Of course, the key is to allow efficient lookup of visibility info given a TID. Has this been explored before? -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] integrated tsearch doesn't work with non utf8 database
Tom Lane wrote: > Teodor Sigaev <[EMAIL PROTECTED]> writes: >>> Note the Seq Scan on pg_ts_config_map, with filter on ts_lexize(mapdict, >>> $1). That means that it will call ts_lexize on every dictionary, which >>> will try to load every dictionary. And loading danish_stem dictionary >>> fails in latin2 encoding, because of the problem with the stopword file. > >> Attached patch should fix it, I hope. > > Uh, how will that help? AFAICS it still has to call ts_lexize with > every dictionary. No, ts_lexize is no longer in the seq scan filter, but in the sort key that's calculated only for those rows that match the filter 'mapcfg=? AND maptokentype=?'. It is pretty kludgey, though. The planner could choose another plan, that fails, if the statistics were different. Rewriting the function in C would be a more robust fix. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] A Silly Idea for Vertically-Oriented Databases
Simon Riggs wrote: ISTM we would be able to do this fairly well if we implemented Index-only columns. i.e. columns that don't exist in the heap, only in an index. Taken to the extreme, all columns could be removed from the heap and placed in an index(es). Only the visibility information would remain on the heap. Wouldn't the extreme be - even the visibility information is in the index? :-) Cheers, mark -- Mark Mielke <[EMAIL PROTECTED]> ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] integrated tsearch doesn't work with non utf8 database
Teodor Sigaev <[EMAIL PROTECTED]> writes: >> Note the Seq Scan on pg_ts_config_map, with filter on ts_lexize(mapdict, >> $1). That means that it will call ts_lexize on every dictionary, which >> will try to load every dictionary. And loading danish_stem dictionary >> fails in latin2 encoding, because of the problem with the stopword file. > Attached patch should fix it, I hope. Uh, how will that help? AFAICS it still has to call ts_lexize with every dictionary. regards, tom lane ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Hash index todo list item
Kenneth Marshall wrote: On Sun, Sep 02, 2007 at 10:41:22PM -0400, Tom Lane wrote: Kenneth Marshall <[EMAIL PROTECTED]> writes: ... This is the rough plan. Does anyone see anything critical that is missing at this point? Sounds pretty good. Let me brain-dump one item on you: one thing that hash currently has over btree is the ability to handle index items up to a full page. Now, if you go with a scheme that only stores hash codes and not the underlying data, you can not only handle that but improve on it; but if you reduce the bucket size and don't remove the data, it'd be a step backward. The idea I had about dealing with that was to only reduce the size of primary buckets --- if it's necessary to add overflow space to a bucket, the overflow units are still full pages. So an index tuple up to a page in size can always be accommodated by adding an overflow page to the bucket. Just a thought, but AFAIR it's not in the archives anywhere. regards, tom lane I was thinking about this some more, and it strikes me that we can keep page size = bucket size = overflow size in the new scheme of storing the hash value in the index and still reduce the effective bucket size. Let's make the new size (page size / 2^n) where n is chosen appropriately. Then when we store the value in the page we simply use n bits of the hash to determine which sub-piece of the page to use to actually store the value. We may need to add a multiplier to adjust the decision to split the page based on the mini-page. This should allow us to much more densely pack the pages and approach 1 item per bucket. This could easily shrink the size of the index by a factor of 2^n. Let me know what you think. My personal opinion is that something like this is required to take best advantage of hashes. I didn't respond immediately because I don't have advice on what the best implementation would look like. 
I have also been wondering about the effect of a hash index that includes multiple values for the same key (either a non-unique key, or different tuples from the same key due to table updates). I started by thinking that the maximum number of hash entries per hash bucket should be kept to 2 or 3 to prevent reduction in performance to that of btree, but I think this doesn't work if multiple tuples can have the same key. Unless the map is hash value =1:1> index page =1:1> hash bucket =1:N> hash value =1:M=> tuples. Guarantee that N is small (either <= 2 or <= 4 depending on performance evaluation) by re-hashing if N ever becomes > 2 or > 4. Choose a sparse hash. Let 1:M be indefinite? Also, optimize the 1:M=1:1 case, such that the value can be inline? For most cases, I would think the above model would make it cheap to determine if the key does not exist, as well as the 1:1 (hash value:key) case requiring a single page lookup. Update performance would suffer, but lookup should be faster than btree in these cases, as btree often requires > 1 index page scan. Cheers, mark -- Mark Mielke <[EMAIL PROTECTED]>
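Kenneth's mini-page proposal above can be sketched roughly as follows; the constants and the `locate` function are hypothetical, chosen only to illustrate the arithmetic of splitting a page into 2^n sub-buckets:

```python
# Sketch of the "mini-page" scheme: each physical page is split into
# 2^n sub-buckets, and n further bits of the hash value pick the
# sub-bucket within the page. Sizes here are illustrative only.
PAGE_SLOTS = 64        # hypothetical hash-code slots per physical page
N_BITS = 2             # split each page into 2^N_BITS mini-buckets

MINI_BUCKETS = 1 << N_BITS
SLOTS_PER_MINI = PAGE_SLOTS // MINI_BUCKETS

def locate(hash_value, n_pages):
    """Map a hash value to (page, mini_bucket)."""
    page = hash_value % n_pages                      # usual bucket selection
    mini = (hash_value // n_pages) % MINI_BUCKETS    # reuse further hash bits
    return page, mini

# With the page size unchanged, the effective bucket size shrinks by 2^n,
# packing entries more densely toward ~1 item per (mini) bucket.
```

The split-decision multiplier Kenneth mentions would sit on top of this: the decision to split would be driven by mini-bucket fill rather than whole-page fill.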
Re: [HACKERS] Hash index todo list item
More random thoughts: - Hash-Indices are best for unique keys, but every table needs a new hash key, which means one more random page access. Is there any way to build multi-_table_ indices? A join might then fetch all table rows with a given unique key after one page fetch for the combined index. - Hashes with trivial hash-functions (like identity) can also return rows in a desired order. - Is there a case where a sequentially scanning a hash-index is useful? I can't find any, but maybe somebody else has a use-case. - What about HashJoins when the referenced tables have hash-indices? - What about hash-indices where entries are inserted for multiple columns. Assume a table like this: CREATE TABLE link (obj_id1 INT4, obj_id2 INT4); and a query like SELECT * FROM link WHERE ? IN (obj_id1, obj_id2); or some join using a similar condition. It might be a nice thing to insert entries at both HASH(obj_id1) and HASH(obj_id2), which would eliminate the need to check in two indices and do a bitmap OR. OTOH it might not be faster in any significant use cases because who'd need a link table with nearly unique linked objects? - In cases where the distribution of the hash-function is good, but a small and relatively even number of rows exist for each key (like it might be the case in the above example), it might be nice to reserve a given amount of same-key row entries in each bucket, and hold a fill-count at the front of it. That would avoid costly page fetches after each collision. You'd create a hash-index with n-buckets, each m-elements large. When the bucket is full, the usual collision handling continues. - About hash enlargement: What about always using only the first k bits of each hash value. When you find that the hash is "quite-full" (however that is defined and detected), k is increased by one, effectively doubling the hash size. 
New entries are then written as usual, while retrieving old entries requires testing at the k-bit position first and, on a miss, at the (k-1)-bit position and so forth. To limit this search, some background process could, after analyzing the index, move old entries to the now-correct k-bit position and increment some "min-k" value once all old entries have been moved. After the hash has been enlarged, the index would run at roughly half speed for some time. Additionally, one could insert an entry at its new position whenever it is found at an old position during normal index use. A special "miss" entry at the new position doesn't help when nothing could be found, because the old positions will usually hold some data that resides there even when k bits are used. ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
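The growth scheme in the message above (address buckets by the first k bits of the hash, double the table by incrementing k, probe old positions until entries are migrated) can be sketched like this; the class and method names are invented for the illustration:

```python
# Sketch of k-bit hash growth with fallback probing. Buckets are a dict
# here standing in for physical pages; entries store the full hash so a
# probe can reject bucket-number collisions across different k values.
class GrowableHash:
    def __init__(self, k=2):
        self.k = k
        self.buckets = {}          # bucket number -> list of (hash, value)

    def _bucket(self, h, bits):
        return h & ((1 << bits) - 1)   # "first k bits" of the hash

    def insert(self, h, value):
        self.buckets.setdefault(self._bucket(h, self.k), []).append((h, value))

    def grow(self):
        # doubling the table is just k += 1; old entries stay put until a
        # background process (not modeled here) migrates them
        self.k += 1

    def lookup(self, h):
        # probe at k bits, then fall back toward older, smaller k values
        for bits in range(self.k, 0, -1):
            for h2, v in self.buckets.get(self._bucket(h, bits), []):
                if h2 == h:        # full-hash comparison guards correctness
                    return v
        return None
```

The slowdown described above is visible in `lookup`: right after `grow`, most hits come from the fallback probes, and the extra probes disappear only once migration raises the effective "min-k".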
Re: [HACKERS] Include Lists for Text Search
On Mon, 10 Sep 2007, Tom Lane wrote: Oleg Bartunov <[EMAIL PROTECTED]> writes: On Mon, 10 Sep 2007, Simon Riggs wrote: Can we include that functionality now? This could be realized very easyly using dict_strict, which returns only known words, and mapping contains only this dictionary. So, feel free to write it and submit. ... for 8.4. It can be just a contrib module. There are several useful dictionaries we need to port. Regards, Oleg _ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83 ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] invalidly encoded strings
Tom Lane wrote: Andrew Dunstan <[EMAIL PROTECTED]> writes: Perhaps we're talking at cross purposes. The problem with doing encoding validation in scan.l is that it lacks context. Null bytes are only the tip of the bytea iceberg, since any arbitrary sequence of bytes can be valid for a bytea. If you think that, then we're definitely talking at cross purposes. I assert we should require the post-scanning value of a string literal to be valid in the database encoding. If you want to produce an arbitrary byte sequence within a bytea value, the way to get there is for the bytea input function to do the de-escaping, not for the string literal parser to do it. [looks again ... thinks ...] Ok, I'm sold. The only reason I was considering not doing it in scan.l is that scan.l's behavior ideally shouldn't depend on any changeable variables. But until there's some prospect of database_encoding actually being mutable at run time, there's not much point in investing a lot of sweat on that either. Agreed. This would just be one item of many to change if/when we ever come to that. Instead, we have to mess with an unknown number of UDTs ... We're going to have that danger anyway, aren't we, unless we check the encoding validity of the result of every UDF that returns some text type? I'm not going to lose sleep over something I can't cure but the user can - what concerns me is that our own code ensures data integrity, including for encoding. Anyway, it looks to me like we have the following items to do:
. add validity checking to the scanner
. fix COPY code
. fix chr()
. improve efficiency of validity checks, at least for UTF8
cheers andrew ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Include Lists for Text Search
Oleg Bartunov <[EMAIL PROTECTED]> writes: > On Mon, 10 Sep 2007, Simon Riggs wrote: >> Can we include that functionality now? > This could be realized very easyly using dict_strict, which returns > only known words, and mapping contains only this dictionary. So, > feel free to write it and submit. ... for 8.4. regards, tom lane ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] invalidly encoded strings
Andrew Dunstan <[EMAIL PROTECTED]> writes: > Tom Lane wrote: >> Those should be checked already --- if not, the right fix is still to >> fix it there, not in per-datatype code. I think we are OK though, >> eg see "need_transcoding" logic in copy.c. > Well, a little experimentation shows that we currently are not OK: > in foo.data: > \366\66 I looked at this and found that the problem is that CopyReadAttributesText() does backslash conversions and doesn't validate the result. My proposal at this point is that both scan.l and CopyReadAttributesText need to guarantee that the result of de-escaping is valid in the database encoding: they should reject \0 and check validity of multibyte characters. You could optimize this fairly well (important in the COPY case) by not redoing verifymbstr unless you'd seen at least one numerical backslash-escape that set the high bit. regards, tom lane ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
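Tom's proposal — de-escaping must reject \0 and re-verify encoding validity, but the verification can be skipped unless a numeric backslash-escape set the high bit — can be sketched in Python. This models only the octal escapes relevant to the discussion, with UTF-8 standing in for the database encoding; it is not the actual scan.l or CopyReadAttributesText() code:

```python
import re

OCTAL = re.compile(rb"[0-7]{1,3}")

def de_escape(raw: bytes) -> bytes:
    """De-escape \\ooo sequences, then enforce the validity rule."""
    out = bytearray()
    saw_highbit_escape = False
    i = 0
    while i < len(raw):
        m = OCTAL.match(raw, i + 1) if raw[i:i + 1] == b"\\" else None
        if m:
            byte = int(m.group(), 8) & 0xFF   # mask: \777 would exceed a byte
            if byte == 0:
                raise ValueError(r"\0 is not allowed in string literals")
            if byte >= 0x80:
                saw_highbit_escape = True     # may have broken the encoding
            out.append(byte)
            i = m.end()
        else:
            out.append(raw[i])
            i += 1
    if saw_highbit_escape:
        out.decode("utf-8")   # raises UnicodeDecodeError if invalid
    return bytes(out)
```

On the thread's own example, `de_escape(rb'\366\66')` de-escapes to the bytes 0xF6 0x36, which fail the UTF-8 check, while pure-ASCII input never pays for re-verification.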
Re: [HACKERS] invalidly encoded strings
Andrew Dunstan <[EMAIL PROTECTED]> writes: > The reason we are prepared to make an exception for Unicode is precisely > because the code point maps to an encoding pattern independently of > architecture, ISTM. Right --- there is a well-defined standard for the numerical value of each character in Unicode. And it's also clear what to do in single-byte encodings. It's not at all clear what the representation ought to be for other multibyte encodings. A direct transliteration of the byte sequence not only has endianness issues, but will have a weird non-dense set of valid values because of the restrictions on valid multibyte characters. Given that chr() has never before behaved sanely for multibyte values at all, extending it to Unicode code points is a reasonable extension, and throwing error for other encodings is reasonable too. If we ever do come across code-point standards for other encodings we can adopt 'em at that time. regards, tom lane ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
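Under the behaviour Tom describes, chr()'s argument in a UTF-8 database is a Unicode code point, whose encoded form is dense and architecture-independent. A rough Python model (`pg_chr` is an invented name, and the exact range checks are an assumption):

```python
# chr() on a Unicode code point: the code-point space is dense, and the
# resulting UTF-8 byte sequence is the same on every architecture, unlike
# a direct transliteration of some other multibyte encoding's byte values.
def pg_chr(code_point: int) -> bytes:
    if not 0 < code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("requested character not valid")
    return chr(code_point).encode("utf-8")
```

For example, `pg_chr(0x00E9)` yields the two-byte UTF-8 sequence for é on any platform, with no endianness issues and no "weird non-dense set of valid values".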
Re: [HACKERS] Include Lists for Text Search
On Mon, 10 Sep 2007, Simon Riggs wrote: On Mon, 2007-09-10 at 16:35 +0400, Oleg Bartunov wrote: On Mon, 10 Sep 2007, Simon Riggs wrote: On Mon, 2007-09-10 at 16:10 +0400, Oleg Bartunov wrote: On Mon, 10 Sep 2007, Simon Riggs wrote: It seems possible to write your own functions to support various possibilities with text search. One of the more common thoughts is to have a list of words that you would like to include, i.e. the opposite of a stop word list. There are clear indications that indexing too many words is a problem for both GIN and GIST. If people already know what they'll be looking for and what they will never be looking for, it seems easier to supply that list up front, rather than hide it behind lots of hand-crafted code. Can we include that functionality now? This could be realized very easyly using dict_strict, which returns only known words, and mapping contains only this dictionary. So, feel free to write it and submit. So there isn't one yet, but you think it will be easy to write and that we should call it dict_strict? we have dict_synonym already and if your list is not big you'll be happy. So I need to do something like CREATE TEXT SEARCH DICTIONARY my_diction ( template = snowball, synonym = include_only_these_words ); which will then look for a file called include_only_these_words.syn? I would prefer to be able to do something like this CREATE TEXT SEARCH DICTIONARY my_diction ( template = snowball, include = justthese ); ...which makes more sense to anyone reading it and I also want to make the comparison case insensitive. Would it be better to 1. include a new dictionary file (dict_strict, as you suggest) 2. a) allow case sensitivity as another option in dictionaries b) allow "include" as another word for "stoplist", but with the meaning reversed? e.g. 
CREATE TEXT SEARCH DICTIONARY my_diction ( template = snowball, include = justthese, case_sensitive = true ); No, you need to write new template, which efficiently works with big lists and support case insensitive comparison. CREATE TEXT SEARCH TEMPLATE biglist ( . ); CREATE TEXT SEARCH DICTIONARY my_diction ( TEMPLATE = biglist, DictFile = words, case_sensitive = true ); Regards, Oleg _ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83 ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Include Lists for Text Search
On Mon, 10 Sep 2007, Simon Riggs wrote: On Mon, 2007-09-10 at 16:48 +0400, Teodor Sigaev wrote: There are clear indications that indexing too many words is a problem for both GIN and GIST. If people already know what they'll be looking GIN is great, sorry if that sounded negative. GIN doesn't depend strongly on number of words. It has log(N) behaviour for numbers of words because of using B-Tree over words. log(N) in the number of distinct words, but every word you index results in an index insert, so if we index more words than we need then the insert rate will go down. yes, there is room to improve support of very long posting lists Regards, Oleg _ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83 ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] invalidly encoded strings
Andrew Dunstan <[EMAIL PROTECTED]> writes: > Perhaps we're talking at cross purposes. > The problem with doing encoding validation in scan.l is that it lacks > context. Null bytes are only the tip of the bytea iceberg, since any > arbitrary sequence of bytes can be valid for a bytea. If you think that, then we're definitely talking at cross purposes. I assert we should require the post-scanning value of a string literal to be valid in the database encoding. If you want to produce an arbitrary byte sequence within a bytea value, the way to get there is for the bytea input function to do the de-escaping, not for the string literal parser to do it. The current situation where there is overlapping functionality is a bit unfortunate, but once standard_conforming_strings is on by default, it'll get a lot easier to work with. I'm not eager to contort APIs throughout the backend in order to produce a more usable solution for the standard_conforming_strings = off case, given the expected limited lifespan of that usage. The only reason I was considering not doing it in scan.l is that scan.l's behavior ideally shouldn't depend on any changeable variables. But until there's some prospect of database_encoding actually being mutable at run time, there's not much point in investing a lot of sweat on that either. > I still don't see why it's OK for us to do validation from the foo_recv > functions but not the corresponding foo_in functions. Basically, a CSTRING handed to an input function should already have been encoding-validated by somebody. The most obvious reason why this must be so is the embedded-null problem, but in general it will already have been validated (typically as part of a larger string such as the whole SQL query or whole COPY data line), and doing that over is pointless and expensive. 
On the other hand, the entire point of a recv function is that it gets raw data that no one else in the backend knows the format of; so if the data is to be considered textual, the recv function has to be the one that considers it so and invokes appropriate conversion or validation. The reason backslash escapes in string literals are a problem is that they can produce incorrect-encoding results from what had been a validated string. > At least in the > short term that would provide us with fairly complete protection against > accepting invalidly encoded data into the database, once we fix up > chr(), without having to mess with the scanner, parser, COPY code etc. Instead, we have to mess with an unknown number of UDTs ... regards, tom lane ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] invalidly encoded strings
Albe Laurenz wrote: I'd like to repeat my suggestion for chr() and ascii(). Instead of the code point, I'd prefer the actual encoding of the character as argument to chr() and return value of ascii(). [snip] Of course, if it is generally perceived that the code point is more useful than the encoding, then Oracle compliance is probably secondary. Last time this was discussed, you were the only person arguing for that behaviour, IIRC. And frankly, I don't know how to do it sanely anyway. A character encoding has a fixed byte pattern, but a given byte pattern doesn't have a single universal number value. I really don't think we want to have the value of chr(n) depend on the endianness of the machine, do we? The reason we are prepared to make an exception for Unicode is precisely because the code point maps to an encoding pattern independently of architecture, ISTM. cheers andrew ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Include Lists for Text Search
On Mon, 2007-09-10 at 16:48 +0400, Teodor Sigaev wrote: > > There are clear indications that indexing too many words is a problem > > for both GIN and GIST. If people already know what they'll be looking GIN is great, sorry if that sounded negative. > GIN doesn't depend strongly on number of words. It has log(N) behaviour for > numbers of words because of using B-Tree over words. log(N) in the number of distinct words, but every word you index results in an index insert, so if we index more words than we need then the insert rate will go down. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Include Lists for Text Search
On Mon, 2007-09-10 at 16:35 +0400, Oleg Bartunov wrote: > On Mon, 10 Sep 2007, Simon Riggs wrote: > > > On Mon, 2007-09-10 at 16:10 +0400, Oleg Bartunov wrote: > >> On Mon, 10 Sep 2007, Simon Riggs wrote: > >> > >>> It seems possible to write your own functions to support various > >>> possibilities with text search. > >>> > >>> One of the more common thoughts is to have a list of words that you > >>> would like to include, i.e. the opposite of a stop word list. > >>> > >>> There are clear indications that indexing too many words is a problem > >>> for both GIN and GIST. If people already know what they'll be looking > >>> for and what they will never be looking for, it seems easier to supply > >>> that list up front, rather than hide it behind lots of hand-crafted > >>> code. > >>> > >>> Can we include that functionality now? > >> > >> This could be realized very easyly using dict_strict, which returns > >> only known words, and mapping contains only this dictionary. So, > >> feel free to write it and submit. > > > > So there isn't one yet, but you think it will be easy to write and that > > we should call it dict_strict? > > we have dict_synonym already and if your list is not big you'll be happy. So I need to do something like CREATE TEXT SEARCH DICTIONARY my_diction ( template = snowball, synonym = include_only_these_words ); which will then look for a file called include_only_these_words.syn? I would prefer to be able to do something like this CREATE TEXT SEARCH DICTIONARY my_diction ( template = snowball, include = justthese ); ...which makes more sense to anyone reading it and I also want to make the comparison case insensitive. Would it be better to 1. include a new dictionary file (dict_strict, as you suggest) 2. a) allow case sensitivity as another option in dictionaries b) allow "include" as another word for "stoplist", but with the meaning reversed? e.g. 
CREATE TEXT SEARCH DICTIONARY my_diction ( template = snowball, include = justthese, case_sensitive = true ); -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] ispell dictionary broken in CVS HEAD ?
Tom Lane wrote: > I don't know enough about ispell to > understand what its config files look like. (There's a problem of > missing documentation here, too...) Yeah :(. The file format that ispell accepts is kind of ad hoc. It accepts hunspell and ispell and myspell variants, but only a subset of the full grammar (some stuff is not relevant for tsearch). A description of what exactly it's supposed to accept would be nice. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Include Lists for Text Search
There are clear indications that indexing too many words is a problem for both GIN and GIST. If people already know what they'll be looking GIN doesn't depend strongly on number of words. It has log(N) behaviour for numbers of words because of using B-Tree over words. -- Teodor Sigaev E-mail: [EMAIL PROTECTED] WWW: http://www.sigaev.ru/ ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] Include Lists for Text Search
On Mon, 2007-09-10 at 16:10 +0400, Oleg Bartunov wrote: > On Mon, 10 Sep 2007, Simon Riggs wrote: > > > It seems possible to write your own functions to support various > > possibilities with text search. > > > > One of the more common thoughts is to have a list of words that you > > would like to include, i.e. the opposite of a stop word list. > > > > There are clear indications that indexing too many words is a problem > > for both GIN and GIST. If people already know what they'll be looking > > for and what they will never be looking for, it seems easier to supply > > that list up front, rather than hide it behind lots of hand-crafted > > code. > > > > Can we include that functionality now? > > This could be realized very easily using dict_strict, which returns > only known words, and mapping contains only this dictionary. So, > feel free to write it and submit. So there isn't one yet, but you think it will be easy to write and that we should call it dict_strict? -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Include Lists for Text Search
How does that allow me to limit the number of words to a known list? If all dictionaries returns NULL for token the this token will not be indexed at all. -- Teodor Sigaev E-mail: [EMAIL PROTECTED] WWW: http://www.sigaev.ru/ ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] ispell dictionary broken in CVS HEAD ?
Is anyone working on providing basic regression tests for the different dictionary types? Seems like the main stumbling block is providing I'll do some tests for dictionaries, but it will be synthetic dictionary. Original ispell files is rather big, so I'll make rather simple and small one. -- Teodor Sigaev E-mail: [EMAIL PROTECTED] WWW: http://www.sigaev.ru/ ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Include Lists for Text Search
On Mon, 10 Sep 2007, Simon Riggs wrote: On Mon, 2007-09-10 at 16:10 +0400, Oleg Bartunov wrote: On Mon, 10 Sep 2007, Simon Riggs wrote: It seems possible to write your own functions to support various possibilities with text search. One of the more common thoughts is to have a list of words that you would like to include, i.e. the opposite of a stop word list. There are clear indications that indexing too many words is a problem for both GIN and GIST. If people already know what they'll be looking for and what they will never be looking for, it seems easier to supply that list up front, rather than hide it behind lots of hand-crafted code. Can we include that functionality now? This could be realized very easyly using dict_strict, which returns only known words, and mapping contains only this dictionary. So, feel free to write it and submit. So there isn't one yet, but you think it will be easy to write and that we should call it dict_strict? we have dict_synonym already and if your list is not big you'll be happy. Regards, Oleg _ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83 ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] Include Lists for Text Search
On Mon, 10 Sep 2007, Simon Riggs wrote: On Mon, 2007-09-10 at 12:58 +0100, Heikki Linnakangas wrote: Simon Riggs wrote: It seems possible to write your own functions to support various possibilities with text search. One of the more common thoughts is to have a list of words that you would like to include, i.e. the opposite of a stop word list. There are clear indications that indexing too many words is a problem for both GIN and GIST. If people already know what they'll be looking for and what they will never be looking for, it seems easier to supply that list up front, rather than hide it behind lots of hand-crafted code. I don't understand what you're proposing. We already have dict_synonym that you can use to accept a simple list of words. How does that allow me to limit the number of words to a known list? text search doesn't index unknown words, so if your mapping contains only one dictionary, this dictionary will control what words to index. While dict_synonym is good for not big list I'd write separate dictionary with fast lookup. Regards, Oleg _ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83 ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Include Lists for Text Search
On Mon, 2007-09-10 at 12:58 +0100, Heikki Linnakangas wrote: > Simon Riggs wrote: > > It seems possible to write your own functions to support various > > possibilities with text search. > > > > One of the more common thoughts is to have a list of words that you > > would like to include, i.e. the opposite of a stop word list. > > > > There are clear indications that indexing too many words is a problem > > for both GIN and GIST. If people already know what they'll be looking > > for and what they will never be looking for, it seems easier to supply > > that list up front, rather than hide it behind lots of hand-crafted > > code. > > I don't understand what you're proposing. We already have dict_synonym > that you can use to accept a simple list of words. How does that allow me to limit the number of words to a known list? -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] invalidly encoded strings
Tom Lane wrote: Andrew Dunstan <[EMAIL PROTECTED]> writes: Tom Lane wrote: In the short run it might be best to do it in scan.l after all. I have not come up with a way of doing that and handling the bytea case. AFAICS we have no realistic choice other than to reject \0 in SQL literals; to do otherwise requires API changes throughout that stack of modules. And once you admit \0 is no good, it's not clear that \somethingelse is any better for bytea-using clients. Moreover, given that we are moving away from backslash escapes as fast as we can sanely travel, expending large amounts of energy to make them work better doesn't seem like a good use of development manpower. Perhaps we're talking at cross purposes. The problem with doing encoding validation in scan.l is that it lacks context. Null bytes are only the tip of the bytea iceberg, since any arbitrary sequence of bytes can be valid for a bytea. So we can only do validation of encoding on a literal when we know it isn't destined for a bytea. That's what I haven't come up with a way of doing in the scanner (and as you noted upthread it's getting pretty darn late in the cycle for us to be looking for ways to do things). I still don't see why it's OK for us to do validation from the foo_recv functions but not the corresponding foo_in functions. At least in the short term that would provide us with fairly complete protection against accepting invalidly encoded data into the database, once we fix up chr(), without having to mess with the scanner, parser, COPY code etc. We could still get corruption from UDFs and UDTs - it's hard to see how we can avoid that danger. Yes, we would need to make sure that any additions to the type system acted properly, and yes we should fix up any validation inefficiencies (like possibly inlining calls in the UTF8 validation code). Personally, I don't see those as killer objections. 
cheers andrew ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] integrated tsearch doesn't work with non utf8 database
Note the Seq Scan on pg_ts_config_map, with a filter on
ts_lexize(mapdict, $1). That means it will call ts_lexize on every
dictionary, which will try to load every dictionary. And loading the
danish_stem dictionary fails in a latin2-encoded database, because of
the problem with the stopword file. The attached patch should fix it, I
hope. New plan:

 Hash Join  (cost=2.80..1073.85 rows=80 width=100)
   Hash Cond: (parse.tokid = tt.tokid)
   InitPlan
     ->  Seq Scan on pg_ts_config  (cost=0.00..1.20 rows=1 width=4)
           Filter: (oid = 11308::oid)
     ->  Seq Scan on pg_ts_config  (cost=0.00..1.20 rows=1 width=4)
           Filter: (oid = 11308::oid)
   ->  Function Scan on ts_parse parse  (cost=0.00..12.50 rows=1000 width=36)
   ->  Hash  (cost=0.20..0.20 rows=16 width=68)
         ->  Function Scan on ts_token_type tt  (cost=0.00..0.20 rows=16 width=68)
   SubPlan
     ->  Limit  (cost=6.57..6.60 rows=1 width=36)
           ->  Subquery Scan dl  (cost=6.57..6.60 rows=1 width=36)
                 ->  Sort  (cost=6.57..6.58 rows=1 width=8)
                       Sort Key: ((ts_lexize(m.mapdict, $1) IS NULL)), m.mapseqno
                       ->  Seq Scan on pg_ts_config_map m  (cost=0.00..6.56 rows=1 width=8)
                             Filter: ((mapcfg = 11308::oid) AND (maptokentype = $0))
     ->  Sort  (cost=6.57..6.57 rows=1 width=8)
           Sort Key: m.mapseqno
           ->  Seq Scan on pg_ts_config_map m  (cost=0.00..6.56 rows=1 width=8)
                 Filter: ((mapcfg = 11308::oid) AND (maptokentype = $0))

At least, it checks only the needed dictionaries.

-- 
Teodor Sigaev                  E-mail: [EMAIL PROTECTED]
                               WWW: http://www.sigaev.ru/

*** ./src/backend/catalog/system_views.sql.orig	Mon Sep 10 15:51:27 2007
--- ./src/backend/catalog/system_views.sql	Mon Sep 10 16:09:52 2007
***************
*** 415,422 ****
          ( SELECT mapdict, pg_catalog.ts_lexize(mapdict, parse.token) AS lex
            FROM pg_catalog.pg_ts_config_map AS m
            WHERE m.mapcfg = $1 AND m.maptokentype = parse.tokid
!           ORDER BY m.mapseqno ) dl
!         WHERE dl.lex IS NOT NULL
          LIMIT 1
      ) AS "Lexized token"
  FROM pg_catalog.ts_parse(
--- 415,421 ----
          ( SELECT mapdict, pg_catalog.ts_lexize(mapdict, parse.token) AS lex
            FROM pg_catalog.pg_ts_config_map AS m
            WHERE m.mapcfg = $1 AND m.maptokentype = parse.tokid
!           ORDER BY pg_catalog.ts_lexize(mapdict, parse.token) IS NULL, m.mapseqno ) dl
          LIMIT 1
      ) AS "Lexized token"
  FROM pg_catalog.ts_parse(
Re: [HACKERS] Include Lists for Text Search
On Mon, 10 Sep 2007, Simon Riggs wrote:

> It seems possible to write your own functions to support various
> possibilities with text search.
>
> One of the more common thoughts is to have a list of words that you
> would like to include, i.e. the opposite of a stop word list.
>
> There are clear indications that indexing too many words is a problem
> for both GIN and GIST. If people already know what they'll be looking
> for and what they will never be looking for, it seems easier to supply
> that list up front, rather than hide it behind lots of hand-crafted
> code.
>
> Can we include that functionality now?

This could be realized very easily using a dict_strict, which returns
only known words, with a mapping that contains only this dictionary. So,
feel free to write it and submit.

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
Re: [HACKERS] Include Lists for Text Search
Simon Riggs wrote:
> It seems possible to write your own functions to support various
> possibilities with text search.
>
> One of the more common thoughts is to have a list of words that you
> would like to include, i.e. the opposite of a stop word list.
>
> There are clear indications that indexing too many words is a problem
> for both GIN and GIST. If people already know what they'll be looking
> for and what they will never be looking for, it seems easier to supply
> that list up front, rather than hide it behind lots of hand-crafted
> code.

I don't understand what you're proposing. We already have dict_synonym
that you can use to accept a simple list of words. But that doesn't
change the way GIN and GiST work.

?

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
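Heikki's dict_synonym approach could be sketched roughly as follows. This is an illustrative sketch, not tested configuration: the dictionary, file, and configuration names (include_only, include_words, my_config) are all hypothetical, and it assumes a synonym file that maps each wanted word to itself, so any token not in the file is simply dropped.

```sql
-- Hypothetical include-list built on the existing synonym template.
-- include_words.syn would contain lines like "postgres postgres",
-- one per word you actually want indexed.
CREATE TEXT SEARCH DICTIONARY include_only (
    TEMPLATE = synonym,
    SYNONYMS = include_words
);

-- Map the common word token types to this dictionary alone, so that
-- anything outside the include list produces no lexemes at all.
ALTER TEXT SEARCH CONFIGURATION my_config
    ALTER MAPPING FOR asciiword, word WITH include_only;
```

Because unmatched tokens yield no lexemes, they never reach the GIN/GiST index, which is the effect Simon is after.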
Re: [HACKERS] A Silly Idea for Vertically-Oriented Databases
On Fri, 2007-09-07 at 13:52 -0700, Avery Payne wrote:
> So I've been seeing/hearing all of the hoopla over vertical databases
> (column stores), and how they'll not only slice bread but also make
> toast, etc. I've done some quick searches for past articles on
> "C-Store", "Vertica", "Column Store", and "Vertical Database", and
> have seen little discussion on this.

I looked at doing this a while back, for similar reasons.

ISTM we would be able to do this fairly well if we implemented
index-only columns, i.e. columns that don't exist in the heap, only in
an index. Taken to the extreme, all columns could be removed from the
heap and placed in one or more indexes. Only the visibility information
would remain on the heap.

Syntax for this would be an ALTER TABLE SET STORAGE command, with a new
type of storage definition that would only be accepted if an index
including the specified column has already been defined on the table.

Doing this per column would be a big win over vertical databases, since
AFAIK they *have* to do this to *every* column, even if it is not
beneficial to do so. Every existing index plan works immediately.

The main annoyance is retrieving a column value that doesn't exist on
the heap. That would require a new kind of plan that involves preparing
the index(es) by sorting them on htid and then doing a left merge join
with the main heap. By now, most people will be screaming at their
monitors "what an idiot, that's gonna REALLY suck". True, but this is
the same thing that column-oriented databases have to do also, so it
would be taking the best that vertical databases have to offer and
accepting the consequences. There are some other plan possibilities
also, apart from this basic value retrieval, but they would require
further index API changes; I'm not certain of those being primary use
cases, however.

Vertical databases honestly do have their uses: there are many kinds of
marketing query that have complex WHERE clauses yet only simple SELECT
clauses. A number of industry-specific database products have utilised
this technique to good effect for a number of years.

So ISTM the main changes would be executor changes to allow retrieving
column values of index-only columns when they are required, and to
modify the insert/update tuple code somewhat.

I thought maybe we could call it COAST: column-oriented attribute
storage technique. :-)

-- 
Simon Riggs
2ndQuadrant  http://www.2ndQuadrant.com
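To make the ALTER TABLE SET STORAGE idea concrete, the proposal might look something like the sketch below. To be clear, this is purely hypothetical syntax for a feature that does not exist; the table, column, and storage-mode names are invented for illustration.

```sql
-- Hypothetical, NOT implemented: the column must already be covered
-- by an index before its heap copy can be dropped.
CREATE INDEX orders_amount_idx ON orders (amount);

-- Proposed new storage mode: the heap keeps only the tuple header
-- (visibility information); the value lives solely in the index.
ALTER TABLE orders
    ALTER COLUMN amount SET STORAGE INDEX_ONLY;
```

A query reading `amount` by any path other than orders_amount_idx would then need the new plan type Simon describes: sort the index entries by heap TID and merge-join them back against the heap.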
[HACKERS] Include Lists for Text Search
It seems possible to write your own functions to support various
possibilities with text search.

One of the more common thoughts is to have a list of words that you
would like to include, i.e. the opposite of a stop word list.

There are clear indications that indexing too many words is a problem
for both GIN and GIST. If people already know what they'll be looking
for and what they will never be looking for, it seems easier to supply
that list up front, rather than hide it behind lots of hand-crafted
code.

Can we include that functionality now?

-- 
Simon Riggs
2ndQuadrant  http://www.2ndQuadrant.com
Re: [HACKERS] Hash index todo list item
--On Saturday, September 08, 2007 18:56:23 -0400 Mark Mielke
<[EMAIL PROTECTED]> wrote:

> Kenneth Marshall wrote:
>> Along with the hypothetical performance wins, the hash index space
>> efficiency would be improved by a similar factor. Obviously, all of
>> these ideas would need to be tested in various workload environments.
>> In the large index arena, 10^6 to 10^9 keys and more, space
>> efficiency will help keep the index manageable in today's system
>> memories.
> Space efficiency is provided by not storing the key, nor the header
> data required (length prefix?).

Space efficiency at ~1 entry per bucket: how about using closed hashing,
saving in each page a bitmask in front which specifies which entries
hold valid entries, and in the rest of the page row pointers (is this
the correct expression? I don't know...) without further data. That
should provide a reasonably simple data structure and alignment for the
pointers.

>> Please keep the ideas and comments coming. I am certain that a
>> synthesis of them will provide an implementation with the performance
>> characteristics that we are seeking.

One should look into new plan nodes for "!= ANY()", "NOT EXISTS" and
similar. A node like "look into hash and return true if bucket is
empty" could skip checking tuple visibility when the bucket is empty,
and could be a win in some situations.

Do we want special cases for short keys like INT4? In those cases the
implementation might use hash == key and put that knowledge to use in
plans. Even a unique constraint might then be doable.

Does the postgresql storage backend on Linux support sparse files?
Might be a win when holes in the sequence turn up.
Re: [HACKERS] invalidly encoded strings
>>> I think the concern is when they use only one slash, like:
>>> E'\377\000\377'::bytea
>>> which, as I mentioned before, is not correct anyway.
>
> Wait, why would this be wrong? How would you enter the three byte
> bytea consisting of those three bytes described above?

Either as E'\\377\\000\\377' or '\377\000\377'

/Dennis
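The doubled-backslash form Dennis gives first can be checked directly: the E'' scanner consumes one level of backslashes, leaving the text \377\000\377, and the bytea input function then decodes the three octal escapes into three octets. A minimal sanity check (the single-backslash form is deliberately omitted, since its behavior is exactly what is in dispute in this thread):

```sql
-- One backslash level is eaten by the E'' string scanner, the other
-- by the bytea input function, leaving exactly three octets.
SELECT octet_length(E'\\377\\000\\377'::bytea);  -- 3
```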
Re: [HACKERS] invalidly encoded strings
"Tom Lane" <[EMAIL PROTECTED]> writes:

> Jeff Davis <[EMAIL PROTECTED]> writes:
>
>> I think the concern is when they use only one slash, like:
>> E'\377\000\377'::bytea
>> which, as I mentioned before, is not correct anyway.

Wait, why would this be wrong? How would you enter the three byte bytea
consisting of those three bytes described above?

-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
Re: [HACKERS] invalidly encoded strings
Tom Lane wrote:
>> . for chr() under UTF8, it seems to be generally agreed that the
>> argument should represent the codepoint and the function should
>> return the correspondingly encoded character. If so, possibly the
>> argument should be a bigint to accommodate the full range of
>> possible code points. It is not clear what the argument should
>> represent for other multi-byte encodings for any argument higher
>> than 127. Similarly, it is not clear what ascii() should return in
>> such cases. I would be inclined just to error out.
>
> In SQL_ASCII I'd argue for allowing 0..255. In actual MB encodings,
> OK with throwing error.

I'd like to repeat my suggestion for chr() and ascii(). Instead of the
code point, I'd prefer the actual encoding of the character as argument
to chr() and return value of ascii().

The advantage I see is the following:
- It would make these functions from oracle_compat.c compatible with
  Oracle (Oracle's chr() and ascii() work the way I suggest).

I agree with Tom's earlier suggestion to throw an error for chr(0),
although this is not what Oracle does.

Of course, if it is generally perceived that the code point is more
useful than the encoding, then Oracle compliance is probably secondary.

Yours,
Laurenz Albe
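The difference between the two proposals can be illustrated with a concrete character. This is a sketch of the two competing semantics under discussion, not of any released behavior; the numeric values assume a UTF8 server encoding.

```sql
-- Code-point interpretation: the argument is the Unicode code point,
-- and the function emits its UTF8 encoding.
SELECT chr(12354);    -- U+3042, HIRAGANA LETTER A
SELECT ascii('あ');   -- would return 12354

-- Encoded-byte (Oracle-style) interpretation: the argument is the
-- numeric value of the character's UTF8 byte sequence.
-- 'あ' is encoded as E3 81 82, and 0xE38182 = 14909826:
SELECT chr(14909826); -- same character under Laurenz's proposal
```

Under the code-point reading, chr(189) would produce the two-byte UTF8 sequence for U+00BD; under the encoded-byte reading, it would produce the lone byte 0xBD, i.e. exactly the invalid sequence this thread started with.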