Re: [HACKERS] HOT pgbench results
Heikki Linnakangas [EMAIL PROTECTED] wrote: I ran some CPU intensive pgbench tests on HOT. Results are not surprising, HOT makes practically no difference on the total transaction rate, but reduces the need to vacuum: unpatched HOT tps 3680 3790 WAL written(MB) 5386 4804 checkpoints 109 autovacuums 116 43 autoanalyzes139 60 I also ran pgbench with/without HOT using a bit different configurations (pgbench -s10 -c10 -t50). There were 10% performance win on HOT, although the test was CPU intensive and with FILLFACTOR=100%. unpatched HOT tps 3366 3634 WAL written(MB) 4969 4374 checkpoints 9 8 autovacuums 126 42 autoanalyzes146 59 I gathered oprofile logs. There were 4 HOT-related functions, that didn't appear in the unpatched test. But it is probably not so serious. - heap_page_prune 1.84% - PageRepairFragmentation 0.94% - pg_qsort 0.44% (called from PageRepairFragmentation) On the other hand, the number of _bt_compare and _bt_checkkeys were reduced by HOT, because we avoid the most part of index insertions. It looks like LWLockAcquire/Release were also reduced, but I cannot assure it is a benefits of HOT or a measurement deviation. unpatched HOT % symbol name 4.08674.2314AllocSetAlloc 2.78392.8022base_yyparse 1.8392heap_page_prune 1.84591.6659SearchCatCache 1.74051.6087MemoryContextAllocZeroAligned 1.69361.5743hash_search_with_hash_value 1.06721.1822base_yylex 1.24301.1570XLogInsert 0.9356PageRepairFragmentation 1.35490.8911LWLockAcquire 1.09770.8663LWLockRelease 0.80180.7284nocachegetattr 0.75680.7124FunctionCall2 0.52640.6795ScanKeywordLookup 0.71150.6462hash_any 0.73990.5963AllocSetFree 0.66500.5925GetSnapshotData 0.55360.5789MemoryContextAlloc 0.56430.5547hash_seq_search 0.46600.5005expression_tree_walker 0.52930.4777ExecInitExpr 0.4381pg_qsort 0.43760.4321hash_uint32 0.41600.4268expression_tree_mutator 0.43220.4183LockAcquire 0.69330.3911_bt_compare 0.52700.3828PinBuffer 0.40250.3798fmgr_info_cxt_security 0.44580.3758MemoryContextAllocZero 0.5101 _bt_checkkeys Regards, --- ITAGAKI Takahiro NTT Open Source Software Center ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] HOT pgbench results
Thanks for the testing, ITAGAKI Takahiro wrote: I gathered oprofile logs. There were 4 HOT-related functions, that didn't appear in the unpatched test. But it is probably not so serious. - heap_page_prune 1.84% - PageRepairFragmentation 0.94% - pg_qsort 0.44% (called from PageRepairFragmentation) That's expected. Those functions are involved in removing the dead HOT tuples, replacing VACUUMs. Maybe we could make them cheaper, but it's not too bad as it is. On the other hand, the number of _bt_compare and _bt_checkkeys were reduced by HOT, because we avoid the most part of index insertions. It looks like LWLockAcquire/Release were also reduced, but I cannot assure it is a benefits of HOT or a measurement deviation. It could very well be real. Because of the reduction of index insertions, there's less locking of the index pages. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] HOT patch, missing things
On 8/9/07, Tom Lane [EMAIL PROTECTED] wrote: Yeah, we could simply insist on no change to any column that's used by any of the expressions. That would be cheap to test. I am trying to figure out the best way to extract this information. Is there any existing code to get all attributes used in the expressions ? Or do I need to walk the tree and extract that information ? Thanks, Pavan Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] HOT patch, missing things
On 8/9/07, Simon Riggs [EMAIL PROTECTED] wrote: On Thu, 2007-08-09 at 15:46 +0530, Pavan Deolasee wrote: What if we just track the amount of potentially dead space in the relation (somebody had suggested that earlier in the thread) ? Every committed UPDATE/DELETE and aborted UPDATE/INSERT would increment the dead space. Whenever page fragmentation is repaired, either during normal operation or during vacuum, the dead space is reduced by the amount of reclaimed space. Autovacuum triggers whenever the percentage of dead space increases beyond a threshold. We can some fine tuning to track the space consumed by redirect-dead line pointers. Sounds great. So do we have consensus here ? Fortunately, I think there won't be any changes to user interface. Users can still use the vacuum_scale_factor to tune autovacuum, but instead of percentage of dead tuples, it would signify percentage of dead space in the relation. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] HOT patch, missing things
On Tue, 2007-08-14 at 13:24 +0530, Pavan Deolasee wrote: On 8/9/07, Simon Riggs [EMAIL PROTECTED] wrote: On Thu, 2007-08-09 at 15:46 +0530, Pavan Deolasee wrote: What if we just track the amount of potentially dead space in the relation (somebody had suggested that earlier in the thread) ? Every committed UPDATE/DELETE and aborted UPDATE/INSERT would increment the dead space. Whenever page fragmentation is repaired, either during normal operation or during vacuum, the dead space is reduced by the amount of reclaimed space. Autovacuum triggers whenever the percentage of dead space increases beyond a threshold. We can some fine tuning to track the space consumed by redirect-dead line pointers. Sounds great. So do we have consensus here ? Fortunately, I think there won't be any changes to user interface. Users can still use the vacuum_scale_factor to tune autovacuum, but instead of percentage of dead tuples, it would signify percentage of dead space in the relation. We have some consensus, but no complete design. My understanding is that we would see the following things tracked in pg_stats_xxx n_tup_ins count of rows inserted n_tup_upd count of rows updated (incl HOT and cold) n_tup_del count of rows deleted - the above are required because they are already there and useful too n_tup_hot_upd count of rows updated by HOT method only - the above is required to help tune HOT/cold updates dead_space total number of dead bytes in table - the above is required for autovacuum -- Simon Riggs EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Testing the async-commit patch
On Mon, 2007-08-13 at 16:27 -0400, Andrew Dunstan wrote: Let's say that this file looks just like a postgresql.conf file, except that any line beginning with '[identifier]' is a config set name for the lines that follow. So we might have: [asynch_commit] synchronous_commit = off [no_fsync] fsync = off [csvlogs] start_log_collector = true log_destination = 'stderr, csvlog' Then there would be an extra installcheck-parallel run for each set. If the file isn't there we do nothing. Sounds fine, though I'd prefer this in XML to allow further extensibility in the future. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] 2D partitioning of VLDB - sane or not?
Which brings us back to the original issue. If I decide to stick with the current implementation and not improve our existing partitioning mechanisms to scale to 100,000 partitions, I could do something like: There is a point where you can leave the selection of the correct rows to normal btree indexes. I'd say that that point currently is well below 2000 partitions for all common db systems. I opt, that more partitions will only be useful for very limited use cases, and would be very interested in hearing of a practical partitioning scheme where more partitions still show a performance advantage (any db system). Looks like in your case a partitioning scheme with 256 partitions on one of the 2 dimensions sounds reasonable. Or 32 instead of 256 bins for each dim, if your searches are bidirectional. Andreas ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] 2D partitioning of VLDB - sane or not?
Zeugswetter Andreas ADI SD [EMAIL PROTECTED] writes: I'd say that that point currently is well below 2000 partitions for all common db systems. I think it will depend heavily on the type of queries you're talking about. Postgres's constraint_exclusion is a linear search and does quite a bit of work for each constraint. So it isn't terribly efficient for more than 1,000 partitions or so. *But* that only affects planning time. If your queries are always effectively pruned to few partitions and you execute them thousands of times then you not care about slow planning time. And if the overall table is large enough and you're dropping and loading partitions then you may still be benefiting from partitioning by keeping all the loaded records together and allowing dropping a partition to be constant time. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Testing the async-commit patch
Tom Lane [EMAIL PROTECTED] writes: But to get to the point: the urgency of testing the patch more extensively has just moved up a full order of magnitude, The problem testing this patch is that the window for a committed transaction to not be synced is quite narrow, especially for the regression tests. For testing purposes I wonder if there are ways we can widen this window. Some ideas, some wackier than others, are: . Raise the default wal_writer_delay to 5s or so -- also temporary until release . Add an ifdef USE_ASSERT_CHECKING which randomly omits setting hint bits even when it could. . add an ifdef USE_ASSERT_CHECKING which randomly fails to update the LSN when syncing WAL so that even after a buffer flush we still can't set hint bits. Only the first one isn't really wacky, but perhaps there's something there. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] 8.3 freeze/release
On Mon, Aug 13, 2007 at 10:04:32AM -0400, Tom Lane wrote: Well, we are in feature freeze, but as far as I'm concerned ecpg is its own little fiefdom. If you have enough confidence in these changes to apply them now, no one is going to question you. Okay, done. Please test it! There were some with platforms I didn't test on, but hopefully they are fixed now. There is one more problem with the threading tests that are run even if threading is not activated. Will look into this later. Michael -- Michael Meskes Email: Michael at Fam-Meskes dot De, Michael at Meskes dot (De|Com|Net|Org) ICQ: 179140304, AIM/Yahoo: michaelmeskes, Jabber: [EMAIL PROTECTED] Go SF 49ers! Go Rhein Fire! Use Debian GNU/Linux! Use PostgreSQL! ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
[HACKERS] Question about change in page/tuple header (v4)
I'm comparing now different version of page layer, and I have two questions: 1) We now store only low 16bits TLI, but name in structure stays same. Maybe pg_tli_lo could be better. 2) HASOID has been moved in infomask and original place is unused. Is there some reason for that? This change little bit complicate tuple header upgrade. If there is not real reason to have it in current place, I recommend to move it back. thanks Zdenek ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Testing the async-commit patch
Simon Riggs wrote: On Mon, 2007-08-13 at 16:27 -0400, Andrew Dunstan wrote: Let's say that this file looks just like a postgresql.conf file, except that any line beginning with '[identifier]' is a config set name for the lines that follow. So we might have: [asynch_commit] synchronous_commit = off [no_fsync] fsync = off [csvlogs] start_log_collector = true log_destination = 'stderr, csvlog' Then there would be an extra installcheck-parallel run for each set. If the file isn't there we do nothing. Sounds fine, though I'd prefer this in XML to allow further extensibility in the future. YAML? -- Alvaro Herrerahttp://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] default_text_search_config and expression indexes
Oleg Bartunov wrote: On Thu, 9 Aug 2007, [EMAIL PROTECTED] wrote: Maybe I'm missing something, but it seems to me that the configuration is more attached to a column/index thatn to the whole database. If there's a default in an expression, I'd rather expect this default to be drawn from the index involved than from a global value (like a functional index does now). I'm tired to repeat - index itself doesn't know about configuration ! Is there a way to change that? For example store the configuration in a metapage or something? -- Alvaro Herrerahttp://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Testing the async-commit patch
Alvaro Herrera [EMAIL PROTECTED] writes: Simon Riggs wrote: Sounds fine, though I'd prefer this in XML to allow further extensibility in the future. YAML? That would seem to require making pg_regress depend on some XML library or other, which is a dependency I'd rather not add. regards, tom lane ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] HOT patch, missing things
Pavan Deolasee [EMAIL PROTECTED] writes: I am trying to figure out the best way to extract this information. Is there any existing code to get all attributes used in the expressions ? Or do I need to walk the tree and extract that information ? There are a number of near matches in backend/optimizer/util/var.c, but nothing that has exactly the API you probably want, which I'd think would be to extract a bitmapset of the varattnos of level-zero Vars. contain_var_reference() could be used repeatedly but that seems tremendously inefficient. I'd suggest coding up some new function using what's there for reference. regards, tom lane ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] HOT patch, missing things
Pavan Deolasee [EMAIL PROTECTED] writes: What if we just track the amount of potentially dead space in the relation (somebody had suggested that earlier in the thread) ? Every committed UPDATE/DELETE and aborted UPDATE/INSERT would increment the dead space. Whenever page fragmentation is repaired, either during normal operation or during vacuum, the dead space is reduced by the amount of reclaimed space. Autovacuum triggers whenever the percentage of dead space increases beyond a threshold. Doesn't this design completely fail to take index bloat into account? Repairing heap fragmentation does not reduce the need for VACUUM to work on the indexes. regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Testing the async-commit patch
Tom Lane wrote: Alvaro Herrera [EMAIL PROTECTED] writes: Simon Riggs wrote: Sounds fine, though I'd prefer this in XML to allow further extensibility in the future. YAML? That would seem to require making pg_regress depend on some XML library or other, which is a dependency I'd rather not add. Yeah, I think the way I set it out would work just fine for the intended purpose. XML, YAML, JSON et al are all well suited to tree structured data. But what we're describing here isn't tree structured. It is simply some named sets of postgresql.conf directives. As such, it would be best if it were as close as possible to actual postgresql.conf syntax. And I am also reluctant to add an additional dependency onto the buildfarm script. (I wasn't actually thinking of doing this throught pg_regress, but I'm open to persuasion). cheers andrew ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] default_text_search_config and expression indexes
Alvaro Herrera [EMAIL PROTECTED] writes: Oleg Bartunov wrote: I'm tired to repeat - index itself doesn't know about configuration ! Is there a way to change that? For example store the configuration in a metapage or something? I think Heikki's suggestion of having each configuration create a new type would effectively do the same thing. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Question about change in page/tuple header (v4)
Zdenek Kotala [EMAIL PROTECTED] writes: 2) HASOID has been moved in infomask and original place is unused. Is there some reason for that? To keep it next to the other bits that are about tuple content rather than transactional behavior. This change little bit complicate tuple header upgrade. If there is not real reason to have it in current place, I recommend to move it back. There are enough other changes there that you're going to have to manipulate the infomask anyway. regards, tom lane ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] HOT patch, missing things
Tom Lane [EMAIL PROTECTED] writes: Pavan Deolasee [EMAIL PROTECTED] writes: What if we just track the amount of potentially dead space in the relation (somebody had suggested that earlier in the thread) ? Every committed UPDATE/DELETE and aborted UPDATE/INSERT would increment the dead space. Whenever page fragmentation is repaired, either during normal operation or during vacuum, the dead space is reduced by the amount of reclaimed space. Autovacuum triggers whenever the percentage of dead space increases beyond a threshold. Doesn't this design completely fail to take index bloat into account? Repairing heap fragmentation does not reduce the need for VACUUM to work on the indexes. Index bloat is a bit of an open issue already. Because page splits already prune any LP_DELETEd pointers any busy index keys will be pruned already. However any index keys which have not been the subject of an index lookup -- and that includes keys which are only accessed by bitmap-index-scans -- won't be pruned. So we don't really know how much bloat is currently in an index. Perhaps we need a new statistic which gets updated whenever a page split prunes LP_DELETEd pointers (or perhaps when LP_DELETE is set?). -- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] HOT patch, missing things
On Tue, 2007-08-14 at 10:10 -0400, Tom Lane wrote: Pavan Deolasee [EMAIL PROTECTED] writes: What if we just track the amount of potentially dead space in the relation (somebody had suggested that earlier in the thread) ? Every committed UPDATE/DELETE and aborted UPDATE/INSERT would increment the dead space. Whenever page fragmentation is repaired, either during normal operation or during vacuum, the dead space is reduced by the amount of reclaimed space. Autovacuum triggers whenever the percentage of dead space increases beyond a threshold. Doesn't this design completely fail to take index bloat into account? Repairing heap fragmentation does not reduce the need for VACUUM to work on the indexes. I thought of that, but we will only clean up space that is allowable, so the indexes don't degrade. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Testing the async-commit patch
Tom, We're getting some additional test infrastructre at Sun; I'll throw this on the pile of stuff to test. However, the current tests we're doing are regression tests and benchmark runs. If there's some other kind of testing we need to do, I'll need specifics. -- Josh Berkus PostgreSQL @ Sun San Francisco ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] [EMAIL PROTECTED]: Re: [GENERAL] array_to_set functions]
TODO item? --- Michael Glaesemann wrote: -- Start of PGP signed section. On Aug 8, 2007, at 12:18 , Decibel! wrote: On Wed, Aug 08, 2007 at 12:03:34PM -0500, Michael Glaesemann wrote: Personally, I think expandarray is more appropriate and its functionality probably more generally useful, as it identifies the array indices as well. Note you can also rename the columns. Sure. My point is that we should have a way to convert arrays to sets and back in the backend. Can't really argue with you there, as I find array_accum myself. (Though I'd still nit-pick that this isn't an array to set conversion, but rather array to--possibly single-column--table.) Michael Glaesemann grzm seespotcode net -- End of PGP section, PGP failed! -- Bruce Momjian [EMAIL PROTECTED] http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] GUC for default heap fillfactor
Decibel! wrote: -- Start of PGP signed section. On Thu, Aug 09, 2007 at 09:57:48AM +0900, ITAGAKI Takahiro wrote: If HOT gets into 8.3, we might need a GUC to set database wide heap fillfactor to an appropriate value. I have no objection to do that, but we will need other default values soon, something like default_[heap|btree|hash|gist|gin]_fillfactor. Some of us might feel it is mess to add random guc variables. I think we'd want to handle indexes with a different mechanism, probably one that makes changes to pg_am. In any case, how important is it to do this before 8.3? We were supposed to release this month, after all. Agreed. This seems like 8.4 material once we have HOT usage in the field. -- Bruce Momjian [EMAIL PROTECTED] http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] change name of redirect_stderr?
Andrew Dunstan wrote: Tom Lane wrote: Andrew Dunstan [EMAIL PROTECTED] writes: Before I wrap up the CSVlog stuff, we need to decide whether or not to change the name of the redirect_stderr setting, and if so to what. The reason is that with CSVlogs it will no longer apply just to stderr (we will require it to be on for CSVlogs, in fact). I suggest redirect_logs, although it's arguably too general as it doesn't apply to syslog/eventlog. Perhaps it should be named analogously to stats_start_collector, ie think of the syslogger process as a log collector. I don't much like log_start_collector though --- start_log_collector seems far less confusing as to where the verb is. Nobody else seems to care much. I'll go with start_log_collector. Are we trying to use log_* prefixes, e.g. log_start_collector? -- Bruce Momjian [EMAIL PROTECTED] http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] [EMAIL PROTECTED]: Re: [GENERAL] array_to_set functions]
2007/8/14, Bruce Momjian [EMAIL PROTECTED]: TODO item? + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 6: explain analyze is your friend I am against. It's too simple do it in SQL language. Regards Pavel Stehule ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] change name of redirect_stderr?
Bruce Momjian wrote: Andrew Dunstan wrote: I suggest redirect_logs, although it's arguably too general as it doesn't apply to syslog/eventlog. Perhaps it should be named analogously to stats_start_collector, ie think of the syslogger process as a log collector. I don't much like log_start_collector though --- start_log_collector seems far less confusing as to where the verb is. Nobody else seems to care much. I'll go with start_log_collector. Are we trying to use log_* prefixes, e.g. log_start_collector? That sounds like you want to log when the collector starts, which is not the case and is confusing -- what collector is it talking about? This is about starting the log collector. -- Alvaro Herrera http://www.PlanetPostgreSQL.org/ How amazing is that? I call it a night and come back to find that a bug has been identified and patched while I sleep.(Robert Davidson) http://archives.postgresql.org/pgsql-sql/2006-03/msg00378.php ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] change name of redirect_stderr?
Alvaro Herrera wrote: Bruce Momjian wrote: Andrew Dunstan wrote: I suggest redirect_logs, although it's arguably too general as it doesn't apply to syslog/eventlog. Perhaps it should be named analogously to stats_start_collector, ie think of the syslogger process as a log collector. I don't much like log_start_collector though --- start_log_collector seems far less confusing as to where the verb is. Nobody else seems to care much. I'll go with start_log_collector. Are we trying to use log_* prefixes, e.g. log_start_collector? That sounds like you want to log when the collector starts, which is not the case and is confusing -- what collector is it talking about? This is about starting the log collector. Yea, good point. I was just wondering because I don't see 'start' used in anywhere at the beginning of a GUC variable. -- Bruce Momjian [EMAIL PROTECTED] http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Testing the async-commit patch
Gregory Stark [EMAIL PROTECTED] writes: The problem testing this patch is that the window for a committed transaction to not be synced is quite narrow, especially for the regression tests. For testing purposes I wonder if there are ways we can widen this window. Some ideas, some wackier than others, are: . Raise the default wal_writer_delay to 5s or so -- also temporary until release I ran 100+ cycles of the parallel regression tests with this setting, and didn't see any failures. . Add an ifdef USE_ASSERT_CHECKING which randomly omits setting hint bits even when it could. I think this is better done by code inspection, ie, look for places that assume HEAP_XMIN/XMAX_COMMITTED is or can be set. I made a pass over CVS HEAD and found some apparent trouble spots: heapam.c lines 1843-1852 presume previous xmax can be hinted immediately, ditto lines 2167-2176, ditto lines 2716-2725. I think probably we should just remove those lines --- they are only trying to save work for future tqual.c calls. regards, tom lane ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Testing the async-commit patch
On Tue, 2007-08-14 at 12:09 -0400, Tom Lane wrote: I think this is better done by code inspection, ie, look for places that assume HEAP_XMIN/XMAX_COMMITTED is or can be set. I made a pass over CVS HEAD and found some apparent trouble spots: heapam.c lines 1843-1852 presume previous xmax can be hinted immediately, ditto lines 2167-2176, ditto lines 2716-2725. I think probably we should just remove those lines --- they are only trying to save work for future tqual.c calls. I'll check those out later tonight. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] change name of redirect_stderr?
Bruce Momjian wrote: Alvaro Herrera wrote: Bruce Momjian wrote: Andrew Dunstan wrote: I suggest redirect_logs, although it's arguably too general as it doesn't apply to syslog/eventlog. Perhaps it should be named analogously to stats_start_collector, ie think of the syslogger process as a log collector. I don't much like log_start_collector though --- start_log_collector seems far less confusing as to where the verb is. Nobody else seems to care much. I'll go with start_log_collector. Are we trying to use log_* prefixes, e.g. log_start_collector? That sounds like you want to log when the collector starts, which is not the case and is confusing -- what collector is it talking about? This is about starting the log collector. Yea, good point. I was just wondering because I don't see 'start' used in anywhere at the beginning of a GUC variable. Good point too. In other places we just name the feature that's to be started, for example we don't use start_autovacuum. -- Alvaro Herrera http://www.flickr.com/photos/alvherre/ Granting software the freedom to evolve guarantees only different results, not better ones. (Zygo Blaxell) ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Testing the async-commit patch
Simon Riggs [EMAIL PROTECTED] writes: On Tue, 2007-08-14 at 12:09 -0400, Tom Lane wrote: heapam.c lines 1843-1852 presume previous xmax can be hinted immediately, ditto lines 2167-2176, ditto lines 2716-2725. I think probably we should just remove those lines --- they are only trying to save work for future tqual.c calls. I'll check those out later tonight. [ looks closer ] Actually, we can't just dike out those code sections, because the immediately following code assumes that XMAX_INVALID is correct. So I guess we need to export HeapTupleSetHintBits from tqual.c and do the full pushup in these places. regards, tom lane ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] change name of redirect_stderr?
Alvaro Herrera wrote: That sounds like you want to log when the collector starts, which is not the case and is confusing -- what collector is it talking about? This is about starting the log collector. Yea, good point. I was just wondering because I don't see 'start' used in anywhere at the beginning of a GUC variable. Good point too. In other places we just name the feature that's to be started, for example we don't use start_autovacuum. How about just log_collector then? Lets decide ASAP please - I want to get this committed. cheers andrew ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] Problem with locks
Tom Lane [EMAIL PROTECTED] writes: Gregory Stark [EMAIL PROTECTED] writes: Tom Lane [EMAIL PROTECTED] writes: Seems to me this proves nothing much, since it doesn't use the same SysV semaphore API PG does. I was trying to copy the semaphore API exactly assuming USE_NAMED_POSIX_SEMAPHORES was *not* defined. According to the comments we prefer not to use named semaphores if possible. What you seem to have copied is the posix_sema.c code, which AFAIK is only used on Darwin. sysv_sema.c is what to look at ... unless your benchmark machine is a Mac. I switched the code over to the sysv_sema style api. It's gotten a bit grotty and I would clean it up if it weren't a temporary test program. If we find a real problem perhaps I should add a test case like this to the smoke test in ipc_test.c so people can check their OS. I did add something like the setitimer deadlock timeout to detect a process stuck waiting. There is a race condition there if a process is woken up just as the timer fires but if the timeout is large enough the chances of that are pretty remote. Judging by the first thread the whole loop excluding the usleep takes about 3ms. I've been using a timeout of 10 seconds. As such: $ ./a.out 40 900 10 running with 40 processes for 900s with timeout of 1ms telling threads to exit run done cleaning up semaphores and shared memory #include stdlib.h #include stdio.h #include string.h #include errno.h #include unistd.h #include signal.h #include sys/types.h #include sys/ipc.h #include sys/shm.h #include sys/sem.h #include sys/time.h union semun { int val;/* Value for SETVAL */ struct semid_ds *buf;/* Buffer for IPC_STAT, IPC_SET */ unsigned short *array; /* Array for GETALL, SETALL */ struct seminfo *__buf; /* Buffer for IPC_INFO (Linux specific) */ }; #define SEMAS_PER_SET 16 #define IPCProtection (0600) /* access/modify by user only */ #define PGSemaMagic 537 /* must be less than SEMVMX */ #define SEMAS_KEY_START (5431*1000-1) int nthreads, timeout, shmid; volatile unsigned char *wakers; typedef struct PGSemaphoreData { int semId; /* semaphore set identifier */ int semNum; /* semaphore number within set */ } PGSemaphoreData; PGSemaphoreData *sems; static void atexit_handler(); static void worker(int n); static void down(int n); static void up(int n); #define MAX_THREADS 250 #define WAKER_NOOP 253 #define WAKER_EXIT 254 #define WAKER_RUNNABLE 255 /* this just forces the atexit handler to be called */ static void handle_sig(int arg) {exit(127+arg);} int main(int argc, char *argv[]) { int i, semKey, runtime; pid_t *pids; struct sigaction act, oact; int semId=-1; if (argc = 1) nthreads = 10; else nthreads = atoi(argv[1]); if (nthreads = 0 || nthreads MAX_THREADS) { fprintf(stderr, usage: nthreads not between 1 and %d\n, MAX_THREADS); exit(1); } if (argc = 2) runtime = 10; else runtime = atoi(argv[2]); if (runtime 1) { fprintf(stderr, usage: runtime shorter than 1s\n); exit(1); } if (argc = 3) timeout = 1000*60; else timeout = 1000.0*atof(argv[3]); if (timeout 1) { fprintf(stderr, usage: timeout shorter than 1s\n); exit(1); } printf(running with %d processes for %ds with timeout of %dms\n, nthreads, runtime, timeout); sems = malloc(sizeof(*sems)*nthreads); semKey = SEMAS_KEY_START; for (i=0;inthreads;i++) { union semun semun; int semNum = i % SEMAS_PER_SET; if (semNum == 0) { semKey += 1; semId = semget(semKey, SEMAS_PER_SET, IPC_CREAT | IPC_EXCL | IPCProtection); if (semId 0) { perror(semget); exit(1); } } semun.val = 0; if (semctl(semId, semNum, SETVAL, semun) 0) { fprintf(stderr, semctl(%d, %d, SETVAL, 0): %s\n, semId, semNum, strerror(errno)); exit(1); } if (semId0 || semNum SEMAS_PER_SET) exit(1); sems[i].semId = semId; sems[i].semNum= semNum; } shmid = shmget(IPC_PRIVATE, nthreads*sizeof(unsigned char), IPC_CREAT | IPC_EXCL | IPCProtection); if (shmid == -1) { perror(shmget); exit(1); } wakers = shmat(shmid, NULL, 0); wakers[0] = WAKER_NOOP; for (i=1;inthreads;i++) wakers[i] = WAKER_RUNNABLE; pids = malloc(sizeof(pid_t)*nthreads); for (i=0;inthreads;i++) { /*printf(forking thread %d\n, i);*/ switch(pids[i] = fork()) { case 0: worker(i); exit(0); case -1: perror(fork); exit(1); default: /*printf(successfully forked thread %d as pid %d\n, i, pids[i]);*/ break; } } act.sa_handler = handle_sig; sigemptyset(act.sa_mask); act.sa_flags = 0; if (sigaction(SIGINT, act, oact) 0) perror(sigaction); atexit(atexit_handler); sleep(runtime); printf(telling threads to exit\n); for (i=0;inthreads;i++) { while (wakers[i] == WAKER_RUNNABLE) { printf(still waiting for thread %d to block\n, i); sleep(1); }
Re: [HACKERS] change name of redirect_stderr?
Andrew Dunstan wrote: Alvaro Herrera wrote: That sounds like you want to log when the collector starts, which is not the case and is confusing -- what collector is it talking about? This is about starting the log collector. Yea, good point. I was just wondering because I don't see 'start' used in anywhere at the beginning of a GUC variable. Good point too. In other places we just name the feature that's to be started, for example we don't use start_autovacuum. How about just log_collector then? Lets decide ASAP please - I want to get this committed. Works for me, or enable_log_collector? Based on Alvaro's comments, log_collector does sound like we are logging collector activity. -- Bruce Momjian [EMAIL PROTECTED] http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] change name of redirect_stderr?
On Tue, Aug 14, 2007 at 11:42 AM, in message [EMAIL PROTECTED], Andrew Dunstan [EMAIL PROTECTED] wrote: Alvaro Herrera wrote: In other places we just name the feature that's to be started, for example we don't use start_autovacuum. How about just log_collector then? +1 Unambiguous and consistent with other settings. log_collector = on ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] change name of redirect_stderr?
Bruce Momjian wrote: Andrew Dunstan wrote: Alvaro Herrera wrote: That sounds like you want to log when the collector starts, which is not the case and is confusing -- what collector is it talking about? This is about starting the log collector. Yea, good point. I was just wondering because I don't see 'start' used in anywhere at the beginning of a GUC variable. Good point too. In other places we just name the feature that's to be started, for example we don't use start_autovacuum. How about just log_collector then? Lets decide ASAP please - I want to get this committed. Works for me, or enable_log_collector? Based on Alvaro's comments, log_collector does sound like we are logging collector activity. The problem here is that log seems to be a verb in log_collector which is what makes it confusing. So we need another verb to make it clear that log is not one. This is not a problem with autovacuum because that one cannot be confused with a verb. start_log_collector still gets my vote. -- Alvaro Herrerahttp://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc. ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] [EMAIL PROTECTED]: Re: [GENERAL] array_to_set functions]
Bruce Momjian wrote: TODO item? Probably. See SQL2003 UNNEST: collection derived table ::= UNNEST left paren collection value expression right paren [ WITH ORDINALITY ] collection value expression ::= array value expression | multiset value expression Joe ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] change name of redirect_stderr?
The problem here is that log seems to be a verb in log_collector which is what makes it confusing. So we need another verb to make it clear that log is not one. This is not a problem with autovacuum because that one cannot be confused with a verb. start_log_collector still gets my vote. I vote against. Remember that some people look at the GUCs in alpha order from pg_settings. As such, if we're changing GUC names it ought to be in a way that groups logically if we can. log_collector_enable or log_collector_start or even log_redirect. But something with log_* -- Josh Berkus PostgreSQL @ Sun San Francisco ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] default_text_search_config and expression indexes
On 8/13/07, Bruce Momjian [EMAIL PROTECTED] wrote: Heikki Linnakangas wrote: Bruce Momjian wrote: Heikki Linnakangas wrote: Removing the default configuration setting altogether removes the 2nd problem, but that's not good from a usability point of view. And it doesn't solve the general issue, you can still do things like: SELECT * FROM foo WHERE to_tsvector('confA', textcol) @@ to_tsquery('confB', 'query'); True, but in that case you are specifically naming different configurations, so it is hopefully obvious you have a mismatch. There's many more subtle ways to do that. For example, filling a tsvector column using a DEFAULT clause. But then you sometimes fill it in the application instead, with a different configuration. Or if one of the function calls is buried in another user defined function. I don't think explicitly naming the configuration gives enough protection. Oh, wow, OK, well in that case the text search API isn't ready and we will have to hold this for 8.4. I've been watching this thread with a mixture of dread and hope, waiting to see where the developers' inclination will end up; whether leaving a useful foot gun available will be allowed. This is just my $0.02 as a fairly heavy user of the current tsearch2 code, but I sincerely hope you do not cripple the system by removing the ability to store tsvectors built using arbitrary configurations in a single column. Yes, it can lead to unexpected results if you do not know what you are doing, but if you have gone beyond building a single tsearch2 configuration then you are required to know what you are doing. What's more, IMO the default configuration mechanism feels very much like a CONSTRAINT, as Oleg suggests. That point is one of cognizance, where if one has gone to the trouble of setting up multiple configurations and has learned enough to do so correctly, then one necessarily understands the importance of the setting and can use it (or not, and use explicit configurations) correctly. The default config lowers the bar to an acceptable level for beginners that have no need of multiple configurations, and while I don't feel too strongly, personally, about having a default, I think it is both useful and helpful for new users -- it was for me. Now, so this email isn't entirely complaining, and as a data point for the discussion, I'll explain why I do not want to see tsearch2 crippled in the way suggested by Heikki and Bruce. My application (http://open-ils.org, which run 80% of the public libraries in Georgia, USA, http://gapines.org and http://georgialibraries.org/lib/pines.html) requires that I be able to search a corpus of bibliographic records in a mix of languages, and potentially with mixed stop-word rules, with one query. I cannot know ahead of time what languages will be used in the corpus and I cannot restrict any one query to one language. To accomplish this, the record itself will be inspected inside an INSERT/UPDATE trigger to determine the language and type, and use the correct configuration for creating the tsvector. This will obviously result in a mixed tsvector column, but that's exactly what I need. I can filter on record language if the user happens to specify a query language (and thus configuration), or simply rank the assumed (IP based, perhaps, or browser preference based) preferred language higher, or one of a hundred other things. But I won't be able to do any of that if tsvectors are required to have one and only one configuration per column. Anyway, I felt I needed to provide some outside perspective to this, as a user, since it seems that the external viewpoint (my particular viewpoint, at least) was missing from the discussion. Thanks, folks, for all the work on this so far! --miker ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] change name of redirect_stderr?
Josh Berkus [EMAIL PROTECTED] writes: The problem here is that log seems to be a verb in log_collector which is what makes it confusing. So we need another verb to make it clear that log is not one. This is not a problem with autovacuum because that one cannot be confused with a verb. start_log_collector still gets my vote. log_collector_enable or log_collector_start or even log_redirect. But something with log_* I'm voting with Alvaro on this. All of your suggestions are confusing because log looks like the verb, which it is not. Specifically, they sound like what the switch does is to cause a log message to be emitted about some action that would occur anyway. regards, tom lane ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] default_text_search_config and expression indexes
Mike Rylander wrote: This is just my $0.02 as a fairly heavy user of the current tsearch2 code, but I sincerely hope you do not cripple the system by removing the ability to store tsvectors built using arbitrary configurations in a single column. Yes, it can lead to unexpected results if you do not know what you are doing, but if you have gone beyond building a single tsearch2 configuration then you are required to know what you are doing. What's more, IMO the default configuration mechanism feels very much like a CONSTRAINT, as Oleg suggests. That point is one of cognizance, where if one has gone to the trouble of setting up multiple configurations and has learned enough to do so correctly, then one necessarily understands the importance of the setting and can use it (or not, and use explicit configurations) correctly. The default config lowers the bar to an acceptable level for beginners that have no need of multiple configurations, and while I don't feel too strongly, personally, about having a default, I think it is both useful and helpful for new users -- it was for me. Thanks for chiming in. As a disclaimer: I've never used tsearch2 in a real application myself. My application (http://open-ils.org, which run 80% of the public libraries in Georgia, USA, http://gapines.org and http://georgialibraries.org/lib/pines.html) requires that I be able to search a corpus of bibliographic records in a mix of languages, and potentially with mixed stop-word rules, with one query. I cannot know ahead of time what languages will be used in the corpus and I cannot restrict any one query to one language. To accomplish this, the record itself will be inspected inside an INSERT/UPDATE trigger to determine the language and type, and use the correct configuration for creating the tsvector. This will obviously result in a mixed tsvector column, but that's exactly what I need. I can filter on record language if the user happens to specify a query language (and thus configuration), or simply rank the assumed (IP based, perhaps, or browser preference based) preferred language higher, or one of a hundred other things. But I won't be able to do any of that if tsvectors are required to have one and only one configuration per column. Don't you need to use the right configuration to parse the query into a tsquery as well? What you have is basically the classic problem problem of representing inheritance in a relational model. You have a superclass, bibliographic record, and multiple subclasses, one per language with one extra field, the corpus in the right language. You've solved it by storing them all in one table, with an extra column (language) to denote which subclass the record is. Alternatively, you could solve it by using inherited tables, or having one table per language with a foreign key referencing the master table containing the other fields, or having a single table with one nullable field per configuration, and a check constraint to check that exactly one of them is not null. As a thought experiment, let me present another, not text search related example that's isomorphic to your problem: Imagine an international online auction system, where you can put items for sale and specify a minimum price. Part of the database schema is: CREATE TABLE item ( id INTEGER, description TEXT, currency CHAR(3), price NUMERIC ); It clearly doesn't make sense to directly compare prices in different currencies against each other. A query like WHERE price 1000 doesn't make sense, unless you also restrict the currency, or use an exchange rate to convert between currencies. Also, different currencies might have slightly different rules on how many decimal places are significant. In this example, as well as your bibliographic scenario, we can conveniently store prices in all currencies in the same field because they're all numerics. If we wanted to enforce per-currency rules, like NUMERIC(10, 2) for USD and NUMERIC(10,0) for Italian lires (which doesn't really exists anymore, I know), we'd need to store them in separate columns. And before the decimalisation in 1971, the British pound was divided into 20 shillings, each of which was divided to 12 pence, so it wouldn't have fit to normal numeric field, and we would again have to store that in a separate column. What I'm trying to point out is that the problem isn't unique to text search. It's an old problem, with many alternative solutions, even with strong typing. Arguably if you store data in multiple languages in same field, you have a denormalized schema. Granted, loose typing can be more convenient, but you give up the benefits of strong typing as well. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] default_text_search_config and expression indexes
Bruce Momjian wrote: Heikki Linnakangas wrote: Bruce Momjian wrote: Heikki Linnakangas wrote: Removing the default configuration setting altogether removes the 2nd problem, but that's not good from a usability point of view. And it doesn't solve the general issue, you can still do things like: SELECT * FROM foo WHERE to_tsvector('confA', textcol) @@ to_tsquery('confB', 'query'); True, but in that case you are specifically naming different configurations, so it is hopefully obvious you have a mismatch. There's many more subtle ways to do that. For example, filling a tsvector column using a DEFAULT clause. But then you sometimes fill it in the application instead, with a different configuration. Or if one of the function calls is buried in another user defined function. I don't think explicitly naming the configuration gives enough protection. Oh, wow, OK, well in that case the text search API isn't ready and we will have to hold this for 8.4. That would be unfortunate :(. Sorry I haven't looked at this earlier. Do you think that implementing a strongly typed system is too much work for 8.3? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Testing the async-commit patch
On Mon, 2007-08-13 at 16:27 -0400, Andrew Dunstan wrote: [asynch_commit] synchronous_commit = off [no_fsync] fsync = off This is the Windows INI file format. As such, it's easy to find code samples in almost any language that parse this format for you. For example, Python has a core library called ConfigParser that will read these in, and somewhere around here I even have some UNIX shell code that parses it. There's already a PostgreSQL-related project using this format--even on UNIX systems the odbc.ini config files look like this. I already have a program that generates multiple postgresql.conf files using this format around here for exactly this sort of test (compare benchmark results with two different configurations), just never had a reason to package it up for anybody else to use. It's trivial code if you use the Python parser. On Tue, 14 Aug 2007, Simon Riggs wrote: Sounds fine, though I'd prefer this in XML to allow further extensibility in the future. Putting this in XML significantly raises the bar for how complicated tools that work on these files must be, with the implicit dependencies that go with that. And as Andrew already pointed out, there is very little tree-structure to this data that justifies the extra complexity. -- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] change name of redirect_stderr?
Josh Berkus wrote: The problem here is that log seems to be a verb in log_collector which is what makes it confusing. So we need another verb to make it clear that log is not one. This is not a problem with autovacuum because that one cannot be confused with a verb. start_log_collector still gets my vote. I vote against. Remember that some people look at the GUCs in alpha order from pg_settings. As such, if we're changing GUC names it ought to be in a way that groups logically if we can. Should we change the ordering of pg_settings? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] default_text_search_config and expression indexes
On Tue, 14 Aug 2007, Alvaro Herrera wrote: Oleg Bartunov wrote: On Thu, 9 Aug 2007, [EMAIL PROTECTED] wrote: Maybe I'm missing something, but it seems to me that the configuration is more attached to a column/index thatn to the whole database. If there's a default in an expression, I'd rather expect this default to be drawn from the index involved than from a global value (like a functional index does now). I'm tired to repeat - index itself doesn't know about configuration ! Is there a way to change that? For example store the configuration in a metapage or something? it's useless, in general, since you could use different configuration to build tsvector. Regards, Oleg _ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83 ---(end of broadcast)--- TIP 6: explain analyze is your friend
[HACKERS] tsearch2 in PostgreSQL 8.3?
A lot of work has been done to try to get /contrib/tsearch2 into the core backend for 8.3, but we have hit a roadblock in how to handle multiple text search configurations. (FYI, the documentation is at http://momjian.us/expire/textsearch/HTML/textsearch.html.) There are three options for controlling text search configurations: 1) have a GUC variable which specifies the default configuration 2) require the configuration to be always specified 3) use the type system to automatically use the right configuration The problem with #1 is that is it error-prone (easy to mismatch configurations). One idea was to have the GUC be super-user-only but then restoring a dump as non-super-user is a problem. The problem with #2 is that it makes implicit and explicit casting impossible (there is no place to specify the configuration). #3 requires more code and is probably not something we want to do at this stage in 8.3 development. It requires passing typmod values between functions and operators (not something we have done easily in the past). Given this, should we decide to not include full text search in 8.3? -- Bruce Momjian [EMAIL PROTECTED] http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] change name of redirect_stderr?
On Aug 14, 2007, at 12:40 , Tom Lane wrote: Josh Berkus [EMAIL PROTECTED] writes: The problem here is that log seems to be a verb in log_collector which is what makes it confusing. So we need another verb to make it clear that log is not one. This is not a problem with autovacuum because that one cannot be confused with a verb. start_log_collector still gets my vote. log_collector_enable or log_collector_start or even log_redirect. But something with log_* I'm voting with Alvaro on this. All of your suggestions are confusing because log looks like the verb, which it is not. Specifically, they sound like what the switch does is to cause a log message to be emitted about some action that would occur anyway. AIUI, if the-GUC-yet-to-be-named is not enabled, no logging is done at all: messages are just sent to stderr. Why something simple like enable_logging or start_logger? Michael Glaesemann grzm seespotcode net ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] tsearch2 in PostgreSQL 8.3?
Bruce Momjian wrote: A lot of work has been done to try to get /contrib/tsearch2 into the core backend for 8.3, but we have hit a roadblock in how to handle multiple text search configurations. (FYI, the documentation is at http://momjian.us/expire/textsearch/HTML/textsearch.html.) There are three options for controlling text search configurations: 1) have a GUC variable which specifies the default configuration 2) require the configuration to be always specified 3) use the type system to automatically use the right configuration The problem with #1 is that is it error-prone (easy to mismatch configurations). One idea was to have the GUC be super-user-only but then restoring a dump as non-super-user is a problem. What is the worst consequence of mismatching configuration? Does it cause a system crash? A backend hang? A corrupted index? Lost data? Or does it, as I assume, just fail to return the exact result set that would be returned if the correct configuration was supplied? If the answer is the latter, I think this is not so huge a problem that FULL TEXT should be rejected for 8.3 on these grounds. We just tell people to use the correct query and be done with it. -- Alvaro Herrera http://www.amazon.com/gp/registry/CTMLCN8V17R4 La vida es para el que se aventura ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] tsearch2 in PostgreSQL 8.3?
Bruce Momjian wrote: #3 requires more code and is probably not something we want to do at this stage in 8.3 development. It requires passing typmod values between functions and operators (not something we have done easily in the past). It does? I was thinking of implicitly creating a new type, with no typmod, when you create a new configuration. Similar to enums, I think. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] change name of redirect_stderr?
Michael Glaesemann [EMAIL PROTECTED] writes: AIUI, if the-GUC-yet-to-be-named is not enabled, no logging is done at all: messages are just sent to stderr. Why something simple like enable_logging or start_logger? Um, that's still logging by my definition. I could live with start_logger, since that is not the same as logging. It could be that if we want real consistency we're going to have to make more changes than this one. Consider something like this: * All variables that cause a specific kind of log message to be printed or not shall be named log_something. (So log_ is a verb.) * Variables that affect the logging mechanism as a whole shall be named logging_something. For example, log_line_prefix is misnamed under this rule, and ought to be logging_line_prefix. Similarly, redirect_stderr would become logging_something --- I'd prefer logging_start_collector but could live with logging_collector (or maybe logging_use_collector?) Looking at the docs, there are a whole bunch of GUCs that would have to be renamed to meet this rule, but I think it would become clearer what does what. Is that too radical? regards, tom lane ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] tsearch2 in PostgreSQL 8.3?
Alvaro Herrera wrote: What is the worst consequence of mismatching configuration? Does it cause a system crash? A backend hang? A corrupted index? Lost data? Or does it, as I assume, just fail to return the exact result set that would be returned if the correct configuration was supplied? Your assumption is correct. You can mismatch configurations not just by querying in a wrong configuration, but also by accidentally storing tsvectors generated with different configurations in the same column (with no additional column like Mike Rylander had to tell them apart), but it's still going to look OK from PostgreSQL's point of view. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] tsearch2 in PostgreSQL 8.3?
On Tue, Aug 14, 2007 at 03:15:44PM -0400, Alvaro Herrera wrote: Bruce Momjian wrote: A lot of work has been done to try to get /contrib/tsearch2 into the core backend for 8.3, but we have hit a roadblock in how to handle multiple text search configurations. (FYI, the documentation is at http://momjian.us/expire/textsearch/HTML/textsearch.html.) There are three options for controlling text search configurations: 1) have a GUC variable which specifies the default configuration 2) require the configuration to be always specified 3) use the type system to automatically use the right configuration The problem with #1 is that is it error-prone (easy to mismatch configurations). One idea was to have the GUC be super-user-only but then restoring a dump as non-super-user is a problem. What is the worst consequence of mismatching configuration? Does it cause a system crash? A backend hang? A corrupted index? Lost data? Or does it, as I assume, just fail to return the exact result set that would be returned if the correct configuration was supplied? If the answer is the latter, I think this is not so huge a problem that FULL TEXT should be rejected for 8.3 on these grounds. We just tell people to use the correct query and be done with it. I think that users of the full-text option would need to read the documentation and we could include any needed caveats. This would certainly do for the 8.3 release and would give us time to simplify the management and use in the 8.4 release. I, for one, have been waiting a long time for it to be integrated into the database. Ken ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] change name of redirect_stderr?
Heikki Linnakangas wrote: Josh Berkus wrote: The problem here is that log seems to be a verb in log_collector which is what makes it confusing. So we need another verb to make it clear that log is not one. This is not a problem with autovacuum because that one cannot be confused with a verb. start_log_collector still gets my vote. I vote against. Remember that some people look at the GUCs in alpha order from pg_settings. As such, if we're changing GUC names it ought to be in a way that groups logically if we can. Should we change the ordering of pg_settings? Yeah, this is not a good reason for deciding a naming issue. If we really want thematic collection of GUC vars then we should arrange for some sort of hierarchical naming, and appropriate sectioning of the config file. cheers andrew ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] change name of redirect_stderr?
On 8/15/07, Tom Lane [EMAIL PROTECTED] wrote: For example, log_line_prefix is misnamed under this rule, and ought to be logging_line_prefix. Similarly, redirect_stderr would become logging_something --- I'd prefer logging_start_collector but could live with logging_collector (or maybe logging_use_collector?) The consistent prefix idea sounds good; does logging_enable jive with your proposal? Introduction of the term collector might be an overthink, and could confuse people. Your average postgres user tweaking his config file is going to see that and wonder what on earth a log collector is. Whereas it's generally understood that to log is to capture output and make it persistent, and logging_enable is clearly a setting that allows this to take place. ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] tsearch2 in PostgreSQL 8.3?
Heikki Linnakangas wrote: Bruce Momjian wrote: #3 requires more code and is probably not something we want to do at this stage in 8.3 development. It requires passing typmod values between functions and operators (not something we have done easily in the past). It does? I was thinking of implicitly creating a new type, with no typmod, when you create a new configuration. Similar to enums, I think. So each new configuration is a new data type? How do the tsearch functions handle these new data types? I also question if this can be completed soon. I have seen no specification yet, let alone someone coding it. Basically, the default GUC doesn't work because of: error prone if super-user only, non-super-user doesn't work on restore if non-super-user, can cause mismatch (perhaps this is the best option), and restore still a problem (no storage of config in indexes or tables) No one seems to like the always-specify the configuration (loses cast ability). And I don't see the code for new type appearing anytime soon. Bottom line --- we better figure out something quick or it isn't going to be in 8.3, and at this point, I am starting to doubt a solution will magically appear. -- Bruce Momjian [EMAIL PROTECTED] http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] change name of redirect_stderr?
With so many people trying to paint this particular bikeshed, my suggestion to Andrew is to commit the patch as is and leave the rename for a later patch. -- Alvaro Herrera http://www.amazon.com/gp/registry/DXLWNGRJD34J Jude: I wish humans laid eggs Ringlord: Why would you want humans to lay eggs? Jude: So I can eat them ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] default_text_search_config and expression indexes
On 8/14/07, Heikki Linnakangas [EMAIL PROTECTED] wrote: Mike Rylander wrote: [snip] Don't you need to use the right configuration to parse the query into a tsquery as well? Only if the user (or user agent) can supply enough information to move away from the configured default of, say, en-US. And even then, other tsvector configurations need to be searched. Configurations are not, of course, just about language/stemming, but also stop-words and thesauri. If it were just languages it wouldn't be too big of a deal (other than in-app DDL management...). What you have is basically the classic problem problem of representing inheritance in a relational model. You have a superclass, bibliographic record, and multiple subclasses, one per language with one extra field, the corpus in the right language. You've solved it by storing them all in one table, with an extra column (language) to denote which subclass the record is. Alternatively, you could solve it by using inherited tables, or having one table per language with a foreign key referencing the master table containing the other fields, or having a single table with one nullable field per configuration, and a check constraint to check that exactly one of them is not null. Sorry, communication problem here ... I provided an oversimplified example meant more to show the issues than to find alternate solutions, though I appreciate you taking the time to consider the problem. If I only had to consider one delimiting facet per record then it would be much simpler. But with the complication that some fields have stop-word lists (and some not), some use thesauri (and different ones, at that), and these fields (as extracted from the records) and their configurations (stem? use a stop-word list? use thesaurus x, y or z?) are user-defined ... As a thought experiment, let me present another, not text search related example that's isomorphic to your problem: Unfortunately, the problem you describe is not quite the same as the problem I need to solve. Fortunately, the current incarnation of tsearch2 does a pretty good job of solving the problem if you store everything in one place and use the query-time environment to apply some adjustments to the ranking of items. I could still work around this problem by creating inherited tables, one for each configuration on each index-providing table but I /really/ hope to avoid that. Creating new configurations for local requirements doesn't require creating new tables (and the associated management overhead in the app) today, something I'd really like to avoid. In fact, I'm starting to sweat just thinking about what the planner would go through with the number tables needed for the potential configurations in an installation that makes use of multiple thesauri and a mix of stop-word lists across, say, 30 languages. Such a dataset is not uncommon. In any case, thanks again for taking the time to think about the problem. I still think having the ability to store any old tsvector I happen to have hanging around in any column of the correct type is a GoodThing(tm). I see from Oleg's message down-thread that that's the way things will be (the tsvector type doesn't know about columns, just lexem tuples). --miker ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] default_text_search_config and expression indexes
Heikki Linnakangas wrote: Bruce Momjian wrote: Heikki Linnakangas wrote: Bruce Momjian wrote: Heikki Linnakangas wrote: Removing the default configuration setting altogether removes the 2nd problem, but that's not good from a usability point of view. And it doesn't solve the general issue, you can still do things like: SELECT * FROM foo WHERE to_tsvector('confA', textcol) @@ to_tsquery('confB', 'query'); True, but in that case you are specifically naming different configurations, so it is hopefully obvious you have a mismatch. There's many more subtle ways to do that. For example, filling a tsvector column using a DEFAULT clause. But then you sometimes fill it in the application instead, with a different configuration. Or if one of the function calls is buried in another user defined function. I don't think explicitly naming the configuration gives enough protection. Oh, wow, OK, well in that case the text search API isn't ready and we will have to hold this for 8.4. That would be unfortunate :(. Sorry I haven't looked at this earlier. Do you think that implementing a strongly typed system is too much work for 8.3? Yea, probably. See my other posting with an updated subject line. -- Bruce Momjian [EMAIL PROTECTED] http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] change name of redirect_stderr?
Brendan Jurd [EMAIL PROTECTED] writes: On 8/15/07, Tom Lane [EMAIL PROTECTED] wrote: For example, log_line_prefix is misnamed under this rule, and ought to be logging_line_prefix. Similarly, redirect_stderr would become logging_something --- I'd prefer logging_start_collector but could live with logging_collector (or maybe logging_use_collector?) The consistent prefix idea sounds good; does logging_enable jive with your proposal? I dislike it. I claim that logging to plain stderr (without the syslogger process) is still logging. Logging to syslog (which also doen't need the syslogger process) is *definitely* logging. Something named logging_enable would suggest to the normal person that without it turned on, you'll get *nothing*. I'm not wedded to collector per se, but you really cannot escape the fact that there is one more concept in here than you wish to admit. I think that reflecting the existence of a collector process in the GUC names makes things clearer, not less clear. regards, tom lane ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] tsearch2 in PostgreSQL 8.3?
Bruce Momjian wrote: Basically, the default GUC doesn't work because of: error prone if super-user only, non-super-user doesn't work on restore if non-super-user, can cause mismatch (perhaps this is the best option), and restore still a problem (no storage of config in indexes or tables) I haven't really seen anyone else arguing about this. I wonder whether you are being overly zealous about it. -- Alvaro Herrera http://www.flickr.com/photos/alvherre/ Doing what he did amounts to sticking his fingers under the hood of the implementation; if he gets his fingers burnt, it's his problem. (Tom Lane) ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] default_text_search_config and expression indexes
Mike Rylander wrote: This is just my $0.02 as a fairly heavy user of the current tsearch2 code, but I sincerely hope you do not cripple the system by removing the ability to store tsvectors built using arbitrary configurations in a single column. Yes, it can lead to unexpected results if you do not know what you are doing, but if you have gone beyond building a single tsearch2 configuration then you are required to know what you are doing. What's more, IMO the default configuration mechanism feels very much like a CONSTRAINT, as Oleg suggests. That point is one of cognizance, where if one has gone to the trouble of setting up multiple configurations and has learned enough to do so correctly, then one necessarily understands the importance of the setting and can use it (or not, and use explicit configurations) correctly. The default config lowers the bar to an acceptable level for beginners that have no need of multiple configurations, and while I don't feel too strongly, personally, about having a default, I think it is both useful and helpful for new users -- it was for me. What has really hurt the default GUC idea is how to do restores from a pg_dump. How do you make sure the right default is used on a restore, particularly if multiple objects are being restored, and each has a different default GUC. I suppose your trigger handles that but that isn't going to help with an expression index, nor in cases where the default of the old database is different from the new one. -- Bruce Momjian [EMAIL PROTECTED] http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] tsearch2 in PostgreSQL 8.3?
Alvaro Herrera wrote: Bruce Momjian wrote: Basically, the default GUC doesn't work because of: error prone if super-user only, non-super-user doesn't work on restore if non-super-user, can cause mismatch (perhaps this is the best option), and restore still a problem (no storage of config in indexes or tables) I haven't really seen anyone else arguing about this. I wonder whether you are being overly zealous about it. Uh, OK, but no one has told me how a database restore without a configuration name would work, so I am all ears. CREATE INDEX ii on x(to_tsvector(col)) There is nothing that says what configuration that index should use except the default setting, and if that is different in the restore database, you have problems. Same for a trigger that calls to_tsvector and is restored into a different database. -- Bruce Momjian [EMAIL PROTECTED] http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] tsearch2 in PostgreSQL 8.3?
On 8/14/07, Alvaro Herrera [EMAIL PROTECTED] wrote: Bruce Momjian wrote: Basically, the default GUC doesn't work because of: error prone if super-user only, non-super-user doesn't work on restore if non-super-user, can cause mismatch (perhaps this is the best option), and restore still a problem (no storage of config in indexes or tables) I haven't really seen anyone else arguing about this. I wonder whether you are being overly zealous about it. I hate to just pile on Bruce, but as a production user I tend to agree... let me shoot myself in the foot. :) --miker ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] change name of redirect_stderr?
On 8/15/07, Tom Lane [EMAIL PROTECTED] wrote: Brendan Jurd [EMAIL PROTECTED] writes: The consistent prefix idea sounds good; does logging_enable jive with your proposal? I dislike it. I claim that logging to plain stderr (without the syslogger process) is still logging. Logging to syslog (which also doen't need the syslogger process) is *definitely* logging. Something named logging_enable would suggest to the normal person that without it turned on, you'll get *nothing*. I'm not wedded to collector per se, but you really cannot escape the fact that there is one more concept in here than you wish to admit. I think that reflecting the existence of a collector process in the GUC names makes things clearer, not less clear. Fair enough. I just took a fresh look at postmaster.conf, and indeed the logging variables are more complex than I gave them credit for with logging_enable. Retracted. ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] [EMAIL PROTECTED]: Re: [GENERAL] array_to_set functions]
On Tue, Aug 14, 2007 at 05:38:33PM +0200, Pavel Stehule wrote: 2007/8/14, Bruce Momjian [EMAIL PROTECTED]: TODO item? + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 6: explain analyze is your friend I am against. It's too simple do it in SQL language. Why make everyone who works with arrays create a function just to do this? Something that's of use to common users should be included, simple or not. -- Decibel!, aka Jim Nasby[EMAIL PROTECTED] EnterpriseDB http://enterprisedb.com 512.569.9461 (cell) pgphs3V40sOor.pgp Description: PGP signature
Re: [HACKERS] default_text_search_config and expression indexes
Bruce Momjian escribió: Mike Rylander wrote: This is just my $0.02 as a fairly heavy user of the current tsearch2 code, but I sincerely hope you do not cripple the system by removing the ability to store tsvectors built using arbitrary configurations in a single column. Yes, it can lead to unexpected results if you do not know what you are doing, but if you have gone beyond building a single tsearch2 configuration then you are required to know what you are doing. What's more, IMO the default configuration mechanism feels very much like a CONSTRAINT, as Oleg suggests. That point is one of cognizance, where if one has gone to the trouble of setting up multiple configurations and has learned enough to do so correctly, then one necessarily understands the importance of the setting and can use it (or not, and use explicit configurations) correctly. The default config lowers the bar to an acceptable level for beginners that have no need of multiple configurations, and while I don't feel too strongly, personally, about having a default, I think it is both useful and helpful for new users -- it was for me. What has really hurt the default GUC idea is how to do restores from a pg_dump. How do you make sure the right default is used on a restore, particularly if multiple objects are being restored, and each has a different default GUC. I suppose your trigger handles that but that isn't going to help with an expression index, nor in cases where the default of the old database is different from the new one. I guess what should happen is that pg_dump should include a SET default_text_search_config = 'foo' just before the CREATE INDEX, like we do for other variables. Of course, in order for this to work, the index itself must know what value was used on creation. Oleg already dismissed my suggestion of putting it into the index itself (a metapage or something). Maybe store it in reloptions? -- Alvaro Herrerahttp://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc. ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] default_text_search_config and expression indexes
Oleg Bartunov wrote: On Tue, 14 Aug 2007, Alvaro Herrera wrote: Oleg Bartunov wrote: On Thu, 9 Aug 2007, [EMAIL PROTECTED] wrote: Maybe I'm missing something, but it seems to me that the configuration is more attached to a column/index thatn to the whole database. If there's a default in an expression, I'd rather expect this default to be drawn from the index involved than from a global value (like a functional index does now). I'm tired to repeat - index itself doesn't know about configuration ! Is there a way to change that? For example store the configuration in a metapage or something? it's useless, in general, since you could use different configuration to build tsvector. Hmm, sorry, I think I just understood what this was about: so you mean that the configuration is really *per row* and not per index? So I can store rows into an index using more than one configuration, and it will work? -- Alvaro Herrerahttp://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc. ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
[HACKERS] Index Tuple Compression Approach?
I recently had a chat with someone who was pretty intimate with Adabas for a number of years who's in the process of figuring things out about PostgreSQL. We poked at bits of the respective implementations, seeing some similarities and differences. He pointed out one aspect of index handling that could (in principle) be an interesting optimization. Evidently, in Adabas, index leaf nodes were not simply tuples, but lists where the index value would not be repeated. In PostgreSQL, if you have the index value 'abc', and there are 10 tuples with that value, then you'll have a page full of tuples of the following form: |abc|ptr[rec1]|abc|ptr[rec2]|abc|ptr[rec3]| ...and so forth... Now, the Adabas approach was rather different. It would only have the index value once, and then have the list of tuple pointers: |abc|ptr[rec1],ptr[rec2],ptr[rec3],...[ptr[rec10]| This could allow a fair bit of compression, for cases where the index value is not unique. There is a concommitant downside, that concurrent updates may fight over a page, and, since there would be a higher density, there would be more need to fight over pages. Does this seem pretty much like madness? Or is it a plausible some day ToDo? -- cbbrowne,@,acm.org http://linuxfinances.info/info/postgresql.html I don't do drugs anymore 'cause I find I get the same effect just by standing up really fast. -- Jonathan Katz ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Index Tuple Compression Approach?
Isn't this what Grouped Index Tuples is? On Tue, Aug 14, 2007 at 05:21:16PM -0400, Chris Browne wrote: I recently had a chat with someone who was pretty intimate with Adabas for a number of years who's in the process of figuring things out about PostgreSQL. We poked at bits of the respective implementations, seeing some similarities and differences. He pointed out one aspect of index handling that could (in principle) be an interesting optimization. Evidently, in Adabas, index leaf nodes were not simply tuples, but lists where the index value would not be repeated. In PostgreSQL, if you have the index value 'abc', and there are 10 tuples with that value, then you'll have a page full of tuples of the following form: |abc|ptr[rec1]|abc|ptr[rec2]|abc|ptr[rec3]| ...and so forth... Now, the Adabas approach was rather different. It would only have the index value once, and then have the list of tuple pointers: |abc|ptr[rec1],ptr[rec2],ptr[rec3],...[ptr[rec10]| This could allow a fair bit of compression, for cases where the index value is not unique. There is a concommitant downside, that concurrent updates may fight over a page, and, since there would be a higher density, there would be more need to fight over pages. Does this seem pretty much like madness? Or is it a plausible some day ToDo? -- cbbrowne,@,acm.org http://linuxfinances.info/info/postgresql.html I don't do drugs anymore 'cause I find I get the same effect just by standing up really fast. -- Jonathan Katz ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org -- Decibel!, aka Jim Nasby[EMAIL PROTECTED] EnterpriseDB http://enterprisedb.com 512.569.9461 (cell) pgppr2SjO8mDI.pgp Description: PGP signature
Re: [HACKERS] default_text_search_config and expression indexes
On 8/14/07, Alvaro Herrera [EMAIL PROTECTED] wrote: Oleg Bartunov wrote: On Tue, 14 Aug 2007, Alvaro Herrera wrote: Oleg Bartunov wrote: On Thu, 9 Aug 2007, [EMAIL PROTECTED] wrote: Maybe I'm missing something, but it seems to me that the configuration is more attached to a column/index thatn to the whole database. If there's a default in an expression, I'd rather expect this default to be drawn from the index involved than from a global value (like a functional index does now). I'm tired to repeat - index itself doesn't know about configuration ! Is there a way to change that? For example store the configuration in a metapage or something? it's useless, in general, since you could use different configuration to build tsvector. Hmm, sorry, I think I just understood what this was about: so you mean that the configuration is really *per row* and not per index? So I can store rows into an index using more than one configuration, and it will work? Can and does, to great success. :) --miker ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] tsearch2 in PostgreSQL 8.3?
Bruce Momjian [EMAIL PROTECTED] writes: Alvaro Herrera wrote: I haven't really seen anyone else arguing about this. I wonder whether you are being overly zealous about it. Uh, OK, but no one has told me how a database restore without a configuration name would work, so I am all ears. It's the dump/restore problem that really is the kicker here. I don't mind so much a GUC that only controls the interpretation of queries, but if it determines how an index or a tsvector column gets rebuilt, we've got problems. I've just finished re-reading the prior thread, and here are what seem to me to be the salient points: * Oleg, Teodor, and all of the old-line users of tsearch2 are comfortable with setting up a trigger to maintain a materialized tsvector column for a table. They argue that recomputing the tsvector (possibly more than once) during a query is more expensive than fetching it from disk. My suspicion is that the latter argument gets weaker every year --- CPUs are getting faster lots faster than disks are. * Bruce (and I ... not sure about anyone else) want to support usage of text search via a functional index. This is argued to be easier to set up (no fooling with triggers) and possibly faster depending on CPU vs I/O speeds. I don't think there is any desire here to eliminate the trigger approach, just to provide an alternative. * For *either* the trigger or functional-index approach, I think it is unacceptable to rely on a GUC variable to determine how the tsvector is derived from the raw-document fields for storage or indexing. It's just too error-prone, particularly when you consider dump-and-reload cases. What I think we should say is that the ts parsing configuration name can be either hardwired or taken from another field of the table. In the trigger case this would mean providing a couple of standard triggers, one taking the config name as a trigger parameter, and the other accepting a trigger parameter that's the name of the config name column. In the index case this would mean that the index expression has to be either to_tsvector('constant', ...) or to_tsvector(field, ...). Note that all four cases boil down to saying that stored or indexed tsvectors have to be derived from the two-parameter form of to_tsvector. * For queries, there is not anything very wrong with having a default configuration, but the trick is how to get the planner to match that up with an index that's written with the two-parameter form of to_tsvector. One hackish possibility is to define the single-parameter form of to_tsvector like this: create function to_tsvector(text) returns tsvector as $$ select to_tsvector(get_default_text_search_config(), $1) $$ language sql strict stable; where get_default_text_search_config() is essentially just current_setting('default_text_search_config') except it is misleadingly marked immutable. Then, a query with WHERE to_tsvector(document_col) @@ tsquery(...) will have the SQL function inlined, and the get_default_text_search_config() call const-folded, and suddenly it looks like WHERE to_tsvector('english', document_col) @@ tsquery(...) and can be matched to a functional index that's declared using the explicit 'english' configuration name. This is pretty grotty though ... can anyone think of a better way? (The main objection I can see to it is that someone could shoot himself in the foot by using this function instead of two-parameter to_tsvector in a custom trigger function. But hopefully, anyone writing a custom trigger function will have read the manual's warning not to do that. Note that I suggest marking the function stable so that it can't be misused that way in a functional index. Another possible objection is that get_default_text_search_config() is a foot-gun all by itself, since it could be used in a functional index. Aside from not documenting it I'm not sure there's much to be done about that.) * I'm not enamored of Heikki's idea about a datatype-based solution, because I don't think that it will allow columns containing tsvectors derived with different configurations. It's perfectly clear that advanced users want to be able to do that, and it's also clear that as long as the config name is coming from a stored column (or can be reconstructed somehow from the stored data) that it's perfectly well-defined. Since I don't think that a datatype solution is the way to go, I don't feel that we are as far away from an agreement as Bruce is worried about. regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] tsearch2 in PostgreSQL 8.3?
Mike Rylander wrote: On 8/14/07, Alvaro Herrera [EMAIL PROTECTED] wrote: Bruce Momjian wrote: Basically, the default GUC doesn't work because of: error prone if super-user only, non-super-user doesn't work on restore if non-super-user, can cause mismatch (perhaps this is the best option), and restore still a problem (no storage of config in indexes or tables) I haven't really seen anyone else arguing about this. I wonder whether you are being overly zealous about it. I hate to just pile on Bruce, but as a production user I tend to agree... let me shoot myself in the foot. :) --miker Sure, we can document hazards, but the larger problem is related to the fact that the default controls what gets stored in the database. This is a similar problem to when we had an autocommit GUC which caused problems. Technically, this is like how the server encoding affects what is stored in the database. If we allowed users to change the server encoding in a database that already had data in it, there would be no way to identify which data was using the old encoding and which was using the new one. Now, the application might be able to identify them just fine, but a database restore would be unable to recreate the data the same way. If we want to keep the default GUC we would have to allow non-super-user changes so we can use it in pg_dump for restore, but even then if the default is different there is going to be a mix of old/new after the restore because table changes after the restore is going to use the new default config. Perhaps the best we could do is to tell people who change the default GUC that they are on their own in restoring the database, or they have to be very carful like with triggers to assign the configuration properly. I can imagine how complex that part of the documentation will be, but it is doable. The other point is that we should have a good idea of the API because if it gets into 8.3 it will be harder to change. -- Bruce Momjian [EMAIL PROTECTED] http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] default_text_search_config and expression indexes
Alvaro Herrera [EMAIL PROTECTED] writes: Bruce Momjian escribió: What has really hurt the default GUC idea is how to do restores from a pg_dump. I guess what should happen is that pg_dump should include a SET default_text_search_config = 'foo' just before the CREATE INDEX, like we do for other variables. The nasty cases are in data-only dumps, that is, where you're trying to load data into a table with pre-existing indexes or triggers. A SET like the above is at least as likely to be wrong as right, if the index or trigger depends on it to tell it what to do. regards, tom lane ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
[HACKERS] CVS corruption/mistagging?
Looking into recent buildfarm failures on the 7.4 branch: http://www.pgbuildfarm.org/cgi-bin/show_status.pl It looks like parts of the CVS repository have been mistagged as belonging to REL7_4_STABLE or have been corrupted somehow: http://developer.postgresql.org/cvsweb.cgi/pgsql/contrib/btree_gist/btree_bit.c?sortby=date;only_with_tag=REL7_4_STABLE I'm not sure what's going on here, but something has gone bad. Kris Jurka ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] tsearch2 in PostgreSQL 8.3?
Bruce Momjian [EMAIL PROTECTED] writes: The other point is that we should have a good idea of the API because if it gets into 8.3 it will be harder to change. Yeah, once it's in core we have a pretty strong backwards-compatibility restriction to deal with. Someone upthread claimed we can always simplify it later but that's exactly backward --- we can add features later, but we can't subtract. Maybe we should be looking to implement just the minimum set of features for 8.3 and leave some of the more controversial stuff for 8.4. I hate to admit it, but if we take that point of view then triggers are in and functional-index support is out. We have to support the trigger approach because it's what is in tsearch2 now, and the existing users will expect to continue to have that option. However, allowing the standard triggers to pay attention to a configuration GUC variable is simply broken; that bit has to go away. regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] CVS corruption/mistagging?
Kris Jurka [EMAIL PROTECTED] writes: It looks like parts of the CVS repository have been mistagged as belonging to REL7_4_STABLE or have been corrupted somehow: http://developer.postgresql.org/cvsweb.cgi/pgsql/contrib/btree_gist/btree_bit.c?sortby=date;only_with_tag=REL7_4_STABLE Hmm ... btree_bit.c shouldn't be in 7.4 at all. I can't help thinking this has something to do with the recent CVS server move. Magnus, any thoughts? regards, tom lane ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] default_text_search_config and expression indexes
Mike Rylander [EMAIL PROTECTED] writes: My application (http://open-ils.org, which run 80% of the public libraries in Georgia, USA, http://gapines.org and http://georgialibraries.org/lib/pines.html) requires that I be able to search a corpus of bibliographic records in a mix of languages, and potentially with mixed stop-word rules, with one query. I cannot know ahead of time what languages will be used in the corpus and I cannot restrict any one query to one language. To accomplish this, the record itself will be inspected inside an INSERT/UPDATE trigger to determine the language and type, and use the correct configuration for creating the tsvector. This will obviously result in a mixed tsvector column, but that's exactly what I need. I can filter on record language if the user happens to specify a query language (and thus configuration), or simply rank the assumed (IP based, perhaps, or browser preference based) preferred language higher, or one of a hundred other things. But I won't be able to do any of that if tsvectors are required to have one and only one configuration per column. Anyway, I felt I needed to provide some outside perspective to this, as a user, since it seems that the external viewpoint (my particular viewpoint, at least) was missing from the discussion. This is *extremely* useful. I think it's precisely what we've been missing so far. At least, what I've been missing. So the question is what exactly happens in this case? If I search for the does that mean it will ignore matches in English where that's a stop-word but find me books on tea in French? Is that what I should expect to happen? What if I search for earl and the? Does that find me French books on Early Grey Tea but English books on all earls? What happens if I use the same operator directly on the text column? Or perhaps it's not even possible to specify stop-words when operating on a text column? Should it be? -- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] CVS corruption/mistagging?
I wrote: Kris Jurka [EMAIL PROTECTED] writes: It looks like parts of the CVS repository have been mistagged as belonging to REL7_4_STABLE or have been corrupted somehow: http://developer.postgresql.org/cvsweb.cgi/pgsql/contrib/btree_gist/btree_bit.c?sortby=date;only_with_tag=REL7_4_STABLE Hmm ... btree_bit.c shouldn't be in 7.4 at all. I did a fresh checkout of the 7.4 branch and diff'd against my local copy, and it seems clear that every file that was not in 7.4 at all has had its HEAD version tagged as REL7_4_STABLE. The files that did exist then are all right. That's throughout the whole tree, not just in contrib/btree_gist. I have no idea how you make CVS do that, but I'm sure there is some magic one-liner for it. Checking the files in contrib/btree_gist on the CVS server gives a pretty good fix on who changed it and when: ls -la total 466 drwxrwxr-x 6 scrappy dev 1024 Aug 14 22:30 . drwxrwxr-x 90 scrappy dev 2048 Aug 14 22:27 .. drwxrwxr-x 2 scrappy dev512 Apr 20 15:16 Attic -r--r--r-- 1 tgl dev 9555 Jun 26 22:05 Makefile,v -r--r--r-- 1 258 dev 5870 Apr 20 16:19 README.btree_gist,v -r--r--r-- 1 meskes dev 12052 Aug 12 09:50 btree_bit.c,v -r--r--r-- 1 meskes dev 9921 Aug 12 09:50 btree_bytea.c,v -r--r--r-- 1 meskes dev 10246 Aug 12 09:50 btree_cash.c,v -r--r--r-- 1 meskes dev 11433 Aug 12 09:50 btree_date.c,v -r--r--r-- 1 meskes dev 10230 Aug 12 09:50 btree_float4.c,v -r--r--r-- 1 meskes dev 10093 Aug 12 09:50 btree_float8.c,v -r--r--r-- 1 meskes dev 33935 Aug 12 09:50 btree_gist.c,v -r--r--r-- 1 258 dev 4325 Apr 20 15:16 btree_gist.h,v -r--r--r-- 1 258 dev 58744 Apr 20 16:19 btree_gist.sql.in,v -r--r--r-- 1 meskes dev 14736 Aug 12 09:50 btree_inet.c,v -r--r--r-- 1 meskes dev 10253 Aug 12 09:50 btree_int2.c,v -r--r--r-- 1 meskes dev 10240 Aug 12 09:50 btree_int4.c,v -r--r--r-- 1 meskes dev 10254 Aug 12 09:50 btree_int8.c,v -r--r--r-- 1 meskes dev 18111 Aug 12 09:50 btree_interval.c,v -r--r--r-- 1 meskes dev 12025 Aug 12 09:50 btree_macaddr.c,v -r--r--r-- 1 meskes dev 14671 Aug 12 09:50 btree_numeric.c,v -r--r--r-- 1 meskes dev 9796 Aug 12 09:50 btree_oid.c,v -r--r--r-- 1 meskes dev 18247 Aug 12 09:50 btree_text.c,v -r--r--r-- 1 meskes dev 26180 Aug 12 09:50 btree_time.c,v -r--r--r-- 1 258 dev 29712 Apr 20 15:16 btree_ts.c,v -r--r--r-- 1 meskes dev 17588 Aug 12 09:50 btree_utils_num.c,v -r--r--r-- 1 meskes dev 8959 Aug 12 09:50 btree_utils_num.h,v -r--r--r-- 1 meskes dev 48824 Aug 12 09:50 btree_utils_var.c,v -r--r--r-- 1 meskes dev 7099 Aug 12 09:50 btree_utils_var.h,v drwxrwxr-x 3 scrappy dev 1024 Aug 14 22:27 data drwxrwxr-x 3 scrappy dev 1024 Aug 14 22:27 expected drwxrwxr-x 3 scrappy dev 1024 Aug 14 22:27 sql -r--r--r-- 1 meskes dev 10021 Aug 12 09:50 uninstall_btree_gist.sql,v Michael, want to fess up to what you did? In the meantime, though, it's not quite clear why this would lead to a buildfarm failure --- it should just mean a lot of extraneous files appearing in a fresh checkout. (Looks a bit harder ... Oh, it looks like btree_gist has some files that used to be autogenerated and are now in CVS, so the bogusly new versions from CVS are suppressing the desired generation from the old btree_num.c file.) regards, tom lane ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] CVS corruption/mistagging?
Tom Lane wrote: In the meantime, though, it's not quite clear why this would lead to a buildfarm failure --- it should just mean a lot of extraneous files appearing in a fresh checkout. (Looks a bit harder ... Oh, it looks like btree_gist has some files that used to be autogenerated and are now in CVS, so the bogusly new versions from CVS are suppressing the desired generation from the old btree_num.c file.) Looking at the Committers mail, it looks like there have only been two very small commits since Michael's series of commits around 12 to 13.5 hours ago, and before that nothing since around 28 hours ago. Do we have a backup snapshot of the repo taken during that time that we can roll back to? That might be simpler than doing surgery on the repo. cheers andrew ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Index Tuple Compression Approach?
On Tue, 2007-08-14 at 16:27 -0500, Decibel! wrote: Isn't this what Grouped Index Tuples is? http://community.enterprisedb.com/git/git-readme.txt It looks like GIT is a little different. GIT actually stores a lower-bound key of a contiguous* range of keys that all point to the same page, and for each of those ranges stores a bitmap of page offsets. A search searches first for an exact match in the index, and failing that, looks to see if the previous index tuple happens to be one of these ranges. The algorithm Chris is talking about stores a set of tuple ids (which include page and offset) for each distinct key. Both could be helpful, although I don't think they can work together very well. GIT has the disadvantage that it's lossy. It doesn't even store every key in the index, so it can't be sure that the match actually is a match. Thus, it returns candidate matches. That also has implications for enforcing UNIQUE (although it's not impossible, according to the readme). However, GIT can be used effectively on an index that happens to be unique. GIT also assumes a tree structure, and makes no sense for a hash index, and makes no sense for a types without ordering. GIT's space savings is dependent on the clustering of the table. Chris's suggestion would work on a UNIQUE index, but would be no help at all, because there would be no duplicate keys to collapse. However, it could be used for non-tree indexes. The space savings for this strategy is dependent on how repetitive the keys are. I guess the ultimate deciding factor is which can save you more space. If you have lots of duplicates, Chris's suggestion might work better, because you don't have to try to maintain cluster order. If you have a wider distribution of data, GIT is probably better, although you have to keep some degree of clustering (HOT may help with that). Heikki is the authority on GIT, so I'm including him in the CC so he can correct me :) Regards, Jeff Davis *: I'm not 100% sure I'm using contiguous correctly, but the range of keys can contain gaps or duplicates, so long as every key in the range points to that same page. That is, if keys 1,1,2,3,5 all point to page P, they can be grouped into just 1 so long as there doesn't exist a key 4 that points to a page other than P. ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] CVS corruption/mistagging?
Andrew Dunstan [EMAIL PROTECTED] writes: Looking at the Committers mail, it looks like there have only been two very small commits since Michael's series of commits around 12 to 13.5 hours ago, and before that nothing since around 28 hours ago. Do we have a backup snapshot of the repo taken during that time that we can roll back to? That might be simpler than doing surgery on the repo. It looks to me like the mistake was unrelated to any commit, and was made at Aug 12 09:50 UTC. (Aside from the file-timestamp evidence, caracara's build failure at 2007-08-12 213845 shows that the problem is more than 2 days old.) I count seven commits of my own and several of Michael's since then. If I'm right that a simple cvs rtag will fix it, I'd rather do that than try to reapply patches. regards, tom lane ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] CVS corruption/mistagging?
I wrote: I did a fresh checkout of the 7.4 branch and diff'd against my local copy, and it seems clear that every file that was not in 7.4 at all has had its HEAD version tagged as REL7_4_STABLE. The files that did exist then are all right. That's throughout the whole tree, not just in contrib/btree_gist. As for getting out of it: it looks like the misapplied tags are revision tags not branch tags, eg RCS file: /cvsroot/pgsql/contrib/btree_gist/btree_bit.c,v Working file: btree_bit.c head: 1.7 branch: locks: strict access list: symbolic names: REL7_4_STABLE: 1.7 --- wrong REL8_0_13: 1.2 REL8_1_9: 1.6 REL8_2_4: 1.6 REL8_0_12: 1.2 REL8_1_8: 1.6 REL8_2_3: 1.6 REL8_0_11: 1.2 REL8_1_7: 1.6 REL8_2_2: 1.6 REL8_0_10: 1.2 REL8_1_6: 1.6 REL8_2_1: 1.6 REL8_2_STABLE: 1.6.0.4 which I think means that there's a one-liner way out of it. The CVS manual quoth: To delete a tag, specify the `-d' option to either `cvs tag' or `cvs rtag'. For example: cvs rtag -d rel-0-4 tc deletes the non-branch tag `rel-0-4' from the module `tc'. In the event that branch tags are encountered within the repository with the given name, a warning message will be issued and the branch tag will not be deleted. If you are absolutely certain you know what you are doing, the `-B' option may be specified to allow deletion of branch tags. In that case, any non-branch tags encountered will trigger warnings and will not be deleted. So I think that cvs rtag -d REL7_4_STABLE pgsql will fix it. I'd like someone to double-check that though. Also maybe we should back up the repository first? regards, tom lane ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] CVS corruption/mistagging?
Tom Lane wrote: So I think that cvs rtag -d REL7_4_STABLE pgsql will fix it. I'd like someone to double-check that though. I will test on a copy of my mirror. Also maybe we should back up the repository first? Amen. cheers andrew ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] CVS corruption/mistagging?
Andrew Dunstan wrote: Tom Lane wrote: So I think that cvs rtag -d REL7_4_STABLE pgsql will fix it. I'd like someone to double-check that though. I will test on a copy of my mirror. I copied the mirror, did a checkout from it, ran the command above in the checked out version, then remove my checkout and checked out REL7_4_STABLE and did a diff against my copy from the postgres repo that hasn't been updated for some time. There were no significant differences, and no extra files. So I'd say your strategy looks good - backup and remove the phony tag. I'd also say we should probably be logging tag commands in taginfo. Presumably we mere mortal committers should not be doing any tagging whatsoever, and tags should only be applied in preparation for releases. Anyway, if we actually logged it we'd catch anything untoward early rather than possibly days later. cheers andrew ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] HOT pgbench results
On 8/14/07, ITAGAKI Takahiro [EMAIL PROTECTED] wrote: Heikki Linnakangas [EMAIL PROTECTED] wrote: I ran some CPU intensive pgbench tests on HOT. Results are not surprising, HOT makes practically no difference on the total transaction rate, but reduces the need to vacuum: unpatched HOT tps 3680 3790 WAL written(MB) 5386 4804 checkpoints 109 autovacuums 116 43 autoanalyzes139 60 I also ran pgbench with/without HOT using a bit different configurations (pgbench -s10 -c10 -t50). There were 10% performance win on HOT, although the test was CPU intensive and with FILLFACTOR=100%. I'm curious why I am seeing results so different from everybody else (I had almost double tps with HOT). Are you running fsync on/off? Any other changes to postgresql.conf? merlin ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] [EMAIL PROTECTED]: Re: [GENERAL] array_to_set functions]
On 8/14/07, Bruce Momjian [EMAIL PROTECTED] wrote: TODO item? I would say yes...array_accum is virtually an essential function when working with arrays and the suggested array_to_set (and it's built in cousin, _pg_expand_array) really should not be built around generate_series when a C function is faster and will scale much better. array_to_set, as suggested in SQL, is something only a relative expert with PostgreSQL could be expected to write. Thus could generate_series be relieved from providing the only core function for set returning functions in the documentation. IMO, this part of the documentation could use some expansion anyways :) merlin ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] default_text_search_config and expression indexes
Tom Lane wrote: Alvaro Herrera [EMAIL PROTECTED] writes: Bruce Momjian escribi?: What has really hurt the default GUC idea is how to do restores from a pg_dump. I guess what should happen is that pg_dump should include a SET default_text_search_config = 'foo' just before the CREATE INDEX, like we do for other variables. The nasty cases are in data-only dumps, that is, where you're trying to load data into a table with pre-existing indexes or triggers. A SET like the above is at least as likely to be wrong as right, if the index or trigger depends on it to tell it what to do. Ouch. I had not even thought that far. FYI, yes, the default tsearch GUC controls operations per row _if_ you have triggers or expression indexes that rely on the default configuration GUC. If you have specified the configuration, there is no problem, and hence my conclusion that the default GUC is too error-prone. -- Bruce Momjian [EMAIL PROTECTED] http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] CVS corruption/mistagging?
Andrew Dunstan [EMAIL PROTECTED] writes: So I'd say your strategy looks good - backup and remove the phony tag. I'd also say we should probably be logging tag commands in taginfo. Presumably we mere mortal committers should not be doing any tagging whatsoever, and tags should only be applied in preparation for releases. +1 ... we should at least log such commands, and maybe disallow to anyone except Marc's pgsql account. Particularly since they don't get reported in pgsql-committers messages (or is that something we can/should change?) Meanwhile, is there anyone around who can either (1) tar up the repository directory tree as root, or (2) confirm that a tarball made by a non-root committer is sufficient? regards, tom lane ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] CVS corruption/mistagging?
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 - --On Tuesday, August 14, 2007 22:20:16 -0400 Tom Lane [EMAIL PROTECTED] wrote: +1 ... we should at least log such commands, and maybe disallow to anyone except Marc's pgsql account. Particularly since they don't get reported in pgsql-committers messages (or is that something we can/should change?) Meanwhile, is there anyone around who can either (1) tar up the repository directory tree as root, or (2) confirm that a tarball made by a non-root committer is sufficient? I've just finished tar'ng up the cvs repository as root ... I believe one as a non-root would be sufficient also, since it *should* retain file ownership / permissions with the -p option ... cvs# ls -lt cvsroot.aug14_2007.tar.bz2 - -rw-r--r-- 1 root wheel 62555651 Aug 15 02:52 cvsroot.aug14_2007.tar.bz2 cvs# du -sk cvsroot 414406 cvsroot - Marc G. Fournier Hub.Org Networking Services (http://www.hub.org) Email . [EMAIL PROTECTED] MSN . [EMAIL PROTECTED] Yahoo . yscrappy Skype: hub.orgICQ . 7615664 -BEGIN PGP SIGNATURE- Version: GnuPG v2.0.4 (FreeBSD) iD8DBQFGwmq14QvfyHIvDvMRAsyFAJ9AND3JTDmOMR48Hl3Fdse2lVCjSwCglsu1 Pxj1LPPPx311rpY2XFj+Kvo= =Gz6E -END PGP SIGNATURE- ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] CVS corruption/mistagging?
Marc G. Fournier [EMAIL PROTECTED] writes: On Tuesday, August 14, 2007 22:20:16 -0400 Tom Lane [EMAIL PROTECTED] wrote: Meanwhile, is there anyone around who can either (1) tar up the repository directory tree as root, or (2) confirm that a tarball made by a non-root committer is sufficient? I've just finished tar'ng up the cvs repository as root ... Great --- launching cvs rtag command now. I believe one as a non-root would be sufficient also, since it *should* retain file ownership / permissions with the -p option ... That's what I thought too, but a little extra paranoia doesn't hurt. regards, tom lane ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] tsearch2 in PostgreSQL 8.3?
Tom Lane wrote: * For queries, there is not anything very wrong with having a default configuration, but the trick is how to get the planner to match that up with an index that's written with the two-parameter form of to_tsvector. One hackish possibility is to define the single-parameter form of to_tsvector like this: create function to_tsvector(text) returns tsvector as $$ select to_tsvector(get_default_text_search_config(), $1) $$ language sql strict stable; where get_default_text_search_config() is essentially just current_setting('default_text_search_config') except it is misleadingly marked immutable. Then, a query with WHERE to_tsvector(document_col) @@ tsquery(...) will have the SQL function inlined, and the get_default_text_search_config() call const-folded, and suddenly it looks like WHERE to_tsvector('english', document_col) @@ tsquery(...) This is an interesting idea that would allow queries without a configuration to match an expression index. The only trick is to train users not to use such tricks on triggers or expression index, but only in queries. The idea of 'stable' helps to fix that for expression indexes, but not for trigger use, I assume. * I'm not enamored of Heikki's idea about a datatype-based solution, because I don't think that it will allow columns containing tsvectors derived with different configurations. It's perfectly clear that advanced users want to be able to do that, and it's also clear that as long as the config name is coming from a stored column (or can be reconstructed somehow from the stored data) that it's perfectly well-defined. Since I don't think that a datatype solution is the way to go, I don't feel that we are as far away from an agreement as Bruce is worried about. Well, from where I sit, there is one person saying give me the foot gun, and Heikki saying he wants a bullet-proof type system, and you and I are in the middle, so the big problem is I don't see a concensus forming, and we have been discussing this for a while. -- Bruce Momjian [EMAIL PROTECTED] http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] tsearch2 in PostgreSQL 8.3?
Bruce Momjian [EMAIL PROTECTED] writes: Tom Lane wrote: Since I don't think that a datatype solution is the way to go, I don't feel that we are as far away from an agreement as Bruce is worried about. Well, from where I sit, there is one person saying give me the foot gun, and Heikki saying he wants a bullet-proof type system, and you and I are in the middle, so the big problem is I don't see a concensus forming, and we have been discussing this for a while. The people who actually use tsearch2 seem to all have the same opinion ... so I think we can't go too far in the bullet-proofing direction. But I would like a design that is bulletproof in dump/reload scenarios, and I think it's fair to question that aspect of the tsearch2 design because we've seen many reports of people having trouble updating databases that use tsearch2. regards, tom lane ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] tsearch2 in PostgreSQL 8.3?
Tom Lane wrote: Bruce Momjian [EMAIL PROTECTED] writes: The other point is that we should have a good idea of the API because if it gets into 8.3 it will be harder to change. Yeah, once it's in core we have a pretty strong backwards-compatibility restriction to deal with. Someone upthread claimed we can always simplify it later but that's exactly backward --- we can add features later, but we can't subtract. Maybe we should be looking to implement just the minimum set of features for 8.3 and leave some of the more controversial stuff for 8.4. I hate to admit it, but if we take that point of view then triggers are in and functional-index support is out. We have to support the trigger approach because it's what is in tsearch2 now, and the existing users will expect to continue to have that option. Triggers and expression indexes were both in the documentation Oleg supplied, so I am sure both are being used. I bet some users don't even know they are using expression indexes because creating a GIN index on a column automatically casts to tsvector. (But GIST does not.) I had to ask Oleg to find out this out. However, allowing the standard triggers to pay attention to a configuration GUC variable is simply broken; that bit has to go away. The only trigger example supplied by Oleg and Teodor is tsvector_update_trigger(), and that doesn't take a configuration name, meaning it uses the default GUC configuration. Uh, how are we going to prevent the auto-casting to tsvector from using the default GUC config, e.g. CREATE INDEX i ON x USING GIN(col)? This is where I started to see the need for education and error-prone nature of the default GUC just wasn't worth having it, though I know others disagree. -- Bruce Momjian [EMAIL PROTECTED] http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] tsearch2 in PostgreSQL 8.3?
Tom Lane wrote: Bruce Momjian [EMAIL PROTECTED] writes: Tom Lane wrote: Since I don't think that a datatype solution is the way to go, I don't feel that we are as far away from an agreement as Bruce is worried about. Well, from where I sit, there is one person saying give me the foot gun, and Heikki saying he wants a bullet-proof type system, and you and I are in the middle, so the big problem is I don't see a concensus forming, and we have been discussing this for a while. The people who actually use tsearch2 seem to all have the same opinion ... so I think we can't go too far in the bullet-proofing direction. But I would like a design that is bulletproof in dump/reload scenarios, and I think it's fair to question that aspect of the tsearch2 design because we've seen many reports of people having trouble updating databases that use tsearch2. Yea, look at the trouble we are having trying to underestand it all. -- Bruce Momjian [EMAIL PROTECTED] http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] CVS corruption/mistagging?
I wrote: Great --- launching cvs rtag command now. Done, and I got a plausible-looking mix of messages like cvs rtag: Not removing branch tag `REL7_4_STABLE' from `/cvsroot/pgsql/src/tutorial/funcs_new.c,v'. and a fresh checkout of REL7_4_STABLE now matches what I had locally. So I think we walked away from that one. TODO items: * restrict, or at least log, cvs tag/rtag commands. Maybe report them to pgsql-committers. * I notice that the repository contains some files owned by no-longer-existent accounts, eg 258 in contrib/btree_gist: ls -la total 460 drwxrwxr-x 6 scrappy dev 1024 Aug 15 03:00 . drwxrwxr-x 90 scrappy dev 2048 Aug 15 03:00 .. drwxrwxr-x 2 scrappy dev512 Apr 20 15:16 Attic -r--r--r-- 1 tgl dev 9555 Jun 26 22:05 Makefile,v -r--r--r-- 1 258 dev 5870 Apr 20 16:19 README.btree_gist,v -r--r--r-- 1 tgl dev 12033 Aug 15 03:00 btree_bit.c,v -r--r--r-- 1 tgl dev 9902 Aug 15 03:00 btree_bytea.c,v -r--r--r-- 1 tgl dev 10227 Aug 15 03:00 btree_cash.c,v -r--r--r-- 1 tgl dev 11414 Aug 15 03:00 btree_date.c,v -r--r--r-- 1 tgl dev 10211 Aug 15 03:00 btree_float4.c,v -r--r--r-- 1 tgl dev 10074 Aug 15 03:00 btree_float8.c,v -r--r--r-- 1 tgl dev 33915 Aug 15 03:00 btree_gist.c,v -r--r--r-- 1 258 dev 4325 Apr 20 15:16 btree_gist.h,v -r--r--r-- 1 258 dev 58744 Apr 20 16:19 btree_gist.sql.in,v -r--r--r-- 1 tgl dev 14717 Aug 15 03:00 btree_inet.c,v -r--r--r-- 1 tgl dev 10234 Aug 15 03:00 btree_int2.c,v -r--r--r-- 1 tgl dev 10221 Aug 15 03:00 btree_int4.c,v -r--r--r-- 1 tgl dev 10235 Aug 15 03:00 btree_int8.c,v I suppose this is due to Magnus' cleanup of unused committers accounts. That was a good thing, but for security's sake these files ought to be chown'd to some existing committer's account. regards, tom lane ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] tsearch2 in PostgreSQL 8.3?
Bruce Momjian wrote: The people who actually use tsearch2 seem to all have the same opinion ... so I think we can't go too far in the bullet-proofing direction. Yeah. But I would like a design that is bulletproof in dump/reload scenarios, and I think it's fair to question that aspect of the tsearch2 design because we've seen many reports of people having trouble updating databases that use tsearch2. Yea, look at the trouble we are having trying to underestand it all. True. But I wasn't too concerned about the forecast difficulties with data only dumps. Those fail in plenty of circumstances. It is important that there is *some* reliable dump/restore/upgrade path, though. cheers andrew ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] default_text_search_config and expression indexes
On 8/14/07, Gregory Stark [EMAIL PROTECTED] wrote: Mike Rylander [EMAIL PROTECTED] writes: My application (http://open-ils.org, which run 80% of the public libraries in Georgia, USA, http://gapines.org and http://georgialibraries.org/lib/pines.html) requires that I be able to search a corpus of bibliographic records in a mix of languages, and potentially with mixed stop-word rules, with one query. I cannot know ahead of time what languages will be used in the corpus and I cannot restrict any one query to one language. To accomplish this, the record itself will be inspected inside an INSERT/UPDATE trigger to determine the language and type, and use the correct configuration for creating the tsvector. This will obviously result in a mixed tsvector column, but that's exactly what I need. I can filter on record language if the user happens to specify a query language (and thus configuration), or simply rank the assumed (IP based, perhaps, or browser preference based) preferred language higher, or one of a hundred other things. But I won't be able to do any of that if tsvectors are required to have one and only one configuration per column. Anyway, I felt I needed to provide some outside perspective to this, as a user, since it seems that the external viewpoint (my particular viewpoint, at least) was missing from the discussion. This is *extremely* useful. I think it's precisely what we've been missing so far. At least, what I've been missing. So the question is what exactly happens in this case? If I search for the does that mean it will ignore matches in English where that's a stop-word but find me books on tea in French? Is that what I should expect to happen? What if I search for earl and the? Does that find me French books on Early Grey Tea but English books on all earls? Oh dear ... you went and got me started... Each field type has a different set of configurations for potenial use. Title and subject fields, for instance, do not (generally) use stop-word lists, so a subject search for the will match any record with the lexem the in a subject field. Title fields are a little more complicated, because there is information in the bibliographic record about how and when to skip leading articles, but generally those are indexed as well for FTS purposes. English LCSH subjects generally don't have stop(like) words in them, so you'll probably just get French tea records. Title results would be a mix of earls and French tea records (probably) and the correlation between the user's preferred language (either chosen query lang or UI lang, whichever is available) will help adjust the ranking, pushing what are likely to be the more appropriate records to the top. Note, however, that much of this multi-tsearch2-configuration setup is not used in the implementation at http://gapines.org/ because, well, there's not much need (read: demand from librarians) for that dataset to support these more complex tricks. It's basically all en-US and exclude stop-words. Other implementations are making more use of what I describe above, including a (government mandated) French-English bilingual institution who shall remain nameless for the time being... What happens if I use the same operator directly on the text column? Or perhaps it's not even possible to specify stop-words when operating on a text column? Should it be? You mean with an expression index on a text column? I haven't considered using them for FTS. It just feels easier and more flexible to me to use an external tsvector column because of the fairly heavy processing that goes into creating each tsvector value. I may re-evaluate that position now that CREATE INDEX CONCURRENTLY exists, but I'm not developing with 8.2+ only features yet. Once 8.3 is out that may change. Also, unless I misunderstand, you have to wrap the text column in the function used to build the index. For my purposes, that makes building a generic FTS driver for my app (which, admittedly, only has a Postgresql driver ;) ) more difficult than having a hidden extra column. Again, that could change if the benefits of CREATE INDEX CONCURRENTLY end up outweighing simpler FTS driver code. --miker ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] CVS corruption/mistagging?
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 - --On Tuesday, August 14, 2007 23:26:03 -0400 Tom Lane [EMAIL PROTECTED] wrote: I wrote: Great --- launching cvs rtag command now. Done, and I got a plausible-looking mix of messages like cvs rtag: Not removing branch tag `REL7_4_STABLE' from `/cvsroot/pgsql/src/tutorial/funcs_new.c,v'. and a fresh checkout of REL7_4_STABLE now matches what I had locally. So I think we walked away from that one. TODO items: * restrict, or at least log, cvs tag/rtag commands. Maybe report them to pgsql-committers. working on it ... have code for it, I've just spent so much time with CVSROOT = /cvsroot, I have to relearn remote CVS :) ls -la total 460 drwxrwxr-x 6 scrappy dev 1024 Aug 15 03:00 . drwxrwxr-x 90 scrappy dev 2048 Aug 15 03:00 .. drwxrwxr-x 2 scrappy dev512 Apr 20 15:16 Attic -r--r--r-- 1 tgl dev 9555 Jun 26 22:05 Makefile,v -r--r--r-- 1 258 dev 5870 Apr 20 16:19 README.btree_gist,v -r--r--r-- 1 tgl dev 12033 Aug 15 03:00 btree_bit.c,v -r--r--r-- 1 tgl dev 9902 Aug 15 03:00 btree_bytea.c,v -r--r--r-- 1 tgl dev 10227 Aug 15 03:00 btree_cash.c,v -r--r--r-- 1 tgl dev 11414 Aug 15 03:00 btree_date.c,v -r--r--r-- 1 tgl dev 10211 Aug 15 03:00 btree_float4.c,v -r--r--r-- 1 tgl dev 10074 Aug 15 03:00 btree_float8.c,v -r--r--r-- 1 tgl dev 33915 Aug 15 03:00 btree_gist.c,v -r--r--r-- 1 258 dev 4325 Apr 20 15:16 btree_gist.h,v -r--r--r-- 1 258 dev 58744 Apr 20 16:19 btree_gist.sql.in,v -r--r--r-- 1 tgl dev 14717 Aug 15 03:00 btree_inet.c,v -r--r--r-- 1 tgl dev 10234 Aug 15 03:00 btree_int2.c,v -r--r--r-- 1 tgl dev 10221 Aug 15 03:00 btree_int4.c,v -r--r--r-- 1 tgl dev 10235 Aug 15 03:00 btree_int8.c,v I suppose this is due to Magnus' cleanup of unused committers accounts. That was a good thing, but for security's sake these files ought to be chown'd to some existing committer's account. I can do a quick chown -R scrappy on the whole repository ... ok? - Marc G. Fournier Hub.Org Networking Services (http://www.hub.org) Email . [EMAIL PROTECTED] MSN . [EMAIL PROTECTED] Yahoo . yscrappy Skype: hub.orgICQ . 7615664 -BEGIN PGP SIGNATURE- Version: GnuPG v2.0.4 (FreeBSD) iD8DBQFGwnf34QvfyHIvDvMRAnGTAJ43gELVZ5Q5SpV/jppR9rODB/ILaACeKCEN DE16fnY/PmKlrvdHFaJeIKI= =VCFu -END PGP SIGNATURE- ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] CVS corruption/mistagging?
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 - --On Tuesday, August 14, 2007 23:26:03 -0400 Tom Lane [EMAIL PROTECTED] wrote: * restrict, or at least log, cvs tag/rtag commands. Maybe report them to pgsql-committers. It should be done ... if you try and create a tag, it should generate an error message ... not sure of the best way to test that it actually works, since, if it doesn't work, of course, you'll add on another tag, but do you want to try and create a 'fake tag' and then delete it afterwards? - Marc G. Fournier Hub.Org Networking Services (http://www.hub.org) Email . [EMAIL PROTECTED] MSN . [EMAIL PROTECTED] Yahoo . yscrappy Skype: hub.orgICQ . 7615664 -BEGIN PGP SIGNATURE- Version: GnuPG v2.0.4 (FreeBSD) iD8DBQFGwnoO4QvfyHIvDvMRArHSAJ4qYuexUGn6Ff3/PkHU/DjJBlCQnwCgxyy9 Qc852wMVRRMYdwoD3+p7X3Y= =FCD9 -END PGP SIGNATURE- ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq