With the talk about adding compression to pg_dump lately, I've been
wondering if tables and indexes could be compressed too.
So I've implemented a quick on-the-fly compression patch for postgres.
Sorry for the long email, but I hope you find this interesting.
Why compress ?
1- To sa
PFC writes:
- for short strings (average 12 bytes), sort is CPU-bound in strcoll()
- for longer strings (average 120 bytes), sort is even more CPU-bound in
strcoll()
No news there. If you are limited by the speed of text comparisons,
consider using C locale.
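For illustration, on a hypothetical table test(s text), the difference is
easy to see (per-expression COLLATE only exists in 9.1+; on 8.3 the whole
cluster would have to be initdb'ed with --locale=C instead):

    EXPLAIN ANALYZE SELECT s FROM test ORDER BY s;
    -- locale-aware sort: every comparison goes through strcoll()
    EXPLAIN ANALYZE SELECT s FROM test ORDER BY s COLLATE "C";
    -- C "locale": comparisons are plain memcmp(), typically much faster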
regards
There was a thread some time ago about sorting... it kind of died...
I did some tests on a desktop (Postgres 8.3.7, kubuntu, Core 2 dual core,
4GB RAM, RAID1 of 2 SATA disks)
Quick conclusions :
- grabbing the stuff to sort can be IO bound of course (not here)
- for short strings (average 12
Sure, modifying the WHERE clause is still possible, but the attacker is
a lot more limited in what he can do if he can't tack on a whole new
command.
I hacked into a site like that one day to show a guy that you shouldn't
trust magic_quotes (especially when you switch hosting providers and
Could we also get a mode, where PREPARE would only be allowed for
queries of the form "SELECT * FROM func(?,?,?,?,?)"? :)
Actually, that is similar to the concept of "global prepared statements"
that I proposed some time ago, but I will not have time to write the
patch, alas...
Idea was t
For example, some applications need to replace whole phrases:
$criteria = "WHERE $var1 = '$var2'"
This is a very common approach for dynamic search screens, and really not
covered by placeholder approaches.
Python, again :
params = {
'column1': 10,
'column2': "a st'rin
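For what it's worth, the identifier part can be handled server-side too; a
minimal PL/pgSQL sketch (format() is 9.1+, RETURN QUERY EXECUTE ... USING
is 8.4+; the items table and column names are hypothetical):

    CREATE OR REPLACE FUNCTION search_items(col_name text, col_value text)
    RETURNS SETOF items LANGUAGE plpgsql AS $$
    BEGIN
        -- %I quotes the column name as an identifier; the value itself
        -- still travels as a real parameter, so no injection is possible
        RETURN QUERY EXECUTE
            format('SELECT * FROM items WHERE %I = $1', col_name)
            USING col_value;
    END;
    $$;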
zero developer pain
Actually it's not zero pain, but the main problem is: there is no way
to enforce using it.
Sure, there is no way to enforce it (apart from grepping the source for
pg_query() and flogging someone if it is found), but is it really
necessary when the right solution is eas
On Tue, 29 Apr 2008 01:03:33 +0200, Brendan Jurd <[EMAIL PROTECTED]> wrote:
On Tue, Apr 29, 2008 at 7:00 AM, PFC <[EMAIL PROTECTED]> wrote:
I have found that the little bit of code posted afterwards did
eliminate
SQL holes in my PHP applications with zero developer pain, actu
As you know, "SQL injection" is the main security problem of databases
today.
I think I found a solution: 'disabling literals'. Or you may call it
'enforcing the use of parameterized statements'. This means that SQL
statements with embedded user input are rejected at runtime. My
solution goes
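A sketch of the intended effect (the users table is hypothetical, and the
rejection itself is the proposed feature; H2's version of this is SET
ALLOW_LITERALS NONE):

    SELECT * FROM users WHERE name = 'alice';   -- rejected: embedded literal
    PREPARE get_user(text) AS
        SELECT * FROM users WHERE name = $1;    -- accepted: parameterized
    EXECUTE get_user('alice');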
Example : let's imagine a "cache priority" setting.
Which we can presume the DBA will set incorrectly because the tools
needed to set that right aren't easy to use.
LOL, yes.
Jim threw out that you can just look at the page hit percentages
instead. That's not completely true. I
It started with this query :
EXPLAIN ANALYZE SELECT * FROM relations r JOIN nodes n ON
(n.id=r.child_id) WHERE r.parent_id=16330;
QUERY PLAN
On Thu, 17 Apr 2008 20:42:49 +0200, Simon Riggs <[EMAIL PROTECTED]>
wrote:
On Thu, 2008-04-17 at 12:34 -0400, Tom Lane wrote:
I'm tempted to propose redefining the currently-nearly-useless
EXPLAIN VERBOSE option as doing this.
Yes please.
Sounds like a good home for other useful things al
My wife has a snake phobia; besides, I've just started learning Scala.
Just had a look at Scala, it looks nice. Slightly Lispish (like all good
languages)...
txid_current()
No... hold on, it is per session, and a session can't have two or more
transactions active at once, can it?
On Thu, 17 Apr 2008 02:48:37 +0200, Stephen Denne
<[EMAIL PROTECTED]> wrote:
PFC wrote:
Let's try this quick & dirty implementation of a local
count-delta cache
using a local in-memory hashtable (ie. {}).
CREATE OR REPLACE FUNCTION update_count( key TEXT, delta INTEG
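A rough SQL-only variant of the same idea, with a session-local temp table
standing in for the in-memory hashtable (ON CONFLICT needs 9.5+; names
follow the original):

    CREATE TEMP TABLE count_cache (key text PRIMARY KEY, delta bigint);

    CREATE OR REPLACE FUNCTION update_count(p_key text, p_delta integer)
    RETURNS void LANGUAGE plpgsql AS $$
    BEGIN
        -- accumulate the delta for this key in the session-local cache
        INSERT INTO count_cache AS c (key, delta)
        VALUES (p_key, p_delta)
        ON CONFLICT (key) DO UPDATE SET delta = c.delta + EXCLUDED.delta;
    END;
    $$;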
The whole thing is a bit of an abuse of what the mechanism
was intended
for, and so I'm not sure we should rejigger GUC's behavior to make it
more pleasant, but on the other hand if we're not ready to provide a
better substitute ...
In my experiments with materialized views, I identified these
On Mon, 14 Apr 2008 16:17:18 +0200, Csaba Nagy <[EMAIL PROTECTED]> wrote:
On Mon, 2008-04-14 at 16:10 +0200, Csaba Nagy wrote:
... or plan the query with the actual parameter value you get, and also
record the range of the parameter values you expect the plan to be valid
for. If at execution ti
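The point is easy to demonstrate with a skewed column (table and values
hypothetical): the "same" query wants two different plans depending on the
parameter, which is exactly what a recorded validity range would capture:

    EXPLAIN SELECT * FROM orders WHERE status = 'failed';
    -- rare value   -> an index scan is the right plan
    EXPLAIN SELECT * FROM orders WHERE status = 'shipped';
    -- common value -> a seq scan beats the index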
Bind message behaviour was modified :
- If the user asks for execution of a named prepared statement, and the
named statement does not exist in PG's prepared statements cache,
instead
of issuing an error and borking the transaction, it Binds to an empty
statement that takes no parameters, an
If cached plans would be implemented, the dependence on parameter values
could be solved too: use special "fork" nodes in the plan which execute
different sub-plans depending on special parameter values/ranges,
possibly looking up the stats at runtime, so that the plan is in a
compiled state wit
Why limit ourselves to Oracle? Here is how all the major proprietary RDBMSs do it.
Thanks for the links. Very interesting.
The DB2 document especially mentions an important point : in order to make
their planner/optimizer smarter, they had to make it slower, hence it
became crucial to cache the plans
On Fri, Apr 11, 2008 at 12:34 PM, PFC <[EMAIL PROTECTED]> wrote:
Well, I realized the idea of global prepared statements actually
sucked, so I set on another approach thanks to ideas from this list,
this is
caching query plans.
Well, that's a blatantly bad realizatio
Well if you're caching per-connection then it doesn't really matter
whether
you do it on the client side or the server side, it's pretty much
exactly the
same problem.
Actually I thought about doing it on the server since it would then also
work with connection pooling.
Doi
I think what he's referring to is persistently caching plans so that new
connections can use them. That makes a lot more sense if you have lots of
short-lived connections like a stock php server without persistent
connections
turned on or a connection pooler. You can prepare queries but they o
Well, I realized the idea of global prepared statements actually sucked,
so I set on another approach thanks to ideas from this list, this is
caching query plans.
First, let's see if there is low hanging fruit with the typical small,
often-executed queries that are so frequent on website
* Dumb Optimization #1:
- Add executorFunc function pointer to struct PlanState
- in ExecProcNode.c -> ExecProcNode() :
- upon first execution, set executorFunc to the function corresponding to
node type
- next calls use function pointer
Effect : removes a switch (nodeTag(no
PFC wrote:
About the FSM :
Would it be possible to add a flag marking pages where all tuples
are visible to all transactions ? (kinda like frozen I think)
Ah, the visibility map. That's another line of discussion. The current
plan is to not tie that to the FSM, but impleme
About the FSM :
Would it be possible to add a flag marking pages where all tuples are
visible to all transactions ? (kinda like frozen I think)
This could be useful to implement index-only scans, for count(), or to
quickly skip rows when OFFSET is used, or to use only the index whe
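This is essentially what later became index-only scans (9.2+), which
consult exactly such a per-page all-visible flag; an illustrative session,
assuming bookmarks has a primary key and has been vacuumed:

    EXPLAIN ANALYZE SELECT count(*) FROM bookmarks;
    --  Aggregate
    --    ->  Index Only Scan using bookmarks_pkey on bookmarks
    --          Heap Fetches: 0   <- all-visible pages skip the heap entirely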
On Sat, 05 Apr 2008 02:17:10 +0100
Gregory Stark <[EMAIL PROTECTED]> wrote:
I was inclined to dismiss it myself but I think the point that's come
up here
is interesting. The ISP has to not just install an RPM or type make
install in
some source tree -- but actually log into each customer's dat
INSERT INTO mytable (id, date, ...) SELECT id, NULLIF( date,
'0000-00-00' ), ... FROM mydump WHERE (FKs check and drop the broken
records);
What do we gain against the current way of doing it, which is:
COPY loadtable FROM 'dump.txt' WITH ...
INSERT INTO destination_table(...) SELECT .
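Spelled out, the current two-step recipe looks like this (table and file
names hypothetical):

    CREATE TEMP TABLE loadtable (id integer, date text);
    COPY loadtable FROM '/tmp/dump.txt';
    INSERT INTO destination_table (id, date)
    SELECT id, NULLIF(date, '0000-00-00')::date
    FROM loadtable l
    WHERE EXISTS (SELECT 1 FROM parent p WHERE p.id = l.id);  -- drop broken FKs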
On Thu, 03 Apr 2008 16:57:53 +0200, Csaba Nagy <[EMAIL PROTECTED]> wrote:
On Thu, 2008-04-03 at 16:44 +0200, PFC wrote:
CREATE FLATFILE READER mydump (
id INTEGER,
date    TEXT,
...
) FROM file 'dump.txt'
(followed by delimiter specification syn
Data transformation while doing a data load is a requirement now and
then.
That users will have to do mass updates *after* the load completes to
mend the data to their liking should be reason enough to do this while
the loading is happening. I think to go about it the right
way w
The MAJOR benefit of Microsoft's approach is that it works on existing
applications,
Yes, that is a nice benefit !
Is there a way to turn it on/off ? Or is it smart enough to only cache
plans for cases where it is relevant ?
For instance, I absolutely want some queries to be planne
On Tue, 01 Apr 2008 16:06:01 +0200, Tom Lane <[EMAIL PROTECTED]> wrote:
Dave Cramer <[EMAIL PROTECTED]> writes:
Was the driver ever changed to take advantage of the above strategy?
Well, it's automatic as long as you use the unnamed statement. About
all that might need to be done on the clie
* Server restart and assorted like failover (you need to redo a
global prepare).
Hmm? He's proposing storing the info in a system catalog. That hardly
seems "volatile"; it'll certainly survive a server restart.
Yes, it's in a system catalog.
I agree with the point that th
Here is the result of my little experiment, for your viewing pleasure,
lol.
Now it works. Well, it hasn't crashed yet... so I guess I can show it to
people ;)
- Purpose :
Allow PHP (or other languages) users to use prepared statements
(pg_exec()) together with persistent connections, w
PFC wrote:
Hello,
So, I embarked (yesterday) on a weekend project to add a new
feature to Postgres...
I use PHP with persistent connections and always have been bothered
that those very small AJAX queries (usually simple selects returning 1
row) take more CPU in postgres to
Hello,
So, I embarked (yesterday) on a weekend project to add a new feature to
Postgres...
I use PHP with persistent connections and always have been bothered that
those very small AJAX queries (usually simple selects returning 1 row)
take more CPU in postgres to parse & plan than
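What the feature amortizes, in plain SQL terms: with a persistent
connection the backend survives across page views, so the expensive part
runs once (statement and table names hypothetical):

    PREPARE get_row (integer) AS
        SELECT * FROM items WHERE id = $1;  -- parse + plan paid once
    EXECUTE get_row(42);                    -- each AJAX hit only pays this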
On Fri, 22 Jun 2007 16:43:00 +0200, Bruce Momjian <[EMAIL PROTECTED]> wrote:
Simon Riggs wrote:
On Fri, 2007-06-22 at 14:29 +0100, Gregory Stark wrote:
> "Joshua D. Drake" <[EMAIL PROTECTED]> writes:
>
> > Tom Lane wrote:
> >
> >> untrustworthy disk hardware, for instance. I'd much rather use
So now we're poking a hole in that but we certainly have to ensure that
any
transactions that do see the results of our deferred commit themselves
don't
record any visible effects until both their commit and ours hit WAL. The
essential point in Simon's approach that guarantees that is that w
On Fri, 15 Jun 2007 22:28:34 +0200, Gregory Maxwell <[EMAIL PROTECTED]>
wrote:
On 6/15/07, Gregory Stark <[EMAIL PROTECTED]> wrote:
While in theory spreading out the writes could have a detrimental
effect I
think we should wait until we see actual numbers. I have a pretty strong
suspicion t
>If we extended relations by more than 8k at a time, we would know a lot
>more about disk layout, at least on filesystems with a decent amount of
>free space.
I doubt it makes that much difference. If there was a significant amount
of fragmentation, we'd hear more complaints about seq scan perfor
My idea is that if an UPDATE places the new tuple on the same page as
the old tuple, it will not create new index entries for any indexes
where the key doesn't change.
Basically the idea behind preventing index bloat by updates is to have
one index tuple point to several actual tuples havin
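On the storage side, the "same page" case can be made much more likely by
leaving slack in each page; a sketch with hypothetical names (fillfactor
is a standard storage parameter, 8.2+):

    CREATE TABLE counters (id integer PRIMARY KEY, hits bigint)
        WITH (fillfactor = 70);  -- keep ~30% of each page free, so an
                                 -- UPDATE can place the new tuple on the
                                 -- same page as the old one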
What about increasing the size of an existing index entry? Can that be
done easily when a new row is added?
I'd say it looks pretty much like inserting a new index tuple...
Say "value" is the indexed column.
Find first page in the index featuring "value".
1 If t
There were some talks lately about compression.
With a bit of lateral thinking I guess this can be used to contain the
bloat induced by updates.
Of course this is just my hypothesis.
Compression in indexes :
Instead of storing (value, tuple identifier) keys in the i
What I see in this discussion is a huge amount of "the grass must be
greener on the other side" syndrome, and hardly any recognition that
every technique has its downsides and complications.
Sure ;)
MVCC generates dead rows, by its very nature ; however I see two trends
in this :
Well, then please help me find a better design cause I can't see one...
what we have here is a big "membership" table of email lists. When
there's a sendout then the memberships of the affected group are heavily
read/updated, otherwise they are idle. None of the memberships is
archive data, they
What you seem not to grasp at this point is a large web-farm, about 10 or
more servers running PHP, Java, ASP, or even perl. The database is
usually
the most convenient and, aside from the particular issue we are talking
about, best suited.
The answer is sticky sessions : each user is assi
As you can see, in about a minute at high load, this very simple table
lost about 10% of its performance, and I've seen worse based on update
frequency. Before you say this is an obscure problem, I can tell you it
isn't. I have worked with more than a few projects that had to switch
away
fro
Great minds think alike ;-) ... I just committed exactly that protocol.
I believe it is correct, because AFAICS there are only four possible
risk cases:
Congrats !
For general culture you might be interested in reading this :
http://en.wikipedia.org/wiki/Software_tran
It strikes me that we are using a single communication mechanism to
handle what are really two distinct kinds of data:
Interesting.
I recently read a paper on how to get rid of locks for this kind of
pattern.
* For the Command String
- Problem : need to display the currently ex
From what I gather, the CSV format dump would only contain data.
I think pg_dump is the friend of pg_restore. It dumps everything
including user defined functions, types, schemas etc. CSV does not fit
with this.
Besides, people will probably want to dump into CSV the result of any
Hello,
Sometimes people complain that UPDATE is slow in postgres. UPDATE...
- generates dead tuples which must be vacuumed.
- needs to hit all indexes even if only one column was modified.
From what I know UPDATE creates a new copy of the old row with the
rele
MySQL already does this for INSERT :
INSERT INTO x (a,b) VALUES (1,2), (3,4), (5,6)...;
Does MySQL really let you stream that? Trying to do syntax like that in
Postgres wouldn't work because the parser would try to build up a parse
tree
for the whole statement before runnin
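Postgres did grow multi-row VALUES (8.2+), but as noted the whole
statement is still parsed up front; COPY remains the streaming path
(table x as in the example above):

    INSERT INTO x (a, b) VALUES (1, 2), (3, 4), (5, 6);  -- one big parse tree
    COPY x (a, b) FROM STDIN;                            -- rows stream in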
I was also vaguely pondering whether all the DDL commands could be
generalized
to receive or send COPY formatted data for repeated execution. It would
be
neat to be able to prepare an UPDATE with placeholders and stream data
in COPY
format as parameters to the UPDATE to execute it thousand
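The closest thing that works today is to stream the parameters with COPY
into a temp table and run the UPDATE once as a join; a sketch with
hypothetical names:

    CREATE TEMP TABLE upd (id integer, newval text) ON COMMIT DROP;
    COPY upd FROM STDIN;   -- the "parameters" stream in, in COPY format
    UPDATE target t SET val = u.newval
    FROM upd u WHERE t.id = u.id;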
Think about version API compatibility.
Suppose you have a working database on server A which uses module foo
version 1.
Some time passes, you buy another server B and install postgres on it.
Meanwhile the module foo has evolved into version 2 which is cooler, but
has some minor A
Hm, thinking again, I guess Tom Lane is right
Surely the initialization code would have to be run anyway ... and if
the function does import a pile of modules, do you really want to cache
all that in its pg_proc entry? What happens if some of the modules get
updated later?
Besides,
If it were really expensive to derive bytecode from source text
then maybe it'd make sense to do what you're doing, but surely that's
not all that expensive. Everyone else manages to parse prosrc on the
fly and cache the result in memory; why isn't plpython doing that?
It depends on the numb
> 0.450 ms INSERT INTO tmp SELECT * FROM bookmarks ORDER BY annonce_id
DESC
> LIMIT 20
> 0.443 ms ANALYZE tmp
> 0.365 ms SELECT * FROM tmp
> 0.310 ms DROP TABLE tmp
> 32.918 ms COMMIT
The 32 ms for commit can hardly be catalog related. It seems the file
is fsynced before it is dropped.
On Tue, May 09, 2006 at 06:29:31PM +0200, PFC wrote:
You mean the cursors' storage is in fact the same internal machinery
as a temporary table ?
Use the source, Luke...
LOL, yeah, I should have, sorry.
See tuplestore_begin_heap in backend/utils/sort/tuplest
Have you tried getting a profile of what exactly PostgreSQL is doing
that takes so long when creating a temp table?
Nope, I'm not proficient in the use of these tools (I stopped using C
some time ago).
BTW, I suspect catalogs might be the answer,
Probably, because :
-
Speaking of which, if a temp table is defined as ON COMMIT DROP or
DELETE ROWS, there shouldn't be any need to store xmin/xmax, only
cmin/cmax, correct?
Yes, that's that type of table I was thinking about...
You can't ROLLBACK a transaction on such a table.
You can however rol
The problem is that you need a set-returning function to retrieve
the values. SRFs don't have rowcount estimates, so the plans suck.
What about adding some way of rowcount estimation to SRFs, in the way of:
CREATE FUNCTION foo (para, meters) RETURNS SETOF bar AS
$$ ... function code ...
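PostgreSQL 8.3 ended up adding exactly this, as the ROWS (and COST)
options; keeping the original's hypothetical names (bar with an integer
column b is assumed):

    CREATE FUNCTION foo(para integer, meters integer)
    RETURNS SETOF bar
    LANGUAGE sql
    ROWS 50  -- planner estimate, instead of the flat default of 1000
    AS $$ SELECT * FROM bar WHERE b BETWEEN para AND meters $$;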
> SELECT * FROM somewhere WHERE id IN (SELECT id FROM result)
Well, you can either
SELECT * FROM somewhere JOIN (SELECT id FROM result GROUP BY id) AS
a USING (id);
It's the same thing (and postgres knows it)
You might want to use PL to store values, say PLperl, or even C
Does the time for commit change much if you leave out the analyze?
Yes, when I don't ANALYZE the temp table, commit time changes from 30 ms
to about 15 ms ; but the queries get horrible plans (see below) :
Fun thing is, the rowcount from a temp table (which is the problem here)
should
Creating cursors for a simple plan like a single sequential scan is fast
because it's using the original data from the table.
I used the following query :
SELECT * FROM bookmarks ORDER BY annonce_id DESC LIMIT 20
It's a backward index scan + limit... not a seq scan. And it's
It would be interesting to know what the bottleneck is for temp tables
for you. They do not go via the buffer-cache, they are stored in
private memory in the backend, they are not xlogged. Nor flushed to
disk on backend exit. They're about as close to in-memory tables as
you're going to get...
In addition to your query, you are already transferring the whole result
set multiple times. First you copy it to the result table. Then you
read it again. Your subsequent queries will also have to read over
all the unneeded tuples just to get your primary key.
Considering that the resul
You might consider just selecting your primary key or a set of
primary keys to involved relations in your search query. If you
currently use "select *" this can make your result set very large.
Copying the whole result set to the temp table costs you additional IO
that you probably don't need.
It i
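A keys-only version of the temp-table trick, then (names and the search
predicate are hypothetical):

    CREATE TEMP TABLE result AS
        SELECT id FROM somewhere WHERE price < 100;  -- keys only, small
    ANALYZE result;                                  -- give the planner rowcounts
    SELECT s.* FROM somewhere s JOIN result r USING (id);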
The moral of the story is that you're probably better off running a
bunch of small selects than in trying to optimize things with one
gargantuan select.
Ever experiment with loading the parameters into a temp table and
joining to that?
Also, it might be worth re-testing that conclusion with
Bulk loading speed is irrelevant here - that is dominated by parsing,
which
we have covered copiously (har har) previously and have sped up by 500%,
which still makes Postgres < 1/2 the loading speed of MySQL.
Let's ask MySQL 4.0
LOAD DATA INFILE blah
0 errors, 666 warnings
SHOW
Just to add a little anarchy in your nice debate...
Who really needs all the results of a sort on your terabyte table ?
I guess not many people do a SELECT from such a table and want all the
results. So, this leaves :
- Really wanting all the results, to fetch using
On my machine (Laptop with Pentium-M 1.6 GHz and 512MB DDR333) I get the
following timings :
Big Joins Query will all the fields and no order by (I just put a SELECT
* in the first table) yielding about 6k rows :
=> 12136.338 ms
Replacing the SELECT * from the table with many fields by ju
I have asked him for the data and played with his queries, and obtained
massive speedups with the following queries :
http://boutiquenumerique.com/pf/miroslav/query.sql
http://boutiquenumerique.com/pf/miroslav/query2.sql
http://boutiquenumerique.com/pf/miroslav/materialize.sql
Note that my opt