Re: [PERFORM] How to "unique-ify" HUGE table?
Thank you all for the very helpful advice. Upping work_mem made it possible for me to generate the table within this century without bringing the server to a near standstill. I have not yet experimented with GROUP BY, but I'll do this next. Cheers, Kynn
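P.S. For the archives, here is roughly what the two changes look like; the work_mem value and the data source are placeholders, not the real ones:

-- a larger work_mem lets the hash/sort used for de-duplication stay in memory;
-- the value here is a stand-in, not the one I actually used
SET work_mem = '512MB';

-- the GROUP BY variant I'll try next (source_table stands in for the real query):
CREATE TEMP TABLE huge_table AS
  SELECT x, y
    FROM source_table
   GROUP BY x, y;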
[PERFORM] How to "unique-ify" HUGE table?
Hi everyone! I have a very large 2-column table (about 500M records) from which I want to remove duplicate records. I have tried many approaches, but they all take forever. The table's definition consists of two short TEXT columns. It is a temporary table generated from a query: CREATE TEMP TABLE huge_table AS SELECT x, y FROM ... ; Initially I tried CREATE TEMP TABLE huge_table AS SELECT DISTINCT x, y FROM ... ; but after waiting for nearly an hour I aborted the query, and repeated it after getting rid of the DISTINCT clause. Everything takes forever with this monster! It's uncanny. Even printing it out to a file takes forever, let alone creating an index for it. Any words of wisdom on how to speed this up would be appreciated. TIA! Kynn
Re: [PERFORM] How to profile an "SQL script"?
Andreas, Heikki: Thanks! Kynn
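P.S. For the archives, the simplest arrangement I've found, assuming the script is fed to psql, is to turn on per-statement timing at the top of the file -- a sketch ('some_table' and the database name are placeholders):

-- at the top of the script file (run with, e.g.:  psql -d mydb -f script.sql):
\timing
-- each statement that follows now reports its elapsed time, for example:
SELECT count(*) FROM some_table;   -- 'some_table' is just a placeholder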
[PERFORM] How to profile an "SQL script"?
Hi. I have a longish collection of SQL statements stored in a file that I run periodically via cron. Running this "script" takes a bit too long, even for a cron job, and I would like to streamline it. Is there a way to tell Postgres to print out, after each SQL statement is executed, how long it took to execute? Thanks! Kynn
[PERFORM] Performance and IN clauses
Hi. I have a Perl script whose main loop generates thousands of SQL updates of the form UPDATE edge SET keep = true WHERE node1 IN ( $node_list ) AND node2 = $node_id; ...where here $node_list stands for a comma-separated list of integers, and $node_id stands for some integer. The list represented by $node_list can be fairly long (on average it has around 900 entries, and can be as long as 30K entries), and I'm concerned about the performance cost of testing for inclusion in such a long list. Is this done by a sequential search? If so, is there a better way to write this query? (FWIW, I have two indexes on the edge table using btree( node1 ) and btree( node2 ), respectively.) Also, assuming that the optimal way to write the query depends on the length of $node_list, how can I estimate the "critical length" at which I should switch from one form of the query to the other? TIA! Kynn
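P.S. For example, would rewriting the statement as a join along these lines be any better? (Just a sketch; the node values are made up, standing in for $node_list and $node_id:)

UPDATE edge
   SET keep = true
  FROM ( VALUES (101), (102), (103) ) AS nodes( node1 )   -- made-up values
 WHERE edge.node1 = nodes.node1
   AND edge.node2 = 42;                                    -- made-up value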
Re: [PERFORM] The "many nulls" problem
On Fri, Mar 14, 2008 at 2:59 PM, Oleg Bartunov <[EMAIL PROTECTED]> wrote:

> have you seen contrib/hstore ? You can have one table with common attributes
> and hide others in hstore

That's interesting. I'll check it out. Thanks! Kynn
Re: [PERFORM] The "many nulls" problem
On Fri, Mar 14, 2008 at 3:46 PM, Heikki Linnakangas <[EMAIL PROTECTED]> wrote:

> From a performance point of view, I would go with a single table with
> NULL fields on PostgreSQL.

Wow. I'm so glad I asked! Thank you very much! Kynn
[PERFORM] The "many nulls" problem
It often happens that a particular piece of information is non-null for only a small minority of cases. A superficially different manifestation of this is when two pieces of information are identical in all but a small minority of cases; this can easily be mapped to the previous description by defining a null in one column to mean that its contents should be obtained from those of another column. A further variant is when one piece of information is a simple function of another one in all but a small minority of cases. (BTW, I vaguely recall that RDb theorists have a technical term for this particular design issue, but I don't remember it.)

In all these cases, the design choice, at least according to RDb 101, is between including a column in the table that will be NULL most of the time, or defining a second auxiliary table that references the first one and holds the non-redundant information for the minority of cases for which this is necessary (and maybe defining a VIEW that includes all the columns).

But for me it is a frequent occurrence that my quaint and simple RDb 101 reasoning doesn't really apply to PostgreSQL. Basically, Pg is too smart for it! For example, does a large proportion of NULLs really imply a lot of wasted space? Maybe this is true for fixed-length data types, but what about for types like TEXT or VARCHAR?

Just to be concrete, consider the case of a customers database for some home-shopping website. Suppose that, as it happens, for the majority of this site's customers the shipping and billing addresses are identical. Or consider the scenario of a company in which, for most employees, the email address can be readily computed from the first and last name using the rule First M. Last => [EMAIL PROTECTED], but the company allows some flexibility for special cases (e.g. for people like Yasuhiro Tanaka, who's known to everyone by his nickname, Yaz, the email is [EMAIL PROTECTED]; hardly anyone remembers or even knows his full name).

What's your schema design approach for such situations? How would you go about deciding whether the number of exceptional cases is small enough to warrant a second table? Of course, one could do a systematic profiling of various possible scenarios, but as a first approximation, what's your rule of thumb?

TIA!

Kynn
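P.S. To make the auxiliary-table alternative concrete for the email example, here is roughly what I have in mind (all names are made up, and the naming rule is simplified to first.last):

CREATE TABLE employee (
    employee_id serial PRIMARY KEY,
    first_name  text NOT NULL,
    last_name   text NOT NULL
);

-- only the exceptional cases get a row here
CREATE TABLE employee_email_exception (
    employee_id int PRIMARY KEY REFERENCES employee(employee_id),
    email       text NOT NULL
);

-- the view reconstitutes the full picture, falling back on the rule
CREATE VIEW employee_with_email AS
SELECT e.employee_id, e.first_name, e.last_name,
       COALESCE(x.email,
                lower(e.first_name || '.' || e.last_name) || '@example.com') AS email
  FROM employee e
  LEFT JOIN employee_email_exception x USING (employee_id);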
Re: [PERFORM] Joins and DELETE FROM
Thank you for your post. I finally spent some quality time with the query-planner section of the docs' server configuration chapter. Very instructive, even considering that most of it went over my head!

On Sat, Mar 8, 2008 at 4:08 PM, Tom Lane <[EMAIL PROTECTED]> wrote:

> ...have you got effective_cache_size set to something that's realistic for
> your machine?

I guess not. It was at the default value (128MB) on a machine with 4GB of RAM. It's not a dedicated server, though, so I'll set it to 1GB for now. But before doing so I need a clarification. The docs state that this parameter is used only for cost estimation, and has no effect on actual memory allocations. I imagine that if other memory-related settings are not somehow in line with it, it could lead to estimates that are out of touch with reality. If this is correct, what other memory-related parameters do I need to adjust to ensure that both the planner's estimates and the actual execution agree and fit well with the available memory?

> One problem with this test is that your smaller tables probably fit in
> memory whereas the big ones may not, so it's not a given that your test
> accurately reflects how the real query will go down.

That's a very helpful reminder. Thanks. Kynn
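P.S. To experiment before touching postgresql.conf, I figure I can just do something like the following in a session, reusing the test join from my other message (please correct me if that's not how it works):

-- affects the planner's cost estimates only; no memory is actually allocated
SET effective_cache_size = '1GB';
EXPLAIN ANALYZE SELECT * FROM T NATURAL JOIN B;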
Re: [PERFORM] Joins and DELETE FROM
On Sat, Mar 8, 2008 at 1:01 PM, Heikki Linnakangas <[EMAIL PROTECTED]> wrote:

> Kynn Jones wrote:
> > Hi!
> >
> > As part of a data warehousing project, I need to pre-process data downloaded from an external source, in the form of several large flat files. This preprocessing entails checking the validity of various data items, and discarding those that fail to pass all the checks.
> >
> > Currently, the code that performs the checks generates intermediate temporary tables of those bits of data that are invalid in some way. (This simplifies the process of generating various quality-control reports about the incoming data).
> >
> > The next step is to weed out the bad data from the main tables, and here's where I begin to get lost.
> >
> > To be concrete, suppose I have a table T consisting of 20 million rows, keyed on some column K. (There are no formal constraints on T at the moment, but one could define column K as T's primary key.) Suppose also that I have a second table B (for "bad") consisting of 20 thousand rows, and also keyed on some column K. For each value of B.K there is exactly one row in T such that T.K = B.K, and the task is to delete all these rows from T as efficiently as possible.
> >
> > My naive approach would be something like
> >
> > DELETE FROM T WHERE T.K IN ( SELECT K FROM B );
> >
> > ...which, according to EXPLAIN, is a terrible idea, because it involves sequentially scanning all 20 million rows of T just to delete about only 0.1% of them.
> >
> > It seems to me better to sequentially scan B and rely on an index on T to zero in on the few rows in T that must be deleted.
> >
> > Is this strategy something that can be done with plain SQL (even if to do this I must produce additional helper tables, indices, etc.), or must I write a stored procedure to implement it?
>
> The planner knows how to produce such a plan, so it must be thinking that it's not the fastest plan. Curious.
>
> Have you ANALYZEd the tables? You do have an index on T.K, right? What does EXPLAIN ANALYZE output look like? (you can do BEGIN; EXPLAIN ANALYZE ...; ROLLBACK; if you don't want to actually delete the rows)

Yes, all the tables have been vacuumed and analyzed, and there's an index on T.K (and also on B.K for good measure).

> You can try to coerce the planner to choose the indexscan with "set enable_seqscan=off", to see how fast it actually is.

Thanks, that was a useful trick. I tried it on a simpler case: just the natural join of T and B. (I also used smaller versions of the tables, but with a size ratio similar to the one in my hypothetical example.) Indeed, when I turn off sequential scans, the resulting query is over 2X faster.
my_db=> SET ENABLE_SEQSCAN TO ON;
my_db=> EXPLAIN ANALYZE SELECT * FROM T NATURAL JOIN B;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Hash Join  (cost=7634.14..371997.64 rows=219784 width=13) (actual time=176.065..12041.486 rows=219784 loops=1)
   Hash Cond: (t.k = b.k)
   ->  Seq Scan on t  (cost=0.00..172035.56 rows=10509456 width=13) (actual time=0.023..2379.407 rows=10509456 loops=1)
   ->  Hash  (cost=3598.84..3598.84 rows=219784 width=12) (actual time=171.868..171.868 rows=219784 loops=1)
         ->  Seq Scan on b  (cost=0.00..3598.84 rows=219784 width=12) (actual time=0.013..49.626 rows=219784 loops=1)
 Total runtime: 12064.966 ms
(6 rows)

my_db=> SET ENABLE_SEQSCAN TO OFF;
my_db=> EXPLAIN ANALYZE SELECT * FROM T NATURAL JOIN B;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Merge Join  (cost=0.00..423589.69 rows=219784 width=13) (actual time=0.114..5449.808 rows=219784 loops=1)
   Merge Cond: (t.k = b.k)
   ->  Index Scan using idx__t on t  (cost=0.00..386463.71 rows=10509456 width=13) (actual time=0.059..3083.182 rows=10509414 loops=1)
   ->  Index Scan using idx__b on b  (cost=0.00..8105.04 rows=219784 width=12) (actual time=0.044..69.659 rows=219784 loops=1)
 Total runtime: 5473.812 ms
(5 rows)

Honestly, I still have not learned to fully decipher the output of EXPLAIN/EXPLAIN ANALYZE. (The PostgreSQL docs are generally superb, IMO, but there's still a big hole on the subject of the query planner, including the interpretation of these query plans.) So it seems like turning off ENABLE_SEQSCAN is the way to go. I wonder how much faster the query would be if I could selectively turn off the sequential scan on T. (The one on B seems to me reasonable.)

> You could also write the query as DELETE FROM t USING b WHERE t.k = b.k, but I doubt it makes much difference.

You're right: no difference at all (same query plan).

Thanks again!

Kynn
[PERFORM] Joins and DELETE FROM
Hi! As part of a data warehousing project, I need to pre-process data downloaded from an external source, in the form of several large flat files. This preprocessing entails checking the validity of various data items, and discarding those that fail to pass all the checks.

Currently, the code that performs the checks generates intermediate temporary tables of those bits of data that are invalid in some way. (This simplifies the process of generating various quality-control reports about the incoming data.)

The next step is to weed out the bad data from the main tables, and here's where I begin to get lost.

To be concrete, suppose I have a table T consisting of 20 million rows, keyed on some column K. (There are no formal constraints on T at the moment, but one could define column K as T's primary key.) Suppose also that I have a second table B (for "bad") consisting of 20 thousand rows, and also keyed on some column K. For each value of B.K there is exactly one row in T such that T.K = B.K, and the task is to delete all these rows from T as efficiently as possible.

My naive approach would be something like

DELETE FROM T WHERE T.K IN ( SELECT K FROM B );

...which, according to EXPLAIN, is a terrible idea, because it involves sequentially scanning all 20 million rows of T just to delete about only 0.1% of them. It seems to me better to sequentially scan B and rely on an index on T to zero in on the few rows in T that must be deleted.

Is this strategy something that can be done with plain SQL (even if to do this I must produce additional helper tables, indices, etc.), or must I write a stored procedure to implement it?

TIA!

Kynn
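P.S. For concreteness, here is a sketch of the pieces I have in place and the two equivalent formulations of the delete that I know of (the index name is made up):

CREATE INDEX t_k_idx ON T ( K );   -- the index I'd hope the planner would use
ANALYZE T;
ANALYZE B;

-- the naive formulation:
DELETE FROM T WHERE T.K IN ( SELECT K FROM B );

-- an equivalent join-style formulation:
DELETE FROM T USING B WHERE T.K = B.K;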
Re: [PERFORM] Q on views and performance
On Mon, Feb 25, 2008 at 11:56 AM, Matthew <[EMAIL PROTECTED]> wrote:

> On Mon, 25 Feb 2008, Kynn Jones wrote:
> > This is just GREAT!!! It fits the problem to a tee.
>
> It makes the queries quick then?

It is good that you ask. Clearly you know the story: a brilliant-sounding optimization that in practice has only a small effect at best... I'm totally puzzled. It makes absolutely no sense to me...

For my analysis, in addition to creating the index on (type, zipk) that you suggested, I also added an extra column to T containing a random integer in the range 0..99, and created an index on this, so that I could produce a totally "shuffled" clustering. I compared the performance in going from a randomly clustered table to a (type, zipk)-clustered table, and the output of EXPLAIN was encouraging, but when I ran the actual queries under EXPLAIN ANALYZE the difference in execution time was negligible. Live and learn!

Actually, what's driving me absolutely insane is the documentation for EXPLAIN and for Pg's query planning in general. I've read the docs (in particular the chapter on performance), but I still can't make any sense of EXPLAIN's results, so I can't begin to understand why optimizations like the one you suggested turned out to be ineffective. For example, the first lines of two recent EXPLAIN ANALYZE outputs are

Nested Loop Left Join  (cost=58.00..1154.22 rows=626 width=26) (actual time=1.462..26.494 rows=2240 loops=1)

Merge Left Join  (cost=33970.96..34887.69 rows=58739 width=26) (actual time=106.961..126.589 rows=7042 loops=1)

The actual runtimes are 27ms and 128ms. The ratio 128/27 is much smaller than one would expect from the relative costs of the two queries. It looks like there is no proportionality at all between the estimated costs and the actual running times... (BTW, all these runs of EXPLAIN were done after calls to VACUUM ANALYZE.) This is one of the many things I don't understand about this case...

What I would like to be able to do is to at least make enough sense of query plans to determine whether they are reasonable or not. This requires knowing the algorithms behind each type of query tree node, but I have not found this info...

On the positive side, in the course of all this analysis I must have done *something* to improve the performance, because now even the unoptimized queries are running pretty fast (e.g. queries that used to take about 1.5 seconds are now taking 130ms). But unfortunately I don't know what it was that I did to bring this speed-up about!

Anyway, be that as it may, thank you very much for your suggestion.

Kynn
Re: [PERFORM] Q on views and performance
On Mon, Feb 25, 2008 at 8:45 AM, Matthew <[EMAIL PROTECTED]> wrote:

> On Fri, 22 Feb 2008, Kynn Jones wrote:
> > Hi. I'm trying to optimize...
> >
> > (Q1)  SELECT a1.word, a2.word
> >         FROM T a1 JOIN T a2 USING ( zipk )
> >        WHERE a1.type = <int1>
> >          AND a2.type = <int2>;
>
> Okay, try this:
>
> Create an index on T(type, zipk), and then CLUSTER on that index...

This is just GREAT!!! It fits the problem to a tee. Many, many thanks! Also, including zipk in the index is a really nice extra boost. (If you hadn't mentioned it I would have just settled for clustering only on type...) Thanks for that also! Kynn
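P.S. For the archives, the concrete statements come out to roughly this (the index name is made up):

CREATE INDEX t_type_zipk_idx ON T ( type, zipk );
CLUSTER t_type_zipk_idx ON T;   -- newer releases spell this: CLUSTER T USING t_type_zipk_idx
ANALYZE T;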
Re: [PERFORM] Q on views and performance
Hi, Dean. The system I'm working with is very similar "in spirit" to a large multilingual dictionary covering 100 languages. Using this analogy, the "type" column would correspond to the language, and the zipk column would correspond to some language-independent key associated with a concept ("concept key" for short). So, if it were indeed a multilingual dictionary, records in T would look like

  word   | zipk | language
---------+------+-----------
 house   | 1234 | <English>
 casa    | 1234 | <Spanish>
 haus    | 1234 | <German>
 piano   | 2345 | <English>
 piano   | 2345 | <Italian>
 cat     | 3456 | <English>
 chat    | 3456 | <French>
 chat    | 4567 | <English>
 plausch | 4567 | <German>

...where I used the notation <lang> to denote "the integer id assigned to language lang". Therefore typically there are about 100 records in T for any given zipk, one for each language. But the correspondence is not perfect, since, for example, some languages have, proverbially, more than one word for snow, and some (maybe from some tropical island in the South Pacific) have none. (This last case, BTW, is what accounts for the use of left joins, as will become clear in a minute.)

The table S can be thought of as a table consisting of a collection of words to be translated to some target language. In the first type of query (Q1), all the words in S are effectively declared to belong to the same source language, whereas in the second type of query (Q2) the source language for the words in S is left unspecified (in this case S may contain words from various languages, or words--like "piano" or "chat" in the example above--that belong simultaneously to different languages, and which may (e.g. piano) or may not (e.g. chat) have the same zipk [concept key] for each of these languages).

So, regarding your question about (Q1**) and (Q2**):

(Q1**)  SELECT a1.word, sq.word
          FROM S
          JOIN T a1 USING ( word )
     LEFT JOIN ( SELECT * FROM T a2 WHERE a2.type = <int2> ) sq USING ( zipk )
         WHERE a1.type = <int1>;

(Q2**)  SELECT a1.word, sq.word
          FROM S
          JOIN T a1 USING ( word )
     LEFT JOIN ( SELECT * FROM T a2 WHERE a2.type = <int2> ) sq USING ( zipk )

...the inner join with S is intended to pick out all the records in the source table (either T<int1> in Q1** or T in Q2**) corresponding to words in S, while the second (left) join is there to find all the "translations" in the target language. I use a left join so that even those words in S for which no translations exist will show up in the query results.

> 3. Why not write:
>
> CREATE VIEW txt AS
>   SELECT a1.word AS word1, a1.type AS type1, a2.word AS word2, a2.type AS type2
>     FROM T a1 [LEFT] JOIN T a2 USING( zipk );  -- Use "LEFT" if appropriate
>
> SELECT word1, word2
>   FROM S JOIN txt ON word = word1
>  WHERE type1 = <int1> AND type2 = <int2>;

This would indeed produce the same results as Q1, but this approach would require defining about 10,000 views, one for each possible pair of int1 and int2 (or pair of languages, to continue the multilingual-dictionary analogy), which freaks me out for some reason. (Actually, the number of such views would be many more than that, because in the actual application there is not just one T but several dozen, similar to what would happen to the schema in the multilingual-dictionary analogy if we wanted to pre-segregate the words according to some categories, say a T for animals, a T for fruits, a T for verbs, a T for professions, etc.)

(I need to do a bit more work before I can post the EXPLAIN results.)

kynn
Re: [PERFORM] Q on views and performance
On Fri, Feb 22, 2008 at 8:48 PM, Dean Gibson (DB Administrator) <[EMAIL PROTECTED]> wrote:

> On 2008-02-22 12:49, Kynn Jones wrote:
> > Of course, I expect that using views V<int1> and V<int2>... would
> > result in a loss in performance relative to a version that used bona
> > fide tables T<int1> and T<int2>. My question is, how can I minimize
> > this performance loss?
>
> That used to be my thoughts too, but I have found over the years that
> the PostgreSQL execution planner is able to "flatten" SELECTs using
> VIEWs, ALMOST ALWAYS in a way that does not adversely affect
> performance, and often gives an IMPROVEMENT in performance, probably
> because by using VIEWs I am stating the query problem in a better way
> than if I try to guess the best way to optimize a SELECT.
>
> I have at least a 10:1 ratio of VIEWs to TABLEs. Occasionally, with
> some query that is slow, I will try to rewrite it without VIEWs. This
> ALMOST NEVER results in an improvement in performance, and when it does,
> I am able to find another way to write the VIEW and SELECT to recapture
> the gain.

Since you have experience working with views, let me ask you this. The converse strategy to the one I described originally would be to create the individual tables T1, T2, T3, ..., T100, but instead of keeping around the original (and now redundant) table T, replace it with a view V made up of the union of T1, T2, T3, ..., T100.

The problem with this alternative is that one cannot index V, or define a primary key constraint for it, because it's a view. This means that a search in V, even for a primary key value, would *have to be* very inefficient (i.e. I don't see how even the very clever PostgreSQL implementers could get around this one!), because the engine would have to search *all* the underlying tables, T1 through T100, even if it found the desired record in T1, since it has no way of knowing that the value is unique all across V.

Is there a way around this?

kynn
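P.S. Schematically, what I have in mind is something like this (only three of the hundred subtables shown, and the column definitions are simplified):

CREATE TABLE t1 ( word text NOT NULL, zipk int NOT NULL );
CREATE TABLE t2 ( word text NOT NULL, zipk int NOT NULL );
CREATE TABLE t3 ( word text NOT NULL, zipk int NOT NULL );

CREATE VIEW v AS
          SELECT 1 AS type, word, zipk FROM t1
UNION ALL SELECT 2,         word, zipk FROM t2
UNION ALL SELECT 3,         word, zipk FROM t3;

-- Even a lookup by what ought to be a unique value has to visit every
-- branch of the union, since nothing tells the planner otherwise:
SELECT * FROM v WHERE word = 'house';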
Re: [PERFORM] Q on views and performance
On Fri, Feb 22, 2008 at 8:48 PM, Dean Gibson (DB Administrator) <[EMAIL PROTECTED]> wrote:

> On 2008-02-22 12:49, Kynn Jones wrote:
> > Of course, I expect that using views V<int1> and V<int2>... would
> > result in a loss in performance relative to a version that used bona
> > fide tables T<int1> and T<int2>. My question is, how can I minimize
> > this performance loss?
>
> That used to be my thoughts too, but I have found over the years that
> the PostgreSQL execution planner is able to "flatten" SELECTs using
> VIEWs, ALMOST ALWAYS in a way that does not adversely affect
> performance, and often gives an IMPROVEMENT in performance, probably
> because by using VIEWs I am stating the query problem in a better way
> than if I try to guess the best way to optimize a SELECT.

Well, the last consideration you mention there does not apply to the two alternatives I'm comparing, because they differ only in that one uses views V1, V2, V3, ..., V100 where the other one uses the corresponding tables T1, T2, T3, ..., T100, so the query statements would be identical in both cases.

> I have at least a 10:1 ratio of VIEWs to TABLEs. Occasionally, with
> some query that is slow, I will try to rewrite it without VIEWs. This
> ALMOST NEVER results in an improvement in performance...

That's truly amazing! Just to make sure I get you right, you're saying that when you replace a view by its equivalent table you see no performance gain? How could that be? With views, every query entails the additional work of searching the underlying tables for the records that make up the views...

OK, if I think a bit more about it, I suppose that a view could be implemented, for performance, as a special sort of table consisting of a single column of pointers to the "true" records, in which case using views would entail only the cost of this indirection, and not the cost of a search... (And also the cost of maintaining this pointer table, if the underlying tables are mutable.) So I guess views could be implemented in such a way that the difference in SELECT performance relative to replacing them with tables would be negligible...

Anyway, your post once again reminded me of the awesomeness of PostgreSQL. Props to the developers!

kynn
[PERFORM] Q on views and performance
Hi. I'm trying to optimize the performance of a database whose main purpose is to support two (rather similar) kinds of queries. The first kind, which is expected to be the most common (I estimate it will account for about 90% of all the queries performed on this DB), has the following general structure:

(Q1)  SELECT a1.word, a2.word
        FROM T a1 JOIN T a2 USING ( zipk )
       WHERE a1.type = <int1>
         AND a2.type = <int2>;

...where <int1> and <int2> stand for some two integers. In English, this query essentially executes an inner join between two "virtual subtables" of table T, which are defined by the value of the type column. For brevity, I will refer to these (virtual) subtables as T<int1> and T<int2>. (I should point out that T holds about 2 million records, spread roughly evenly over about 100 values of the type column. So each of these virtual subtables has about 20K records. Also, for practical purposes T may be regarded as an immutable, read-only table, since it gets re-built from scratch about once a month. And, FWIW, all the columns mentioned in this post have a NOT NULL constraint.)

The second form is similar to the first, except that now the join is taken between T and T<int2>:

(Q2)  SELECT a1.word, a2.word
        FROM T a1 JOIN T a2 USING ( zipk )
       WHERE a2.type = <int2>;

(Both forms above are somewhat oversimplified relative to the actual situation; in our actual application, the joins are actually left outer ones, and each query also involves an additional inner join with another table, S. For the sake of completeness, I give the "real-world" versions of these queries at the end of this post, but I think that for the purpose of my question the additional complications they entail can be neglected.)

One way to speed up (Q1) would be to break T into its subtables, i.e. to create T1, T2, T3, ..., T100 as bona fide tables. Then the query would become a simple join without the two conditions of the original's WHERE clause, which I figure should make it noticeably faster. But since the second kind of query (Q2) requires T, we can't get rid of this table, so all the data would need to be stored twice, once in T and once in some T<int>.

In trying to come up with a way around this duplication, it occurred to me that instead of creating tables T1, T2, etc., I could create the analogous views V1, V2, etc. (e.g. CREATE VIEW V1 AS SELECT * FROM T WHERE type = 1). With this design, the two queries above would become

(Q1*)  SELECT V<int1>.word, V<int2>.word
         FROM V<int1> JOIN V<int2> USING ( zipk );

(Q2*)  SELECT T.word, V<int2>.word
         FROM T JOIN V<int2> USING ( zipk );

Of course, I expect that using views V<int1> and V<int2>... would result in a loss in performance relative to a version that used bona fide tables T<int1> and T<int2>. My question is, how can I minimize this performance loss?

More specifically, how can I go about building table T and the views V<int> to maximize the performance of (Q1)? For example, I'm thinking that if T had an additional id column and were built in such a way that all the records belonging to each V<int> were physically contiguous, and (say) had contiguous values in the id column, then I could define each view like this:

CREATE VIEW V<int> AS SELECT * FROM T WHERE <id1> <= id AND id < <id2>;

So my question is, what design would make querying V1, V2, V3, ... as fast as possible? Is it possible to approach the performance of the design that uses bona fide tables T1, T2, T3, ... instead of views V1, V2, V3, ...?

Thank you very much for reading this long post, and many thanks in advance for your comments!

Kynn

P.S. Here are the actual forms of the queries.
They now include an initial join with table S, and the join with T<int2> (or V<int2>) is a left outer join. Interestingly, even though the queries below that use views (i.e. Q1*** and Q2***) are not much more complex-looking than before, the other two (Q1** and Q2**) are. I don't know if this is because of my ineptitude with SQL, but I am not able to render (Q1**) and (Q2**) without resorting to the subquery sq.

(Q1**)  SELECT a1.word, sq.word
          FROM S
          JOIN T a1 USING ( word )
     LEFT JOIN ( SELECT * FROM T a2 WHERE a2.type = <int2> ) sq USING ( zipk )
         WHERE a1.type = <int1>;

(Q2**)  SELECT a1.word, sq.word
          FROM S
          JOIN T a1 USING ( word )
     LEFT JOIN ( SELECT * FROM T a2 WHERE a2.type = <int2> ) sq USING ( zipk )

(Q1***)  SELECT V<int1>.word, V<int2>.word
           FROM S
           JOIN V<int1> USING ( word )
      LEFT JOIN V<int2> USING ( zipk );

(Q2***)  SELECT T.word, V<int2>.word
           FROM S
           JOIN T USING ( word )
      LEFT JOIN V<int2> USING ( zipk );
[PERFORM] Basic Q on superfluous primary keys
Consider these two very similar schemas:

Schema 1:

CREATE TABLE foo (
  id serial PRIMARY KEY,
  frobnitz character varying(100) NOT NULL UNIQUE
);

CREATE TABLE bar (
  id serial PRIMARY KEY,
  foo_id int REFERENCES foo(id)
);

Schema 2:

CREATE TABLE foo (
  frobnitz character varying(100) PRIMARY KEY
);

CREATE TABLE bar (
  id serial PRIMARY KEY,
  frobnitz character varying(100) REFERENCES foo(frobnitz)
);

The two situations are semantically identical: each record in table bar refers to a record in table foo. The difference is that in the first schema, this referencing is done through an "artificial" serial-integer primary key, while in the second schema the reference is done through a data field that happens to be unique and not null, so it can serve as the primary key.

I find Schema 1 awkward and unnatural; more specifically, foo.id seems unnecessary in light of the non-null uniqueness of foo.frobnitz. But I remember once reading that "long" fields like foo.frobnitz do not make good primary keys.

Is the field foo.id in Schema 1 superfluous? For example, wouldn't the referencing from bar to foo really be done "behind the scenes" through some hidden field (oid?) instead of through the frobnitz text field? Which of the two schemas would give better performance?

Thanks!

kj
Re: [PERFORM] Optimizing a huge_table/tiny_table join
On 5/24/06, Tom Lane <[EMAIL PROTECTED]> wrote:

> <[EMAIL PROTECTED]> writes:
> >  Limit  (cost=19676.75..21327.99 rows=6000 width=84)
> >    ->  Hash Join  (cost=19676.75..1062244.81 rows=3788315 width=84)
> >          Hash Cond: (upper(("outer".id)::text) = upper(("inner".id)::text))
> >          ->  Seq Scan on huge_table h  (cost=0.00..51292.43 rows=2525543 width=46)
> >          ->  Hash  (cost=19676.00..19676.00 rows=300 width=38)
> >                ->  Seq Scan on tiny_table t  (cost=0.00..19676.00 rows=300 width=38)
>
> Um, if huge_table is so much bigger than tiny_table, why are the cost
> estimates for seqscanning them only about 2.5x different? There's
> something wacko about your statistics, methinks.

You mean there's a bug in explain? I agree that it makes no sense that the costs don't differ as much as one would expect, but you can see right there the numbers of rows for the two tables, which are exactly as I described. At any rate, how would one go about finding an explanation for these strange stats?

More bewildering still (and infuriating as hell--because it means that all of my work from yesterday has been wasted) is that I can no longer reproduce the best query plan, even though the tables have not changed at all. (Hence I can't post the EXPLAIN ANALYZE for the best query plan.) No matter what value I use for LIMIT, the query planner now insists on sequentially scanning huge_table and ignoring the available index. (If I turn off enable_seqscan, I get the second-worst query plan I posted yesterday.)

Anyway, I take it that there is no way to bypass the optimizer and instruct PostgreSQL exactly how one wants the search performed?

Thanks!

kj
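P.S. Following up on the statistics remark, the first thing I plan to re-check is roughly this:

-- refresh the planner's statistics
ANALYZE huge_table;
ANALYZE tiny_table;

-- then compare the planner's notion of the table sizes with reality:
SELECT relname, reltuples, relpages
  FROM pg_class
 WHERE relname IN ('huge_table', 'tiny_table');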