Re: [PERFORM] store A LOT of 3-tuples for comparisons

2008-02-23 Thread Moritz Onken



I am also wondering about the ordering and whether that matters.
Can you have "he, can, drink" as well as "drink, he, can",
and should they be considered the same? If so you will need a
different tactic.




Ordering matters, so the 3-column tactic should work.

Thanks for your advice!
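A sketch of that 3-column tactic (table and column names are hypothetical, not from the thread):

```sql
-- One row per ordered 3-tuple; the composite primary key makes the
-- ordering significant and gives exact-match lookups a single index probe.
CREATE TABLE triple (
    w1 text NOT NULL,
    w2 text NOT NULL,
    w3 text NOT NULL,
    PRIMARY KEY (w1, w2, w3)
);

-- "he, can, drink" and "drink, he, can" are distinct rows:
SELECT 1 FROM triple
 WHERE w1 = 'he' AND w2 = 'can' AND w3 = 'drink';
```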

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [PERFORM] Q on views and performance

2008-02-23 Thread Kynn Jones
On Fri, Feb 22, 2008 at 8:48 PM, Dean Gibson (DB Administrator) 
[EMAIL PROTECTED] wrote:

 On 2008-02-22 12:49, Kynn Jones wrote:
  Of course, I expect that using views Vint1 and Vint2... would
  result in a loss in performance relative to a version that used bona
  fide tables Tint1 and Tint2.  My question is, how can I minimize
  this performance loss?

 That used to be my thoughts too, but I have found over the years that
 the PostgreSQL execution planner is able to flatten SELECTs using
 VIEWs, ALMOST ALWAYS in a way that does not adversely affect
 performance, and often gives an IMPROVEMENT in performance, probably
 because by using VIEWs I am stating the query problem in a better way
 than if I try to guess the best way to optimize a SELECT.


Well, the last consideration you mention there does not apply to the two
alternatives I'm comparing because they differ only in that one uses views
V1, V2, V3, ... , V100 where the other one uses the corresponding tables T1,
T2, T3, ... , T100, so the query statements would be identical in both
cases.


 I have at least a 10:1 ratio of VIEWs to TABLEs.  Occasionally, with
 some query that is slow, I will try to rewrite it without VIEWs.  This
 ALMOST NEVER results in an improvement in performance...


That's truly amazing!  Just to make sure I get you right, you're saying that
when you replace a view by its equivalent table you see no performance gain?
 How could it be?  With views every query entails the additional work of
searching the underlying tables for the records that make up the views...

OK, if I think a bit more about it I suppose that a view could be
implemented for performance as a special sort of table consisting of a
single column of pointers to the true records, in which case using views
would entail only the cost of this indirection, and not the cost of a
search...  (And also the cost of maintaining this pointer table, if the
underlying tables are mutable.)  So I guess views could be implemented in
such a way that the difference in SELECT performance relative to replacing
them with tables would be negligible...
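For what it's worth, PostgreSQL doesn't need a pointer table for this: a view is stored as a rewrite rule, and the view's defining query is spliced into any SELECT that references it before planning. A quick way to see this (table and column names hypothetical):

```sql
CREATE VIEW v AS SELECT * FROM t WHERE flag;

-- These two statements normally yield identical plans, because the
-- view reference is expanded before the planner runs:
EXPLAIN SELECT * FROM v WHERE id = 42;
EXPLAIN SELECT * FROM t WHERE flag AND id = 42;
```

That expansion is consistent with Dean's observation that rewriting queries without views rarely helps.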

Anyway, your post once again reminded me of the awesomeness of PostgreSQL.
 Props to the developers!

kynn


Re: [PERFORM] Q on views and performance

2008-02-23 Thread Kynn Jones
On Fri, Feb 22, 2008 at 8:48 PM, Dean Gibson (DB Administrator) 
[EMAIL PROTECTED] wrote:

 On 2008-02-22 12:49, Kynn Jones wrote:
  Of course, I expect that using views Vint1 and Vint2... would
  result in a loss in performance relative to a version that used bona
  fide tables Tint1 and Tint2.  My question is, how can I minimize
  this performance loss?

 That used to be my thoughts too, but I have found over the years that
 the PostgreSQL execution planner is able to flatten SELECTs using
 VIEWs, ALMOST ALWAYS in a way that does not adversely affect
 performance, and often gives an IMPROVEMENT in performance, probably
 because by using VIEWs I am stating the query problem in a better way
 than if I try to guess the best way to optimize a SELECT.

 I have at least a 10:1 ratio of VIEWs to TABLEs.  Occasionally, with
 some query that is slow, I will try to rewrite it without VIEWs.  This
 ALMOST NEVER results in an improvement in performance, and when it does,
 I am able to find another way to write the VIEW and SELECT to recapture
 the gain.


Since you have experience working with views, let me ask you this.  The
converse strategy to the one I described originally would be to create the
individual tables T1, T2, T3, ..., T100, but instead of keeping around the
original (and now redundant) table T, replace it with a view V made up of
the union of T1, T2, T3, ..., T100.  The problem with this alternative is
that one cannot index V, or define a primary key constraint for it, because
it's a view.  This means that a search in V, even for a primary key value,
would *have to be* very inefficient (i.e. I don't see how even the very
clever PostgreSQL implementers could get around this one!), because the
engine would have to search *all* the underlying tables, T1 through T100,
even if it found the desired record in T1, since it has no way of knowing
that the value is unique all across V.
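The construction being described, sketched with hypothetical names:

```sql
CREATE VIEW v AS
    SELECT * FROM t1
    UNION ALL SELECT * FROM t2
    -- ... one branch per underlying table ...
    UNION ALL SELECT * FROM t100;

-- Even a lookup by the would-be primary key has to consider every branch:
SELECT * FROM v WHERE word = 'house';
```

(In the PostgreSQL of this era, the usual way to let the planner skip branches is constraint_exclusion together with a CHECK constraint on each underlying table; that keys on a constrained column's values, though, not on cross-table uniqueness.)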

Is there a way around this?

kynn


Re: [PERFORM] Q on views and performance

2008-02-23 Thread Dean Gibson (DB Administrator)

On 2008-02-23 05:59, Kynn Jones wrote:
On Fri, Feb 22, 2008 at 8:48 PM, Dean Gibson (DB Administrator) 
[EMAIL PROTECTED] wrote:


...


Since you have experience working with views, let me ask you this. 
 The converse strategy to the one I described originally would be to 
create the individual tables T1, T2, T3, ..., T100, but instead of 
keeping around the original (and now redundant) table T, replace it 
with a view V made up of the union of T1, T2, T3, ..., T100.  The 
problem with this alternative is that one cannot index V, or define a 
primary key constraint for it, because it's a view.  This means that a 
search in V, even for a primary key value, would *have to be* very 
inefficient (i.e. I don't see how even the very clever PostgreSQL 
implementers could get around this one!), because the engine would 
have to search *all* the underlying tables, T1 through T100, even if 
it found the desired record in T1, since it has no way of knowing that 
the value is unique all across V.


Is there a way around this?

kynn

Oh, I wouldn't create separate tables and do a UNION of them, I'd think 
that would be inefficient.


I didn't look in detail at your previous email, but I will now:

1. You haven't told us the distribution of zipk, or what the tables 
are indexed on, or what type of performance you are expecting.  Your 
initial examples don't help much unless you actually have performance 
numbers or EXPLAIN output for them, since adding the third JOIN 
significantly changes the picture, as does changing one of the JOINs to 
a LEFT JOIN.


2. In your actual (Q1** and Q2**) examples, why is one JOIN an INNER 
JOIN and the other one a LEFT JOIN?  Given your description of Q1 at the 
top of your message, that doesn't make sense to me.


3. Why not write:

CREATE VIEW txt AS
  SELECT a1.word AS word1, a1.type AS type1, a2.word AS word2, a2.type AS type2
    FROM T a1 [LEFT] JOIN T a2 USING( zipk );  -- Use LEFT if appropriate

SELECT word1, word1
  FROM S JOIN txt ON word = word1
  WHERE type1 = int1 AND type2 = int2;

If either of those (either with or without the LEFT) are not 
equivalent to your problem, how about just:


SELECT a1.word AS word1, a2.word AS word2
 FROM S JOIN T a1 USING( word)
   [LEFT] JOIN T a2 USING( zipk )  -- Use LEFT if appropriate
 WHERE a1.type = int1 AND a2.type = int2;

Show us (using EXPLAIN) what the query planner thinks of each of these.

--
Mail to my list address MUST be sent via the mailing list.
All other mail to my list address will bounce.



Re: [PERFORM] Q on views and performance

2008-02-23 Thread Dean Gibson (DB Administrator)

On 2008-02-23 07:08, Dean Gibson (DB Administrator) wrote:

...


SELECT word1, word1
  FROM S JOIN txt ON word = word1
  WHERE type1 = int1 AND type2 = int2;

...

Oops that should be:

SELECT word1, word2
 FROM S JOIN txt ON word = word1
 WHERE type1 = int1 AND type2 = int2;





Re: [PERFORM] Q on views and performance

2008-02-23 Thread Kynn Jones
Hi, Dean.  The system I'm working with is very similar in spirit to a
large multilingual dictionary covering 100 languages.  Using this analogy,
the type column would correspond to the language, and the zipk column
would correspond to some language-independent key associated with a concept
(concept key for short).  So, if it were indeed a multilingual dictionary,
records in T would look like
  word   | zipk |  language
---------+------+------------
 house   | 1234 | <english>
 casa    | 1234 | <spanish>
 haus    | 1234 | <german>
 piano   | 2345 | <english>
 piano   | 2345 | <spanish>
 cat     | 3456 | <english>
 chat    | 3456 | <french>
 chat    | 4567 | <english>
 plausch | 4567 | <german>

...where I used the notation <lang> to denote the integer id assigned to
language lang.  Therefore typically there are about 100 records in T for
any given zipk, one for each language.  But the correspondence is not
perfect, since, for example, some languages have, proverbially, more than
one word for snow, and some (maybe from some tropical island in the South
Pacific) have none.  (This last case, BTW, is what accounts for the use of
left joins, as will become clear in a minute.)

The table S can be thought of as a collection of words to be translated
to some target language.  In the first type of query (Q1), all
the words in S are effectively declared to belong to the same source
language, whereas in the second type of query (Q2) the source language for
the words in S is left unspecified (in this case S may contain words from
various languages, or words--like "piano" or "chat" in the example
above--that belong simultaneously to different languages, and which may (e.g.
"piano") or may not (e.g. "chat") have the same zipk [concept key] for each of
these languages).

So, regarding your question about (Q1**) and (Q2**):

(Q1**)  SELECT a1.word, sq.word
          FROM S JOIN T a1 USING ( word )
               LEFT JOIN ( SELECT * FROM T a2
                            WHERE a2.type = int2 ) sq USING ( zipk )
         WHERE a1.type = int1;

(Q2**)  SELECT a1.word, sq.word
          FROM S JOIN T a1 USING ( word )
               LEFT JOIN ( SELECT * FROM T a2
                            WHERE a2.type = int2 ) sq USING ( zipk );

...the inner join with S is intended to pick out all the records in the
source table (either Tint1 in Q1** or T in Q2**) corresponding to words in
S, while the second (left) join, is there to find all the translations in
the target language.  I use a left join so that even those words in S for
which no translations exist will show up in the query results.
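The left-join semantics can be reproduced in miniature with the toy dictionary data above. This sketch uses Python's sqlite3 standing in for PostgreSQL, and the integer language ids are made up:

```python
import sqlite3

# Toy reproduction of Q1** (English -> German) over the sample
# dictionary data from this message. sqlite3 stands in for PostgreSQL;
# the integer language ids are made up.
ENGLISH, SPANISH, GERMAN, FRENCH = 1, 2, 3, 4

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE t (word TEXT, zipk INT, type INT);
    CREATE TABLE s (word TEXT);
""")
con.executemany("INSERT INTO t VALUES (?, ?, ?)", [
    ("house", 1234, ENGLISH), ("casa", 1234, SPANISH), ("haus", 1234, GERMAN),
    ("piano", 2345, ENGLISH), ("piano", 2345, SPANISH),
    ("cat", 3456, ENGLISH), ("chat", 3456, FRENCH),
    ("chat", 4567, ENGLISH), ("plausch", 4567, GERMAN),
])
con.executemany("INSERT INTO s VALUES (?)", [("house",), ("cat",)])

# The LEFT JOIN keeps 'cat' even though it has no German translation.
rows = con.execute("""
    SELECT a1.word, sq.word FROM
        s JOIN t a1 USING (word)
          LEFT JOIN (SELECT * FROM t a2 WHERE a2.type = ?) sq USING (zipk)
    WHERE a1.type = ?
    ORDER BY a1.word
""", (GERMAN, ENGLISH)).fetchall()
print(rows)  # [('cat', None), ('house', 'haus')]
```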

3. Why not write:

 CREATE VIEW txt AS
   SELECT a1.word AS word1, a1.type AS type1, a2.word AS word2, a2.type AS
 type2
 FROM T a1 [LEFT] JOIN T a2 USING( zipk );  -- Use LEFT if
 appropriate
 SELECT word1, word1
   FROM S JOIN txt ON word = word1
   WHERE type1 = int1 AND type2 = int2;


This would indeed produce the same results as Q1, but this approach would
require defining about 10,000 views, one for each possible pair of int1 and
int2 (or pair of languages, to continue the multilingual dictionary
analogy), which freaks me out for some reason.  (Actually, the number of
such views would be many more than that, because in the actual application
there is not just one T but several dozen, similar to what would happen to
the schema in the multilingual dictionary analogy if we wanted to
pre-segregate the words according to some categories, say a T for animals, a
T for fruits, a T for verbs, a T for professions, etc.)

(I need to do a bit more work before I can post the EXPLAIN results.)

kynn


Re: [PERFORM] Q on views and performance

2008-02-23 Thread Dean Gibson (DB Administrator)

On 2008-02-23 08:21, Kynn Jones wrote:

...

3. Why not write:

CREATE VIEW txt AS
  SELECT a1.word AS word1, a1.type AS type1, a2.word AS word2,
a2.type AS type2
FROM T a1 [LEFT] JOIN T a2 USING( zipk );  -- Use LEFT if
appropriate
SELECT word1, word1
  FROM S JOIN txt ON word = word1
  WHERE type1 = int1 AND type2 = int2;


This would indeed produce the same results as Q1, but this approach 
would require defining about 10,000 views, one for each possible pair 
of int1 and int2


Why 10,000 views???  What's wrong with the ONE view above?  You DON'T 
want to be defining VIEWs based on actual table VALUES;  leave that to 
the SELECT.  For that matter, what's wrong with the final SELECT I 
listed (below)?


SELECT a1.word AS word1, a2.word AS word2
 FROM S JOIN T a1 USING( word )
   LEFT JOIN T a2 USING( zipk )
 WHERE a1.type = int1 AND a2.type = int2;

-- Dean




Re: [PERFORM] Q on views and performance

2008-02-23 Thread Dean Gibson (DB Administrator)

On 2008-02-23 08:49, Dean Gibson (DB Administrator) wrote:
Why 10,000 views???  What's wrong with the ONE view above?  You DON'T 
want to be defining VIEWs based on actual table VALUES;  leave that 
to the SELECT.  For that matter, what's wrong with the final SELECT I 
listed (below)?


SELECT a1.word AS word1, a2.word AS word2
  FROM S JOIN T a1 USING( word )
LEFT JOIN T a2 USING( zipk )
  WHERE a1.type = int1 AND a2.type = int2;

-- Dean

Amendment:  I forgot, that if it's a LEFT JOIN you have to write it as:

SELECT a1.word AS word1, a2.word AS word2
 FROM S JOIN T a1 USING( word )
   LEFT JOIN T a2 USING( zipk )
 WHERE a1.type = int1 AND (a2.type = int2 OR a2.type IS NULL);

-- Dean
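One caveat worth checking on the amendment above: with a LEFT JOIN, filtering the nullable side in the outer WHERE is not in general equivalent to filtering inside the join itself. A small counterexample sketch, using Python's sqlite3 (standing in for PostgreSQL) and the thread's toy data, with made-up language ids:

```python
import sqlite3

# 'cat' (English) has a French counterpart sharing its zipk, but no
# German one. The integer language ids here are made up.
ENGLISH, FRENCH, GERMAN = 1, 2, 3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE t (word TEXT, zipk INT, type INT);
    CREATE TABLE s (word TEXT);
    INSERT INTO t VALUES ('cat', 3456, 1), ('chat', 3456, 2);
    INSERT INTO s VALUES ('cat');
""")

# Variant A: predicate in the outer WHERE, as in the amendment. The
# zipk join finds the French row, so a2.type is neither GERMAN nor
# NULL, and 'cat' drops out of the result entirely.
a = con.execute("""
    SELECT a1.word, a2.word
      FROM s JOIN t a1 USING (word)
           LEFT JOIN t a2 USING (zipk)
     WHERE a1.type = ? AND (a2.type = ? OR a2.type IS NULL)
""", (ENGLISH, GERMAN)).fetchall()

# Variant B: predicate inside the join condition, so 'cat' survives
# with a NULL translation -- the behavior Q1**'s subquery form gives.
b = con.execute("""
    SELECT a1.word, a2.word
      FROM s JOIN t a1 USING (word)
           LEFT JOIN t a2 ON a2.zipk = a1.zipk AND a2.type = ?
     WHERE a1.type = ?
""", (GERMAN, ENGLISH)).fetchall()

print(a)  # []
print(b)  # [('cat', None)]
```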




[PERFORM] LISTEN / NOTIFY performance in 8.3

2008-02-23 Thread Joel Stevenson

Hi,

I'm noticing a strange increase in the amount of time it takes to 
issue a NOTIFY statement.


I have an existing app that provides a producer / consumer type of 
queue and that uses the LISTEN / NOTIFY mechanism to signal the 
consumers of new items arriving in the queue.  The consumers then 
process these items and issue a notify to signal that they have been 
processed.  In the past issuing these notifications happened very 
quickly, now on 8.3 I'm seeing all of them taking over 300ms and many 
of them taking 1500ms or more!  The notifications are happening 
outside of any transactions (which is itself a probable area for 
performance improvement, I realize) but I'm wondering what might have 
changed between 8.1 (the version I was using in the past) and 8.3?


TIA,
Joel
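For readers following along, the basic mechanism looks like this (the channel name is hypothetical):

```sql
-- Consumer session registers interest:
LISTEN queue_events;

-- Producer session signals, typically after committing new queue rows:
NOTIFY queue_events;
```

In the 8.x releases both operations go through the pg_listener catalog table, a detail that turns out to matter in the follow-ups below.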



Re: [PERFORM] LISTEN / NOTIFY performance in 8.3

2008-02-23 Thread Tom Lane
Joel Stevenson [EMAIL PROTECTED] writes:
 I have an existing app that provides a producer / consumer type of 
 queue and that uses the LISTEN / NOTIFY mechanism to signal the 
 consumers of new items arriving in the queue.  The consumers then 
 process these items and issue a notify to signal that they have been 
 processed.  In the past issuing these notifications happened very 
 quickly, now on 8.3 I'm seeing all of them taking over 300ms and many 
 of them taking 1500ms or more!

That's strange, I would not have thought that listen/notify behavior
would change at all.  How are you measuring this delay exactly?
Can you put together a self-contained test case?

regards, tom lane



Re: [PERFORM] LISTEN / NOTIFY performance in 8.3

2008-02-23 Thread Joel Stevenson

At 11:58 PM -0500 2/23/08, Tom Lane wrote:

Joel Stevenson [EMAIL PROTECTED] writes:

 That's strange, I would not have thought that listen/notify behavior
 would change at all.  How are you measuring this delay exactly?
 Can you put together a self-contained test case?



 Attached is a perl script that sort of simulates what's going on.


Thanks for the script.  It's not showing any particular problems here,
though.  With log_min_duration_statement = 10, the only statements that
(slightly) exceed 10ms are the select count(*) from generate_series(1,
15000) ones.


 Also of note, the iowait percentages on this quad-core Linux box jump
 to 30-40% while this test script is running, even though there's no
 table activity involved and the producers and consumers pause for up to a
 second between iterations.


This sounds a bit like pg_listener has gotten bloated.  Try a VACUUM
VERBOSE pg_listener (as superuser) and see what it says.


At the moment (server is inactive):

pcdb=# VACUUM VERBOSE pg_listener;
INFO:  vacuuming pg_catalog.pg_listener
INFO:  pg_listener: removed 1 row versions in 1 pages
INFO:  pg_listener: found 1 removable, 21 nonremovable row versions in 28 pages

DETAIL:  0 dead row versions cannot be removed yet.
There were 2319 unused item pointers.
28 pages contain useful free space.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
VACUUM

running the test script and then the above command:

pcdb=# VACUUM VERBOSE pg_listener;
INFO:  vacuuming pg_catalog.pg_listener
INFO:  pg_listener: removed 693 row versions in 12 pages
INFO:  pg_listener: found 693 removable, 21 nonremovable row versions in 28 pages

DETAIL:  0 dead row versions cannot be removed yet.
There were 2308 unused item pointers.
28 pages contain useful free space.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
VACUUM

Numerous notifications took 1000ms or so to complete in the test 
script execution between those two vacuum runs.



If that is the problem then the next question is why it got so much more
bloated than you were used to --- something wrong with vacuuming
procedures, perhaps?


I have autovacuum on and using default settings.  I have an explicit 
vacuum routine that runs nightly over the whole DB.


-Joel
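A quick way to check whether autovacuum is actually reaching pg_listener between the nightly runs (statistics columns as named in 8.3):

```sql
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
  FROM pg_stat_all_tables
 WHERE relname = 'pg_listener';

SHOW autovacuum;
```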
