Re: [PERFORM] store A LOT of 3-tuples for comparisons

2008-02-22 Thread Shane Ambler

Matthew wrote:

On Fri, 22 Feb 2008, Moritz Onken wrote:


I thought of doing all the inserts without having an index and without 
doing the check whether the row is already there. After that I'd do a 
"group by" and count(*) on that table. Is this a good idea?


That sounds like the fastest way to do it, certainly.


Yeah, I would load the data into a temp 3-column table (say, temp_load)
and then
  INSERT INTO mydatatable SELECT w1, w2, w3, count(*) FROM temp_load GROUP BY w1, w2, w3;
then
  CREATE UNIQUE INDEX idx_unique_data ON mydatatable (w1, w2, w3);
if you plan to continue adding to and using the data.

If this is to be an ongoing data collection (with data being added
slowly from here) I would probably set up a trigger to update the count
column.
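
One possible shape for that trigger -- a sketch only, assuming the count
column has DEFAULT 1 and ignoring the race between concurrent inserters:

  CREATE OR REPLACE FUNCTION bump_count() RETURNS trigger AS $$
  BEGIN
      -- try to increment an existing row first
      UPDATE mydatatable SET count = count + 1
       WHERE w1 = NEW.w1 AND w2 = NEW.w2 AND w3 = NEW.w3;
      IF FOUND THEN
          RETURN NULL;  -- duplicate tuple: counted, so suppress the INSERT
      END IF;
      RETURN NEW;       -- first occurrence: let the INSERT proceed
  END;
  $$ LANGUAGE plpgsql;

  CREATE TRIGGER mydatatable_bump_count
      BEFORE INSERT ON mydatatable
      FOR EACH ROW EXECUTE PROCEDURE bump_count();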



I am also wondering about the ordering and whether that matters.
Can you have "he", "can", "drink" as well as "drink", "he", "can",
and should they be considered the same? If so, you will need a
different tactic.




--

Shane Ambler
pgSQL (at) Sheeky (dot) Biz

Get Sheeky @ http://Sheeky.Biz



Re: [PERFORM] Q on views and performance

2008-02-22 Thread Dean Gibson (DB Administrator)

On 2008-02-22 12:49, Kynn Jones wrote:
Of course, I expect that using views V<int1>, V<int2>, ... would
result in a loss in performance relative to a version that used bona
fide tables T<int1> and T<int2>.  My question is, how can I minimize
this performance loss?


That used to be my thoughts too, but I have found over the years that 
the PostgreSQL execution planner is able to "flatten" SELECTs using 
VIEWs, ALMOST ALWAYS in a way that does not adversely affect 
performance, and often gives an IMPROVEMENT in performance, probably 
because by using VIEWs I am stating the query problem in a better way 
than if I try to guess the best way to optimize a SELECT.
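
A quick (hypothetical) way to see the flattening, using the T/V naming
from Kynn's post -- the planner inlines the view, so both EXPLAINs below
produce the same plan:

  CREATE VIEW v1 AS SELECT * FROM t WHERE type = 1;

  EXPLAIN SELECT word FROM v1 WHERE zipk = 42;
  -- same plan as:
  EXPLAIN SELECT word FROM t WHERE type = 1 AND zipk = 42;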


I have at least a 10:1 ratio of VIEWs to TABLEs.  Occasionally, with 
some query that is slow, I will try to rewrite it without VIEWs.  This 
ALMOST NEVER results in an improvement in performance, and when it does, 
I am able to find another way to write the VIEW and SELECT to recapture 
the gain.


-- Dean

--
Mail to my list address MUST be sent via the mailing list.
All other mail to my list address will bounce.




[PERFORM] Q on views and performance

2008-02-22 Thread Kynn Jones
Hi.  I'm trying to optimize the performance of a database whose main purpose
is to support two (rather similar) kinds of queries.  The first kind, which
is expected to be the most common (I estimate it will account for about 90%
of all the queries performed on this DB), has the following general
structure:

(Q1)   SELECT a1.word, a2.word
         FROM T a1 JOIN T a2 USING ( zipk )
        WHERE a1.type = <int1>
          AND a2.type = <int2>;

...where <int1> and <int2> stand for some two integers.  In English, this
query essentially executes an inner join between two "virtual subtables" of
table T, which are defined by the value of the type column.  For brevity, I
will refer to these (virtual) subtables as T<int1> and T<int2>.  (I should
point out that T holds about 2 million records, spread roughly evenly over
about 100 values of the type column.  So each of these virtual subtables has
about 20K records.  Also, for practical purposes T may be regarded as an
immutable, read-only table, since it gets re-built from scratch about once a
month.  And, FWIW, all the columns mentioned in this post have a NOT NULL
constraint.)
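
For concreteness, the layout just described amounts to something like this
sketch (the exact column types are not important for the question):

  CREATE TABLE t (
      type int  NOT NULL,  -- ~100 distinct values
      zipk int  NOT NULL,  -- the join key
      word text NOT NULL
  );
  -- ~2 million rows, i.e. roughly 20K rows per value of type.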

The second form is similar to the first, except that now the join is taken
between T and T<int2>:

(Q2)   SELECT a1.word, a2.word
         FROM T a1 JOIN T a2 USING ( zipk )
        WHERE a2.type = <int2>;

(Both the forms above are somewhat oversimplified relative to the actual
situation; in our actual application, the joins are actually left outer
ones, and each query also involves an additional inner join with another
table, S.  For the sake of completeness, I give the "real-world" versions of
these queries at the end of this post, but I think that for the purpose of
my question, the additional complications they entail can be neglected.)

One way to speed up (Q1) would be to break T into its subtables, i.e. to
create T1, T2, T3, ..., T100 as bona fide tables.  Then the query would
become a simple join without the two type conditions of the original's WHERE
clause, which I figure should make it noticeably faster.
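
A sketch of that alternative:

  CREATE TABLE T1 AS SELECT zipk, word FROM T WHERE type = 1;
  -- ...and likewise for T2 through T100, after which (Q1) reduces to:
  SELECT a1.word, a2.word FROM T1 a1 JOIN T2 a2 USING ( zipk );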

But since the second kind of query (Q2) requires T, we can't get rid of this
table, so all the data would need to be stored twice, once in T and once in
some T<int>.

In trying to come up with a way around this duplication, it occurred to me
that instead of creating tables T1, T2, etc., I could create the analogous
views V1, V2, etc.  (e.g. CREATE VIEW V1 AS SELECT * FROM T WHERE type = 1).
 With this design, the two queries above would become

(Q1*)  SELECT V<int1>.word, V<int2>.word
         FROM V<int1> JOIN V<int2> USING ( zipk );

(Q2*)  SELECT T.word, V<int2>.word
         FROM T JOIN V<int2> USING ( zipk );

Of course, I expect that using views V<int1>, V<int2>, ... would result in
a loss in performance relative to a version that used bona fide tables
T<int1> and T<int2>.  My question is, how can I minimize this performance
loss?

More specifically, how can I go about building table T and the views
V<int>'s to maximize the performance of (Q1)?  For example, I'm thinking
that if T had an additional id column and were built in such a way that all
the records belonging to each V<int> were physically contiguous, and (say)
had contiguous values in the id column, then I could define each view like
this

  CREATE VIEW V<int> AS SELECT * FROM T
   WHERE <start-id> <= id AND id < <end-id>;
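
One way to build T like that (a sketch; the nextval-over-a-sorted-subquery
trick is not guaranteed by the standard, but works in practice):

  CREATE SEQUENCE t_id_seq;
  CREATE TABLE t_ordered AS
      SELECT nextval('t_id_seq') AS id, x.*
        FROM ( SELECT type, zipk, word FROM T ORDER BY type ) x;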

So my question is, what design would make querying V1, V2, V3 ... as fast as
possible?  Is it possible to approach the performance of the design that
uses bona fide tables T1, T2, T3, ... instead of views V1, V2, V3 ...?

Thank you very much for reading this long post, and many thanks in advance
for your comments!

Kynn


P.S.  Here are the actual forms of the queries.  They now include an
initial join with table S, and the join with T<int2> (or V<int2>) is a left
outer join.  Interestingly, even though the queries below that use views
(i.e. Q1*** and Q2***) are not much more complex-looking than before, the
other two (Q1** and Q2**) are.  I don't know if this is because of my
ineptitude with SQL, but I am not able to render (Q1**) and (Q2**) without
resorting to the subquery sq.

(Q1**)  SELECT a1.word, sq.word FROM
           S  JOIN T a1 USING ( word )
         LEFT JOIN ( SELECT * FROM T a2
                     WHERE a2.type = <int2> ) sq USING ( zipk )
         WHERE a1.type = <int1>;

(Q2**)  SELECT a1.word, sq.word FROM
           S  JOIN T a1 USING ( word )
         LEFT JOIN ( SELECT * FROM T a2
                     WHERE a2.type = <int2> ) sq USING ( zipk );

   ----------

(Q1***) SELECT V<int1>.word, V<int2>.word FROM
           S  JOIN V<int1> USING ( word )
         LEFT JOIN V<int2> USING ( zipk );

(Q2***) SELECT T.word, V<int2>.word
          FROM S  JOIN T       USING ( word )
         LEFT JOIN V<int2>     USING ( zipk );


[PERFORM] loading same instance of dump to two different servers simultaneously?

2008-02-22 Thread Susan Russo

Hi -
I'm wondering if anyone has had success doing a simultaneous
load of one Pg dump to two different servers?  The load command
is actually run from two different workstations, but reading the
same pgdump-file.

We use this command from the command line (Solaris-10 OS):

uncompress -c pgdump-filename.Z | psql -h pgserver-A  pg-dbname

and, likewise, we wonder if we can run the same command on another
workstation, but reading the SAME 'pgdump-filename.Z'
to load onto ANOTHER server ('pgserver-B'), i.e.:

uncompress -c pgdump-filename.Z | psql -h pgserver-B pg-dbname

Thanks for any advice.
Susan Russo




[PERFORM] CORRECTION to msg 'loading same instance of dump to two different servers simultaneously'

2008-02-22 Thread Susan Russo

SORRY -

these are the commands (i.e. pgserver-A and pgserver-B):

==

uncompress -c pgdump-filename.Z | psql -h pgserver-A pg-dbname

uncompress -c pgdump-filename.Z | psql -h pgserver-B pg-dbname

==
S




Re: [PERFORM] store A LOT of 3-tuples for comparisons

2008-02-22 Thread Matthew

On Fri, 22 Feb 2008, Moritz Onken wrote:
I need to store a lot of 3-tuples of words (e.g. "he", "can", "drink"), order 
matters!

The source is about 4 GB of these 3-tuples.
I need to store them in a table and check whether one of them is already 
stored, and if that's the case to increment a column named "count" (or 
something).


My suggestion would be to use three varchar columns to store the 3-tuples. 
You should then create a B-tree index on the three columns together.
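
For example, with hypothetical names:

  CREATE TABLE tuples (
      w1 varchar NOT NULL,
      w2 varchar NOT NULL,
      w3 varchar NOT NULL
  );
  CREATE INDEX tuples_w_idx ON tuples (w1, w2, w3);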


I thought of doing all the inserts without having an index and without doing 
the check whether the row is already there. After that I'd do a "group by" 
and count(*) on that table. Is this a good idea?


That sounds like the fastest way to do it, certainly.

Matthew

--
"We have always been quite clear that Win95 and Win98 are not the systems to
use if you are in a hostile security environment." "We absolutely do recognize
that the Internet is a hostile environment." Paul Leach <[EMAIL PROTECTED]>



[PERFORM] store A LOT of 3-tuples for comparisons

2008-02-22 Thread Moritz Onken

Hi,

I need to store a lot of 3-tuples of words (e.g. "he", "can",  
"drink"), order matters!

The source is about 4 GB of these 3-tuples.
I need to store them in a table and check whether one of them is  
already stored, and if that's the case to increment a column named  
"count" (or something).


I thought of doing all the inserts without having an index and without  
doing the check whether the row is already there. After that I'd do a  
"group by" and count(*) on that table. Is this a good idea?


I don't know much about Pg's data types. I'd try to use the varchar
type. But maybe there is a better data type?

What kind of index should I use?

This is for scientific research.

Thanks in advance

moritz




Re: [PERFORM] 7 hrs for a pg_restore?

2008-02-22 Thread Guillaume Cottenceau
Tom Lane writes:

> Guillaume Cottenceau <[EMAIL PROTECTED]> writes:
>> I have made a comparison restoring a production dump with default
>> and large maintenance_work_mem. The improvement here is only about
>> 5% (12'30 => 11'50).
>
>> Apparently, on the restored database, data is 1337 MB[1] and
>> indexes 644 MB[2]. Pg is 8.2.3, checkpoint_segments 3,
>> maintenance_work_mem default (16MB) then 512MB, shared_buffers
>> 384MB. These are rather slow disks (Dell's LSI Logic RAID1); hdparm
>> reports 82 MB/sec for reads.
>
> The main thing that jumps out at me is that boosting checkpoint_segments
> would probably help.  I tend to set it to 30 or so (note that this
> corresponds to about 1GB taken up by pg_xlog).

Interestingly, from a bzipped dump, there is no win; however,
from an uncompressed dump, increasing checkpoint_segments from 3
to 30 decreases clock time from 9'50 to 8'30 (a decrease of about
14%).
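
For reference, the non-default postgresql.conf settings in the faster run
would look something like this (a sketch based on the figures above):

  checkpoint_segments  = 30     # ~1GB of pg_xlog, as Tom suggests
  maintenance_work_mem = 512MB  # default is 16MB
  shared_buffers       = 384MB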

-- 
Guillaume Cottenceau
