Re: [PERFORM] store A LOT of 3-tuples for comparisons
Matthew wrote:
> On Fri, 22 Feb 2008, Moritz Onken wrote:
>> I thought of doing all the inserts without having an index and without
>> doing the check whether the row is already there. After that I'd do a
>> "group by" and count(*) on that table. Is this a good idea?
>
> That sounds like the fastest way to do it, certainly.

Yeah, I would load the data into a temporary three-column table and then

    INSERT INTO mydatatable
        SELECT w1, w2, w3, count(*)
        FROM temp_table          -- the staging table loaded above
        GROUP BY w1, w2, w3;

then

    CREATE UNIQUE INDEX idx_unique_data ON mydatatable (w1, w2, w3);

if you plan to continue adding to and using the data.

If this is to be an ongoing data collection (with data being added slowly from here on) I would probably set up a trigger to maintain the count column.

I am also wondering about the ordering and whether that matters. Can you have "he", "can", "drink" as well as "drink", "he", "can", and should they be considered the same? If so, you will need a different tactic.

--
Shane Ambler
pgSQL (at) Sheeky (dot) Biz

Get Sheeky @ http://Sheeky.Biz
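For the trigger approach mentioned above, a minimal sketch of what such a trigger could look like (assuming mydatatable has columns w1, w2, w3, and count; the function and trigger names are hypothetical):

    -- Bump count if the tuple already exists, otherwise let the INSERT through:
    CREATE OR REPLACE FUNCTION bump_tuple_count() RETURNS trigger AS $$
    BEGIN
        UPDATE mydatatable SET count = count + 1
         WHERE w1 = NEW.w1 AND w2 = NEW.w2 AND w3 = NEW.w3;
        IF FOUND THEN
            RETURN NULL;      -- row already present; suppress the INSERT
        END IF;
        NEW.count := 1;       -- first occurrence of this tuple
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER trg_bump_tuple_count
        BEFORE INSERT ON mydatatable
        FOR EACH ROW EXECUTE PROCEDURE bump_tuple_count();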
Re: [PERFORM] Q on views and performance
On 2008-02-22 12:49, Kynn Jones wrote:
> Of course, I expect that using views V<int1> and V<int2>... would result
> in a loss in performance relative to a version that used bona fide tables
> T<int1> and T<int2>. My question is, how can I minimize this performance
> loss?

That used to be my expectation too, but I have found over the years that the PostgreSQL execution planner is able to "flatten" SELECTs that use VIEWs, ALMOST ALWAYS in a way that does not adversely affect performance, and often in a way that IMPROVES performance, probably because by using VIEWs I am stating the query problem in a better way than if I tried to guess the best way to optimize a SELECT myself.

I have at least a 10:1 ratio of VIEWs to TABLEs. Occasionally, with some query that is slow, I will try to rewrite it without VIEWs. This ALMOST NEVER results in an improvement in performance, and when it does, I am able to find another way to write the VIEW and the SELECT to recapture the gain.

--
Dean

Mail to my list address MUST be sent via the mailing list. All other mail to my list address will bounce.
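A quick way to see the flattening Dean describes is to EXPLAIN a query against a view; the following sketch uses made-up table and view names, but the pattern applies generally:

    CREATE TABLE t (type integer NOT NULL, zipk integer NOT NULL, word text NOT NULL);
    CREATE VIEW v1 AS SELECT * FROM t WHERE type = 1;

    -- The planner expands v1 inline, so this is planned exactly as if it
    -- had been written: SELECT word FROM t WHERE type = 1 AND zipk = 42;
    EXPLAIN SELECT word FROM v1 WHERE zipk = 42;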
[PERFORM] Q on views and performance
Hi. I'm trying to optimize the performance of a database whose main purpose is to support two (rather similar) kinds of queries. The first kind, which is expected to be the most common (I estimate it will account for about 90% of all the queries performed on this DB), has the following general structure:

(Q1)
    SELECT a1.word, a2.word
      FROM T a1 JOIN T a2 USING ( zipk )
     WHERE a1.type = <int1> AND a2.type = <int2>;

...where <int1> and <int2> stand for some two integers. In English, this query essentially executes an inner join between two "virtual subtables" of table T, which are defined by the value of the type column. For brevity, I will refer to these (virtual) subtables as T<int1> and T<int2>.

(I should point out that T holds about 2 million records, spread roughly evenly over about 100 values of the type column. So each of these virtual subtables has about 20K records. Also, for practical purposes T may be regarded as an immutable, read-only table, since it gets re-built from scratch about once a month. And, FWIW, all the columns mentioned in this post have a NOT NULL constraint.)

The second form is similar to the first, except that now the join is taken between T and T<int2>:

(Q2)
    SELECT a1.word, a2.word
      FROM T a1 JOIN T a2 USING ( zipk )
     WHERE a2.type = <int2>;

(Both of the forms above are somewhat oversimplified relative to the actual situation; in our actual application, the joins are actually left outer ones, and each query also involves an additional inner join with another table, S. For the sake of completeness, I give the "real-world" versions of these queries at the end of this post, but I think that for the purpose of my question the additional complications they entail can be neglected.)

One way to speed up (Q1) would be to break T into its subtables, i.e. to create T1, T2, T3, ..., T100 as bona fide tables. Then the query would become a simple join without the two type conditions of the original's WHERE clause, which I figure should make it noticeably faster. But since the second kind of query (Q2) requires T, we can't get rid of that table, so all the data would need to be stored twice: once in T and once in some T<int>.

In trying to come up with a way around this duplication, it occurred to me that instead of creating tables T1, T2, etc., I could create the analogous views V1, V2, etc. (e.g. CREATE VIEW V1 AS SELECT * FROM T WHERE type = 1). With this design, the two queries above would become:

(Q1*)
    SELECT V<int1>.word, V<int2>.word
      FROM V<int1> JOIN V<int2> USING ( zipk );

(Q2*)
    SELECT T.word, V<int2>.word
      FROM T JOIN V<int2> USING ( zipk );

Of course, I expect that using views V<int1> and V<int2>... would result in a loss in performance relative to a version that used bona fide tables T<int1> and T<int2>. My question is, how can I minimize this performance loss?

More specifically, how can I go about building table T and the views V1, V2, ... to maximize the performance of (Q1*)? For example, I'm thinking that if T had an additional id column and were built in such a way that all the records belonging to each V<int> were physically contiguous, and (say) had contiguous values in the id column, then I could define each view like this:

    CREATE VIEW V<int> AS SELECT * FROM T WHERE <id1> <= id AND id < <id2>;

So my question is, what design would make querying V1, V2, V3, ... as fast as possible? Is it possible to approach the performance of the design that uses bona fide tables T1, T2, T3, ... instead of views V1, V2, V3, ...?

Thank you very much for reading this long post, and many thanks in advance for your comments!

Kynn

P.S. Here are the actual forms of the queries.
They now include an initial join with table S, and the join with T (or V<int2>) is a left outer join. Interestingly, even though the queries below that use views (i.e. Q1*** and Q2***) are not much more complex-looking than before, the other two (Q1** and Q2**) are. I don't know if this is because of my ineptitude with SQL, but I am not able to render (Q1**) and (Q2**) without resorting to the subquery sq.

(Q1**)
    SELECT a1.word, sq.word
      FROM S
      JOIN T a1 USING ( word )
      LEFT JOIN ( SELECT * FROM T a2 WHERE a2.type = <int2> ) sq USING ( zipk )
     WHERE a1.type = <int1>;

(Q2**)
    SELECT a1.word, sq.word
      FROM S
      JOIN T a1 USING ( word )
      LEFT JOIN ( SELECT * FROM T a2 WHERE a2.type = <int2> ) sq USING ( zipk );

(Q1***)
    SELECT V<int1>.word, V<int2>.word
      FROM S
      JOIN V<int1> USING ( word )
      LEFT JOIN V<int2> USING ( zipk );

(Q2***)
    SELECT T.word, V<int2>.word
      FROM S
      JOIN T USING ( word )
      LEFT JOIN V<int2> USING ( zipk );
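For concreteness, a minimal sketch of the schema this post describes — the column types here are guesses, since the post only names the word, zipk, and type columns, and the index is just one plausible choice:

    CREATE TABLE T (
        type integer NOT NULL,
        zipk integer NOT NULL,
        word text    NOT NULL
    );

    -- An index matching the join/filter pattern of (Q1) and (Q2):
    CREATE INDEX t_type_zipk_idx ON T (type, zipk);

    -- One view per type value, as described above:
    CREATE VIEW V1 AS SELECT * FROM T WHERE type = 1;
    CREATE VIEW V2 AS SELECT * FROM T WHERE type = 2;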
[PERFORM] loading same instance of dump to two different servers simultaneously?
Hi - I'm wondering if anyone has had success doing a simultaneous load of one Pg dump to two different servers? The load command is actually run from two different workstations, but reading the same pgdump file.

We use this command from the command line (Solaris 10 OS):

    uncompress -c pgdump-filename.Z | psql -h pgserver-A pg-dbname

and, likewise, wonder if we can run the same command on another workstation, but reading the SAME 'pgdump-filename.Z' to load onto ANOTHER server ('pgserver-B'), i.e.:

    uncompress -c pgdump-filename.Z | psql -h pgserver-A pg-dbname

Thanks for any advice.

Susan Russo
[PERFORM] CORRECTION to msg 'loading same instance of dump to two different servers simultaneously'
SORRY - these are the commands (i.e. pgserver-A and pgserver-B)

==

Hi - I'm wondering if anyone has had success doing a simultaneous load of one Pg dump to two different servers? The load command is actually run from two different workstations, but reading the same pgdump file.

We use this command from the command line (Solaris 10 OS):

    uncompress -c pgdump-filename.Z | psql -h pgserver-A pg-dbname

and, likewise, wonder if we can run the same command on another workstation, but reading the SAME 'pgdump-filename.Z' to load onto ANOTHER server ('pgserver-B'), i.e.:

    uncompress -c pgdump-filename.Z | psql -h pgserver-B pg-dbname

=

S
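For reference, a sketch of what the two loads would look like launched concurrently from a single shell rather than from two workstations (this assumes both servers are reachable from one machine and the dump file is on storage visible to it):

    # Each pipeline opens and reads the dump file independently,
    # so the two restores share no state:
    uncompress -c pgdump-filename.Z | psql -h pgserver-A pg-dbname &
    uncompress -c pgdump-filename.Z | psql -h pgserver-B pg-dbname &
    wait    # block until both restores have finished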
Re: [PERFORM] store A LOT of 3-tuples for comparisons
On Fri, 22 Feb 2008, Moritz Onken wrote:
> I need to store a lot of 3-tuples of words (e.g. "he", "can", "drink"),
> order matters! The source is about 4 GB of these 3-tuples. I need to
> store them in a table and check whether one of them is already stored,
> and if that's the case to increment a column named "count" (or
> something).

My suggestion would be to use three varchar columns to store the 3-tuples. You should then create a B-tree index on the three columns together.

> I thought of doing all the inserts without having an index and without
> doing the check whether the row is already there. After that I'd do a
> "group by" and count(*) on that table. Is this a good idea?

That sounds like the fastest way to do it, certainly.

Matthew

--
"We have always been quite clear that Win95 and Win98 are not the systems to use if you are in a hostile security environment." "We absolutely do recognize that the Internet is a hostile environment." - Paul Leach <[EMAIL PROTECTED]>
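A minimal sketch of the suggestion above (the table and index names are made up for illustration):

    CREATE TABLE tuples (
        w1 varchar NOT NULL,
        w2 varchar NOT NULL,
        w3 varchar NOT NULL
    );

    -- A single B-tree index covering all three columns together:
    CREATE INDEX tuples_w1_w2_w3_idx ON tuples (w1, w2, w3);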
[PERFORM] store A LOT of 3-tuples for comparisons
Hi,

I need to store a lot of 3-tuples of words (e.g. "he", "can", "drink") - order matters! The source is about 4 GB of these 3-tuples. I need to store them in a table and check whether one of them is already stored, and if that's the case, to increment a column named "count" (or something).

I thought of doing all the inserts without having an index and without doing the check whether the row is already there. After that I'd do a "group by" and count(*) on that table. Is this a good idea?

I don't know much about Pg's data types. I'd try to use the varchar type, but maybe there is a better data type? What kind of index should I use?

This is for scientific research.

Thanks in advance

moritz
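A sketch of the load-then-aggregate plan described above (the staging table, result table, and file path are made up for illustration; COPY ... FROM reads a file on the server and requires superuser rights):

    -- Bulk-load all 3-tuples into an index-free staging table:
    CREATE TABLE staging (w1 varchar, w2 varchar, w3 varchar);
    COPY staging (w1, w2, w3) FROM '/path/to/tuples.dat';

    -- Then collapse duplicates into one row per tuple with its count:
    CREATE TABLE counted AS
        SELECT w1, w2, w3, count(*) AS count
          FROM staging
         GROUP BY w1, w2, w3;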
Re: [PERFORM] 7 hrs for a pg_restore?
Tom Lane writes:
> Guillaume Cottenceau <[EMAIL PROTECTED]> writes:
>> I have made a comparison restoring a production dump with default
>> and large maintenance_work_mem. The speedup here is only about 5%
>> (12'30 => 11'50).
>>
>> Apparently, on the restored database, data is 1337 MB[1] and
>> indexes 644 MB[2]. Pg is 8.2.3, checkpoint_segments 3,
>> maintenance_work_mem default (16MB) then 512MB, shared_buffers
>> 384MB. The disks are rather slow (Dell's LSI Logic RAID1); hdparm
>> reports 82 MB/sec for reads.
>
> The main thing that jumps out at me is that boosting checkpoint_segments
> would probably help. I tend to set it to 30 or so (note that this
> corresponds to about 1GB taken up by pg_xlog).

Interestingly, from a bzipped dump there is no win; however, from an uncompressed dump, increasing checkpoint_segments from 3 to 30 decreases the clock time from 9'50 to 8'30 (about 14%, if I'm correct).

--
Guillaume Cottenceau
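For reference, a sketch of how the settings discussed in this thread would look in postgresql.conf (the values are the ones mentioned above, for PostgreSQL 8.2; note that changing shared_buffers requires a server restart):

    # postgresql.conf
    checkpoint_segments  = 30       # up from the default of 3; ~1GB of pg_xlog
    maintenance_work_mem = 512MB    # up from the 16MB default; helps index builds
    shared_buffers       = 384MB    # as used in the tests above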