Re: [PERFORM] Issues with \copy from file
I hope the issue is still open though I haven't replied to it before. Euler mentioned that I did not provide any details about my system. I'm using version 8.3 with most settings at default on an old machine with 2 GB of memory. The table definition is simple, four columns: id, value, x, y, where id is the primary key and x, y are combined into an index. I'm not sure if it matters, but unlike Euler's suggestion I'm using \copy instead of COPY.

Regarding my comparison to MySQL, it is completely valid. This is done on the same computer, using the same disk on the same platform. From that I would derive that IO is not my problem, unless postgresql is doing IO twice while MySQL does it only once. I guess my tables are InnoDB since that is the default type (or so I think). BEGIN/COMMIT I did not find changed much. Are there any other suggestions?

My postgres.conf:

#--
# RESOURCE USAGE (except WAL)
#--

# - Memory -

shared_buffers = 16MB                   # min 128kB or max_connections*16kB
temp_buffers = 16MB                     # min 800kB
#max_prepared_transactions = 5          # can be 0 or more
# Note: Increasing max_prepared_transactions costs ~600 bytes of shared memory
# per transaction slot, plus lock space (see max_locks_per_transaction).
work_mem = 128MB                        # min 64kB
maintenance_work_mem = 128MB            # min 1MB
#max_stack_depth = 2MB                  # min 100kB

# - Free Space Map -

max_fsm_pages = 2097152                 # min max_fsm_relations*16, 6 bytes each
max_fsm_relations = 500                 # min 100, ~70 bytes each

# - Kernel Resource Usage -

#max_files_per_process = 1000           # min 25
#shared_preload_libraries = ''          # (change requires restart)

# - Cost-Based Vacuum Delay -

#vacuum_cost_delay = 0                  # 0-1000 milliseconds
#vacuum_cost_page_hit = 1               # 0-10000 credits
#vacuum_cost_page_miss = 10             # 0-10000 credits
#vacuum_cost_page_dirty = 20            # 0-10000 credits
#vacuum_cost_limit = 200                # 1-10000 credits

#--
# WRITE AHEAD LOG
#--

# - Settings -

#fsync = on                             # turns forced synchronization on or off
#synchronous_commit = on                # immediate fsync at commit
#wal_sync_method = fsync                # the default is the first option
#full_page_writes = on                  # recover from partial page writes
#wal_buffers = 64kB                     # min 32kB
#wal_writer_delay = 200ms               # 1-10000 milliseconds
#commit_delay = 0                       # range 0-100000, in microseconds
#commit_siblings = 5                    # range 1-1000

# - Checkpoints -

checkpoint_segments = 64                # in logfile segments, min 1, 16MB each
#checkpoint_timeout = 5min              # range 30s-1h
#checkpoint_completion_target = 0.9     # checkpoint target duration, 0.0 - 1.0
#checkpoint_warning = 30s               # 0 is off

# - Archiving -

#archive_mode = off                     # allows archiving to be done
#archive_command = ''                   # command to use to archive a logfile segment
#archive_timeout = 0                    # force a logfile segment switch after this

#--
# AUTOVACUUM PARAMETERS
#--

autovacuum = on                         # Enable autovacuum subprocess?  'on'

2009/10/19 Matthew Wakeling matt...@flymine.org:
> On Sun, 18 Oct 2009, Scott Marlowe wrote:
>> You can only write data then commit it so fast to one drive, and that
>> speed is usually somewhere in the megabyte per second range. 450+150 in
>> 5 minutes is 120 Megs per second, that's pretty fast, but is likely the
>> max speed of a modern super fast 15k rpm drive. If it's taking 20
>> minutes then it's 30 Megs per second which is still really good if
>> you're in the middle of a busy afternoon and the db has other things
>> to do.
>
> You're out by a factor of 60. That's minutes, not seconds.
>
> More relevant is the fact that Postgres will normally log changes in the
> WAL, effectively writing the data twice. As Euler said, the trick is to
> tell Postgres that no one else will need to see the data, so it can skip
> the WAL step:
>
> BEGIN;
> TRUNCATE TABLE foo;
> COPY foo FROM ...;
> COMMIT;
>
> I see upward of 100MB/s over here when I do this.
>
> Matthew
>
> --
> Patron: I am looking for a globe of the earth.
> Librarian: We have a table-top model over here.
> Patron: No, that's not good enough. Don't you have a life-size?
> Librarian: (pause) Yes, but it's in use right now.

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Automatic optimization of IN clauses via INNER JOIN
On Thu, Dec 17, 2009 at 9:20 PM, Craig Ringer cr...@postnewspapers.com.au wrote:
> On 17/12/2009 11:57 PM, Tom Lane wrote:
>> Thomas Hamilton thomashamilto...@yahoo.com writes:
>>> But in our testing under the same optimization and conditions INNER
>>> JOIN is significantly outperforming IN.
>>
>> [ shrug... ] You haven't provided any details, so it's impossible to
>> offer any useful advice.
>
> In other words: can we discuss this with reference to a specific case?
> Please provide your queries, your EXPLAIN ANALYZE output, and other
> relevant details as per:
>
> http://wiki.postgresql.org/wiki/SlowQueryQuestions
>
> I'd be interested myself in knowing whether the planner can perform such
> transformations, and if so why it doesn't. I have the vague feeling there
> may be semantic differences in the handling of NULL, but I can't
> currently seem to puzzle them out.

NOT IN is the only one that really kills you as far as optimization is concerned. IN can be transformed to a join. NOT IN forces a NOT (subplan)-type plan, which bites - hard.

...Robert
Re: [PERFORM] Automatic optimization of IN clauses via INNER JOIN
On Fri, Dec 18, 2009 at 2:18 PM, Robert Haas robertmh...@gmail.com wrote:
> NOT IN is the only one that really kills you as far as optimization is
> concerned. IN can be transformed to a join. NOT IN forces a
> NOT (subplan)-type plan, which bites - hard.

In a well designed database (read: not abusing NULLs) it can be done with joins too.

--
GJ
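For reference, the hand rewrite GJ alludes to can be sketched as follows. The table and column names here are hypothetical (borrowed from the \copy thread elsewhere in this digest purely for illustration):

```sql
-- NOT IN form: on 8.x this forces a NOT (subplan)-type plan.
SELECT d.*
FROM downloads d
WHERE d.file_id NOT IN (SELECT id FROM files);

-- Hand-rewritten anti-join form, valid only when files.id is known
-- to be NOT NULL: unmatched downloads rows come back with f.id NULL.
SELECT d.*
FROM downloads d
LEFT JOIN files f ON f.id = d.file_id
WHERE f.id IS NULL;
```

The two forms are not equivalent in general: if the subselect can produce a NULL, the NOT IN version returns no rows at all, which is the NULL subtlety discussed further downthread.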
Re: [PERFORM] Issues with \copy from file
On Fri, Dec 18, 2009 at 7:46 AM, Sigurgeir Gunnarsson sgunn...@gmail.com wrote:
> I hope the issue is still open though I haven't replied to it before.
> Euler mentioned that I did not provide any details about my system. I'm
> using version 8.3 with most settings at default on an old machine with
> 2 GB of memory. The table definition is simple, four columns: id, value,
> x, y, where id is the primary key and x, y are combined into an index.
> I'm not sure if it matters, but unlike Euler's suggestion I'm using \copy
> instead of COPY.
>
> Regarding my comparison to MySQL, it is completely valid. This is done on
> the same computer, using the same disk on the same platform. From that I
> would derive that IO is not my problem, unless postgresql is doing IO
> twice while MySQL does it only once. I guess my tables are InnoDB since
> that is the default type (or so I think). BEGIN/COMMIT I did not find
> changed much. Are there any other suggestions?

Did you read Matthew Wakeling's reply? Arranging to skip WAL will help a lot here. To do that, you need to either create or truncate the table in the same transaction that does the COPY.

The problem with the MySQL comparison is that it's not really relevant. It isn't that the PostgreSQL code just sucks and if we wrote it properly it would be as fast as MySQL. If that were the case, everyone would be up in arms, and it would have been fixed long ago. Rather, the problem is almost certainly that it's not an apples-to-apples comparison. MySQL is probably doing something different, such as perhaps not properly arranging for recovery if the system goes down in the middle of the copy, or just after it completes. But I don't know MySQL well enough to know exactly what the difference is, and I'm not particularly interested in spending a lot of time figuring it out. I think you'll get that reaction from others on this list as well, but of course that's up to them. Everybody here is a volunteer, and generally our interest is principally PostgreSQL.

On the other hand, we can certainly give you lots of information about what PostgreSQL is doing and why that takes the amount of time that it does, or give you information on how you can find out more about what it's doing.

...Robert
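Robert's suggestion, as a minimal sketch from a psql session (table, column, and file names are hypothetical). psql's \copy is a client-side wrapper around COPY on the same connection, so the TRUNCATE and the load can share one transaction; on 8.3 that lets COPY skip WAL, provided archive_mode is off:

```sql
BEGIN;
TRUNCATE TABLE points;   -- truncated in the same transaction ...
\copy points (id, value, x, y) from 'points.dat'   -- ... that does the load
COMMIT;                  -- data becomes visible to others only here
```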
Re: [PERFORM] Automatic optimization of IN clauses via INNER JOIN
2009/12/18 Grzegorz Jaśkiewicz gryz...@gmail.com:
> On Fri, Dec 18, 2009 at 2:18 PM, Robert Haas robertmh...@gmail.com wrote:
>> NOT IN is the only one that really kills you as far as optimization is
>> concerned. IN can be transformed to a join. NOT IN forces a
>> NOT (subplan)-type plan, which bites - hard.
>
> In a well designed database (read: not abusing NULLs) it can be done with
> joins too.

But not by PostgreSQL, or so I believe.

...Robert
Re: [PERFORM] Automatic optimization of IN clauses via INNER JOIN
2009/12/18 Robert Haas robertmh...@gmail.com:
> 2009/12/18 Grzegorz Jaśkiewicz gryz...@gmail.com:
>> On Fri, Dec 18, 2009 at 2:18 PM, Robert Haas robertmh...@gmail.com wrote:
>>> NOT IN is the only one that really kills you as far as optimization is
>>> concerned. IN can be transformed to a join. NOT IN forces a
>>> NOT (subplan)-type plan, which bites - hard.
>>
>> In a well designed database (read: not abusing NULLs) it can be done
>> with joins too.
>
> But not by PostgreSQL, or so I believe.

Using a left join?

--
GJ
[PERFORM] Idea how to get rid of Bitmap Heap Scan
Hello!

There are many questions on the internet, without answers, about whether it is possible to optimize a Bitmap Heap Scan somehow, so the problem seems rather important. The query I need to optimize is:

EXPLAIN SELECT date_trunc('day', d.created_at) AS day, COUNT(*) AS download
FROM downloads d
WHERE d.file_id IN (SELECT id FROM files WHERE owner_id = 443)
  AND d.download_status != 0
  AND d.created_at >= '2009-12-05'
  AND d.created_at < '2009-12-16'
GROUP BY 1;

                                   QUERY PLAN
--------------------------------------------------------------------------------
 HashAggregate  (cost=15809.49..17126.20 rows=87781 width=8)
   ->  Hash Semi Join  (cost=5809.51..15368.11 rows=88276 width=8)
         Hash Cond: (d.file_id = files.id)
         ->  Index Scan using idx_downloads_created_at on downloads d  (cost=0.00..7682.73 rows=88276 width=16)
               Index Cond: ((created_at >= '2009-12-05 00:00:00'::timestamp without time zone) AND (created_at < '2009-12-16 00:00:00'::timestamp without time zone))
         ->  Hash  (cost=5741.51..5741.51 rows=5440 width=8)
               ->  Bitmap Heap Scan on files  (cost=106.42..5741.51 rows=5440 width=8)
                     Recheck Cond: (owner_id = 443)
                     ->  Bitmap Index Scan on idx_files_owner  (cost=0.00..105.06 rows=5440 width=0)
                           Index Cond: (owner_id = 443)

The problem here is that we are forced to fetch files rows in the Bitmap Heap Scan. But actually there is no need for the whole files record; the only data needed is the files ids. The idea is to avoid fetching data from the files table and to get the ids from the index instead! (Probably it is a little bit tricky, but it is a performance area...) I created an index with the following command:

create index idx_files_owner_id ON files (owner_id, id);

and even tried removing the old index to force postgresql to use the newly created one. But postgresql still does a Bitmap Heap Scan. (The other idea is to use raw_id as the primary key of the files table so as not to extend the index, but I don't know whether that is possible at all or whether the idea has some drawbacks.)

I think it would be worth teaching postgresql to do this trick, especially taking into account that there are many questions about whether such queries can be optimized. If there is a known solution to this problem, please provide a link to it.

With best regards,
Michael Mikhulya.
Re: [PERFORM] Idea how to get rid of Bitmap Heap Scan
On Fri, 18 Dec 2009, Michael N. Mikhulya wrote:
> The problem here is that we are forced to fetch files rows in the Bitmap
> Heap Scan. But actually there is no need for the whole files record; the
> only data needed is the files ids. The idea is to avoid fetching data
> from the files table and to get the ids from the index instead!

Unfortunately, the index does not contain enough information to accomplish this. This is due to Postgres' advanced concurrency control system. Postgres needs to fetch the actual rows from the files table in order to check whether each row is visible in the current transaction, and a Bitmap Index Scan is the fastest way to do this.

You can speed this up in Postgres 8.4 by having a RAID array and setting the effective_io_concurrency configuration parameter to the number of spindles in the RAID array, or by having gobs of RAM and keeping everything in cache.

Matthew

--
A good programmer is one who looks both ways before crossing a one-way street.
Considering the quality and quantity of one-way streets in Cambridge, it
should be no surprise that there are so many good programmers there.
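Matthew's 8.4 suggestion as a postgresql.conf fragment. The parameter is spelled effective_io_concurrency, and the value here assumes a hypothetical four-spindle array:

```
# postgresql.conf (8.4+): let bitmap heap scans issue several prefetch
# requests at once; set roughly to the number of drives in the array
effective_io_concurrency = 4
```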
Re: [PERFORM] Issues with \copy from file
The intention was never to talk down postgresql, but rather to get some explanation of this difference so that I could make the proper changes. After reading the link from Euler's post, which I had overlooked, I have managed to shorten the import time. My problem was with the indexes. I was able to shorten the import time of a 26 million line import from 2 hours + (I gave up after that time) down to 12 minutes by dropping the indexes after the truncate and before the copy. This is what I was expecting and I'm therefore satisfied with the result.

Regards, Sigurgeir

2009/12/18 Robert Haas robertmh...@gmail.com:
> On Fri, Dec 18, 2009 at 7:46 AM, Sigurgeir Gunnarsson sgunn...@gmail.com wrote:
>> I hope the issue is still open though I haven't replied to it before.
>> Euler mentioned that I did not provide any details about my system. I'm
>> using version 8.3 with most settings at default on an old machine with
>> 2 GB of memory. The table definition is simple, four columns: id, value,
>> x, y, where id is the primary key and x, y are combined into an index.
>> I'm not sure if it matters, but unlike Euler's suggestion I'm using
>> \copy instead of COPY.
>>
>> Regarding my comparison to MySQL, it is completely valid. This is done
>> on the same computer, using the same disk on the same platform. From
>> that I would derive that IO is not my problem, unless postgresql is
>> doing IO twice while MySQL does it only once. I guess my tables are
>> InnoDB since that is the default type (or so I think). BEGIN/COMMIT I
>> did not find changed much. Are there any other suggestions?
>
> Did you read Matthew Wakeling's reply? Arranging to skip WAL will help a
> lot here. To do that, you need to either create or truncate the table in
> the same transaction that does the COPY.
>
> The problem with the MySQL comparison is that it's not really relevant.
> It isn't that the PostgreSQL code just sucks and if we wrote it properly
> it would be as fast as MySQL. If that were the case, everyone would be up
> in arms, and it would have been fixed long ago. Rather, the problem is
> almost certainly that it's not an apples-to-apples comparison. MySQL is
> probably doing something different, such as perhaps not properly
> arranging for recovery if the system goes down in the middle of the
> copy, or just after it completes. But I don't know MySQL well enough to
> know exactly what the difference is, and I'm not particularly interested
> in spending a lot of time figuring it out. I think you'll get that
> reaction from others on this list as well, but of course that's up to
> them. Everybody here is a volunteer, and generally our interest is
> principally PostgreSQL.
>
> On the other hand, we can certainly give you lots of information about
> what PostgreSQL is doing and why that takes the amount of time that it
> does, or give you information on how you can find out more about what
> it's doing.
>
> ...Robert
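Sigurgeir's final workflow can be sketched roughly as follows. Table, index, and file names are hypothetical, and the sketch assumes 8.3's WAL skip (archive_mode off); whether to drop the primary key as well during the load is a separate judgment call:

```sql
BEGIN;
TRUNCATE TABLE points;        -- same-transaction truncate lets 8.3 skip WAL
DROP INDEX idx_points_xy;     -- no index maintenance during the load
\copy points from 'points.dat'
COMMIT;

-- One bulk index build afterwards beats 26 million incremental insertions:
CREATE INDEX idx_points_xy ON points (x, y);
ANALYZE points;
```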
Re: [PERFORM] Idea how to get rid of Bitmap Heap Scan
Thank you very much. I see now why it is done this way. But I'm curious whether it is still possible not to fetch data from the files table, given that inappropriate ids (e.g. of removed rows) would not produce any wrong effect, because they are indirectly checked against the downloads table. Here I mean that if we get an id (from the index) for a file which has actually been removed, then we will not find anything for it in the downloads table. Probably my knowledge of MVCC is too limited to see the whole picture, so if it is not too hard for you, please point out the failure scenario (where we would get a wrong result), the locking issue, ...

Michael Mikhulya

> Unfortunately, the index does not contain enough information to
> accomplish this. This is due to Postgres' advanced concurrency control
> system. Postgres needs to fetch the actual rows from the files table in
> order to check whether each row is visible in the current transaction,
> and a Bitmap Index Scan is the fastest way to do this.
>
> You can speed this up in Postgres 8.4 by having a RAID array and setting
> the effective_io_concurrency configuration parameter to the number of
> spindles in the RAID array, or by having gobs of RAM and keeping
> everything in cache.
>
> Matthew
Re: [PERFORM] Idea how to get rid of Bitmap Heap Scan
On Fri, Dec 18, 2009 at 4:18 PM, Michael N. Mikhulya m.mikhu...@gmail.com wrote:
> Thank you very much. I see now why it is done this way. But I'm curious
> whether it is still possible not to fetch data from the files table,
> given that inappropriate ids (e.g. of removed rows) would not produce any
> wrong effect, because they are indirectly checked against the downloads
> table. Here I mean that if we get an id (from the index) for a file which
> has actually been removed, then we will not find anything for it in the
> downloads table. Probably my knowledge of MVCC is too limited to see the
> whole picture, so if it is not too hard for you, please point out the
> failure scenario (where we would get a wrong result), the locking
> issue, ...

Yup, this ought to be possible and fruitful; I believe Heikki already produced a partial patch to this end. If you're interested in working on it you could skim back in the archives and start with that. I don't recall any special keywords to search on, but it might be in one of the threads for the visibility map, or it might be under index-only scans.

A word of warning: in my experience the hardest part for changes like this isn't the executor changes (which in this case wouldn't be far from easy) but the planner changes to detect when this new plan would be better.

--
greg
Re: [PERFORM] Automatic optimization of IN clauses via INNER JOIN
2009/12/18 Grzegorz Jaśkiewicz gryz...@gmail.com:
> 2009/12/18 Robert Haas robertmh...@gmail.com:
>> 2009/12/18 Grzegorz Jaśkiewicz gryz...@gmail.com:
>>> On Fri, Dec 18, 2009 at 2:18 PM, Robert Haas robertmh...@gmail.com wrote:
>>>> NOT IN is the only one that really kills you as far as optimization is
>>>> concerned. IN can be transformed to a join. NOT IN forces a
>>>> NOT (subplan)-type plan, which bites - hard.
>>>
>>> In a well designed database (read: not abusing NULLs) it can be done
>>> with joins too.
>>
>> But not by PostgreSQL, or so I believe.
>
> Using a left join?

If at least one column in the subselect is strict, you can rewrite it that way yourself, but the optimizer won't do it. I wish it did, but I don't wish it badly enough to have written the code myself, and apparently neither does anyone else.

...Robert
Re: [PERFORM] Issues with \copy from file
On Fri, Dec 18, 2009 at 10:51 AM, Sigurgeir Gunnarsson sgunn...@gmail.com wrote:
> The intention was never to talk down postgresql, but rather to get some
> explanation of this difference so that I could make the proper changes.
> After reading the link from Euler's post, which I had overlooked, I have
> managed to shorten the import time. My problem was with the indexes. I
> was able to shorten the import time of a 26 million line import from
> 2 hours + (I gave up after that time) down to 12 minutes by dropping the
> indexes after the truncate and before the copy. This is what I was
> expecting and I'm therefore satisfied with the result.

Ah ha! Well, it sounds like perhaps you have the answer to what was causing the difference, then. I'm not trying to be unhelpful, just trying to explain honestly why you might not get exactly the response you expect to MySQL comparisons - we only understand half of it.

...Robert
Re: [PERFORM] Automatic optimization of IN clauses via INNER JOIN
Robert Haas robertmh...@gmail.com writes:
> If at least one column in the subselect is strict, you can rewrite it
> that way yourself, but the optimizer won't do it. I wish it did, but I
> don't wish it badly enough to have written the code myself, and
> apparently neither does anyone else.

I was thinking about this earlier today. It's a bit of a PITA because we need the information very early in the planner, before it's done much analysis. So for example we might find ourselves duplicating the work that will happen later to determine which tables are nullable by outer joins. I think this would be all right as long as we ensure that it's only done when there's a chance for a win (ie, no extra cycles if there's not actually a NOT IN present). It could still be an unpleasantly large amount of new code, though.

Wouldn't we need to enforce that *all* columns of the subselect are non-null, rather than *any*?

regards, tom lane
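The NULL subtlety behind this non-null requirement can be seen with a toy query (the literal values are hypothetical, chosen only to illustrate):

```sql
-- 2 NOT IN (1, 3) is true, so one row comes back:
SELECT 1 AS found WHERE 2 NOT IN (VALUES (1), (3));

-- 2 NOT IN (1, NULL) expands to "2 <> 1 AND 2 <> NULL"; the second
-- comparison is unknown, so the whole predicate is unknown and the
-- WHERE clause filters the row: zero rows come back.
SELECT 1 AS found WHERE 2 NOT IN (VALUES (1), (NULL));
```

A LEFT JOIN anti-join rewrite would still return the row in the second case, which is why the transformation is only safe when every subselect column is provably non-null.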
Re: [PERFORM] Automatic optimization of IN clauses via INNER JOIN
On Fri, Dec 18, 2009 at 7:32 PM, Tom Lane t...@sss.pgh.pa.us wrote:
> Robert Haas robertmh...@gmail.com writes:
>> If at least one column in the subselect is strict, you can rewrite it
>> that way yourself, but the optimizer won't do it. I wish it did, but I
>> don't wish it badly enough to have written the code myself, and
>> apparently neither does anyone else.
>
> I was thinking about this earlier today. It's a bit of a PITA because we
> need the information very early in the planner, before it's done much
> analysis. So for example we might find ourselves duplicating the work
> that will happen later to determine which tables are nullable by outer
> joins. I think this would be all right as long as we ensure that it's
> only done when there's a chance for a win (ie, no extra cycles if
> there's not actually a NOT IN present). It could still be an
> unpleasantly large amount of new code, though.

I haven't looked at the code (I'm not even sure where you're thinking this would need to happen), but is there any way that we can do this and usefully hold onto the results for future use?

> Wouldn't we need to enforce that *all* columns of the subselect are
> non-null, rather than *any*?

[ thinks about it ] Yes.

...Robert