Re: [PERFORM] Issues with \copy from file

2009-12-18 Thread Sigurgeir Gunnarsson
I hope the issue is still open though I haven't replied to it before.

Euler mentioned that I did not provide any details about my system. I'm
using version 8.3, with most settings at their defaults, on an old machine
with 2 GB of memory. The table definition is simple, four columns: id,
value, x, y, where id is the primary key and x and y are combined into an
index.

I'm not sure if it matters, but unlike Euler's suggestion I'm using \copy
instead of COPY. Regarding my comparison to MySQL, it is completely valid:
it is done on the same computer, using the same disk, on the same platform.
From that I would conclude that I/O is not my problem, unless PostgreSQL is
doing I/O twice while MySQL does it only once.

I guess my tables are InnoDB, since that is the default type (or so I
think). Wrapping the load in BEGIN/COMMIT did not change much. Are there any
other suggestions?
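
For the record, the storage engine can be checked from the MySQL client
like this (the table name here is just an example):

    -- run in the MySQL client, not psql; 'points' is a made-up name
    SHOW TABLE STATUS LIKE 'points';
    -- the Engine column reports InnoDB or MyISAM; the two have very
    -- different bulk-load behaviour, since MyISAM does no transactional
    -- logging at all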

My postgres.conf:
#------------------------------------------------------------------------------
# RESOURCE USAGE (except WAL)
#------------------------------------------------------------------------------

# - Memory -
shared_buffers = 16MB   # min 128kB or max_connections*16kB
temp_buffers = 16MB # min 800kB
#max_prepared_transactions = 5  # can be 0 or more
# Note:  Increasing max_prepared_transactions costs ~600 bytes of shared memory
# per transaction slot, plus lock space (see max_locks_per_transaction).
work_mem = 128MB                # min 64kB
maintenance_work_mem = 128MB    # min 1MB
#max_stack_depth = 2MB  # min 100kB

# - Free Space Map -
max_fsm_pages = 2097152     # min max_fsm_relations*16, 6 bytes each
max_fsm_relations = 500     # min 100, ~70 bytes each

# - Kernel Resource Usage -
#max_files_per_process = 1000   # min 25
#shared_preload_libraries = ''  # (change requires restart)
# - Cost-Based Vacuum Delay -
#vacuum_cost_delay = 0  # 0-1000 milliseconds
#vacuum_cost_page_hit = 1       # 0-10000 credits
#vacuum_cost_page_miss = 10     # 0-10000 credits
#vacuum_cost_page_dirty = 20    # 0-10000 credits
#vacuum_cost_limit = 200        # 1-10000 credits

#------------------------------------------------------------------------------
# WRITE AHEAD LOG
#------------------------------------------------------------------------------

# - Settings -
#fsync = on                 # turns forced synchronization on or off
#synchronous_commit = on    # immediate fsync at commit
#wal_sync_method = fsync# the default is the first option
#full_page_writes = on  # recover from partial page writes
#wal_buffers = 64kB # min 32kB
#wal_writer_delay = 200ms   # 1-10000 milliseconds
#commit_delay = 0           # range 0-100000, in microseconds
#commit_siblings = 5# range 1-1000

# - Checkpoints -
checkpoint_segments = 64    # in logfile segments, min 1, 16MB each
#checkpoint_timeout = 5min  # range 30s-1h
#checkpoint_completion_target = 0.9 # checkpoint target duration, 0.0 - 1.0
#checkpoint_warning = 30s   # 0 is off

# - Archiving -
#archive_mode = off # allows archiving to be done
#archive_command = ''   # command to use to archive a logfile segment
#archive_timeout = 0    # force a logfile segment switch after this
                        # time; 0 is off

#------------------------------------------------------------------------------
# AUTOVACUUM PARAMETERS
#------------------------------------------------------------------------------

autovacuum = on # Enable autovacuum subprocess?  'on'


2009/10/19 Matthew Wakeling matt...@flymine.org

 On Sun, 18 Oct 2009, Scott Marlowe wrote:

 You can only write data then commit it so fast to one drive, and that
 speed is usually somewhere in the megabyte per second range.  450+150
 in 5 minutes is 120 Megs per second, that's pretty fast, but is likely
 the max speed of a modern super fast 15k rpm drive.  If it's taking 20
 minutes then it's 30 Megs per second which is still really good if
 you're in the middle of a busy afternoon and the db has other things
 to do.


 You're out by a factor of 60. That's minutes, not seconds.

 More relevant is the fact that Postgres will normally log changes in the
 WAL, effectively writing the data twice. As Euler said, the trick is to tell
 Postgres that no one else will need to see the data, so it can skip the WAL
 step:


 BEGIN;
 TRUNCATE TABLE foo;
 COPY foo FROM ...;
 COMMIT;


 I see upward of 100MB/s over here when I do this.

 Matthew

 --
 Patron: I am looking for a globe of the earth.
 Librarian: We have a table-top model over here.
 Patron: No, that's not good enough. Don't you have a life-size?
 Librarian: (pause) Yes, but it's in use right now.



Re: [PERFORM] Automatic optimization of IN clauses via INNER JOIN

2009-12-18 Thread Robert Haas
On Thu, Dec 17, 2009 at 9:20 PM, Craig Ringer
cr...@postnewspapers.com.au wrote:
 On 17/12/2009 11:57 PM, Tom Lane wrote:

 Thomas Hamilton thomashamilto...@yahoo.com writes:

 But in our testing under the same optimization and conditions INNER JOIN
 is significantly outperforming IN.

 [ shrug... ]  You haven't provided any details, so it's impossible to
 offer any useful advice.

 In other words: can we discuss this with reference to a specific case?
 Please provide your queries, your EXPLAIN ANALYZE output, and other relevant
 details as per:

  http://wiki.postgresql.org/wiki/SlowQueryQuestions

 I'd be interested myself in knowing whether the planner can perform such
 transformations, and if so why it doesn't. I have the vague feeling there
 may be semantic differences in the handling of NULL, but I can't currently
 seem to puzzle them out.

NOT IN is the only one that really kills you as far as optimization is
concerned.  IN can be transformed to a join.  NOT IN forces a NOT
(subplan)-type plan, which bites - hard.
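
A sketch with made-up table names (and assuming customers.id is unique):

    -- These two forms are equivalent, and the planner can turn the IN
    -- into a join:
    SELECT * FROM orders WHERE customer_id IN (SELECT id FROM customers);
    SELECT o.* FROM orders o JOIN customers c ON o.customer_id = c.id;

    -- This form cannot be transformed the same way because of its NULL
    -- semantics, and ends up as a NOT (subplan):
    SELECT * FROM orders WHERE customer_id NOT IN (SELECT id FROM customers);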

...Robert



Re: [PERFORM] Automatic optimization of IN clauses via INNER JOIN

2009-12-18 Thread Grzegorz Jaśkiewicz
On Fri, Dec 18, 2009 at 2:18 PM, Robert Haas robertmh...@gmail.com wrote:

 NOT IN is the only one that really kills you as far as optimization is
 concerned.  IN can be transformed to a join.  NOT IN forces a NOT
 (subplan)-type plan, which bites - hard.

In a well-designed database (read: one not abusing NULLs) it can be done
with joins too.



-- 
GJ



Re: [PERFORM] Issues with \copy from file

2009-12-18 Thread Robert Haas
On Fri, Dec 18, 2009 at 7:46 AM, Sigurgeir Gunnarsson
sgunn...@gmail.com wrote:
 I hope the issue is still open though I haven't replied to it before.

 Euler mentioned that I did not provide any details about my system. I'm
 using version 8.3, with most settings at their defaults, on an old machine
 with 2 GB of memory. The table definition is simple, four columns: id,
 value, x, y, where id is the primary key and x and y are combined into an
 index.

 I'm not sure if it matters, but unlike Euler's suggestion I'm using \copy
 instead of COPY. Regarding my comparison to MySQL, it is completely valid:
 it is done on the same computer, using the same disk, on the same platform.
 From that I would conclude that I/O is not my problem, unless PostgreSQL is
 doing I/O twice while MySQL does it only once.

 I guess my tables are InnoDB, since that is the default type (or so I
 think). Wrapping the load in BEGIN/COMMIT did not change much. Are there
 any other suggestions?

Did you read Matthew Wakeling's reply?  Arranging to skip WAL will
help a lot here.  To do that, you need to either create or truncate
the table in the same transaction that does the COPY.
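
With psql's \copy that looks something like this (table and file names are
made up; WAL can only be skipped while archiving is off, as it is in your
configuration):

    BEGIN;
    TRUNCATE TABLE mytable;        -- truncated in the same transaction...
    \copy mytable from 'data.txt'
    -- ...so the COPY underneath can skip writing WAL
    COMMIT;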

The problem with the MySQL comparison is that it's not really
relevant.   It isn't that the PostgreSQL code just sucks and if we
wrote it properly it would be as fast as MySQL.  If that were the
case, everyone would be up in arms, and it would have been fixed long
ago.  Rather, the problem is almost certainly that it's not an
apples-to-apples comparison.  MySQL is probably doing something
different, such as perhaps not properly arranging for recovery if the
system goes down in the middle of the copy, or just after it
completes.  But I don't know MySQL well enough to know exactly what
the difference is, and I'm not particularly interested in spending a
lot of time figuring it out.  I think you'll get that reaction from
others on this list as well, but of course that's up to them.
Everybody here is a volunteer, of course, and generally our interest
is principally PostgreSQL.

On the other hand, we can certainly give you lots of information about
what PostgreSQL is doing and why that takes the amount of time that it
does, or give you information on how you can find out more about what
it's doing.

...Robert



Re: [PERFORM] Automatic optimization of IN clauses via INNER JOIN

2009-12-18 Thread Robert Haas
2009/12/18 Grzegorz Jaśkiewicz gryz...@gmail.com:
 On Fri, Dec 18, 2009 at 2:18 PM, Robert Haas robertmh...@gmail.com wrote:

 NOT IN is the only one that really kills you as far as optimization is
 concerned.  IN can be transformed to a join.  NOT IN forces a NOT
 (subplan)-type plan, which bites - hard.

 in a well designed database (read: not abusing NULLs) - it can be done
 with joins too.

But not by PostgreSQL, or so I believe.

...Robert



Re: [PERFORM] Automatic optimization of IN clauses via INNER JOIN

2009-12-18 Thread Grzegorz Jaśkiewicz
2009/12/18 Robert Haas robertmh...@gmail.com:
 2009/12/18 Grzegorz Jaśkiewicz gryz...@gmail.com:
 On Fri, Dec 18, 2009 at 2:18 PM, Robert Haas robertmh...@gmail.com wrote:

 NOT IN is the only one that really kills you as far as optimization is
 concerned.  IN can be transformed to a join.  NOT IN forces a NOT
 (subplan)-type plan, which bites - hard.

 in a well designed database (read: not abusing NULLs) - it can be done
 with joins too.

 But not by PostgreSQL, or so I believe.

using a left join?



-- 
GJ



[PERFORM] Idea how to get rid of Bitmap Heap Scan

2009-12-18 Thread Michael N. Mikhulya
Hello!

There are many unanswered questions on the internet about whether it is
possible to optimize a Bitmap Heap Scan somehow, so the problem seems to be
rather important.

The query I need to optimize is:

EXPLAIN SELECT date_trunc('day', d.created_at) AS day, COUNT(*) AS
download FROM downloads d WHERE d.file_id in (select id from files
where owner_id = 443) AND d.download_status != 0 AND d.created_at >=
'2009-12-05' AND d.created_at < '2009-12-16' GROUP BY 1;

                                  QUERY PLAN
------------------------------------------------------------------------------
 HashAggregate  (cost=15809.49..17126.20 rows=87781 width=8)
   ->  Hash Semi Join  (cost=5809.51..15368.11 rows=88276 width=8)
         Hash Cond: (d.file_id = files.id)
         ->  Index Scan using idx_downloads_created_at on downloads d
               (cost=0.00..7682.73 rows=88276 width=16)
               Index Cond: ((created_at >= '2009-12-05 00:00:00'::timestamp
               without time zone) AND (created_at < '2009-12-16
               00:00:00'::timestamp without time zone))
         ->  Hash  (cost=5741.51..5741.51 rows=5440 width=8)
               ->  Bitmap Heap Scan on files  (cost=106.42..5741.51
                     rows=5440 width=8)
                     Recheck Cond: (owner_id = 443)
                     ->  Bitmap Index Scan on idx_files_owner
                           (cost=0.00..105.06 rows=5440 width=0)
                           Index Cond: (owner_id = 443)

The problem here is that we are forced to fetch the files rows in the
Bitmap Heap Scan, but actually there is no need for the whole files
record; the only data needed is the files ids.

The idea is to avoid fetching data from the files table and to get the ids
from the index instead (probably it is a little bit tricky, but this is a
performance area...).

I created an index with the following command:
create index idx_files_owner_id ON files (owner_id, id);
and even tried to remove the old index to force PostgreSQL to use the
newly created index. But PostgreSQL still does a Bitmap Heap Scan.

(The other idea is to use raw_id as the primary key of the files table so
as not to extend the index. But I don't know whether that is possible at
all, or whether this idea has some drawbacks.)

I think it would be worth teaching PostgreSQL to do this trick, especially
taking into account that there are many questions about whether it is
possible to optimize such queries.

If there is a known solution to this problem, please provide a link to it.

With best regards,
Michael Mikhulya.



Re: [PERFORM] Idea how to get rid of Bitmap Heap Scan

2009-12-18 Thread Matthew Wakeling

On Fri, 18 Dec 2009, Michael N. Mikhulya wrote:

The problem here is that we are forced to fetch the files rows in the
Bitmap Heap Scan, but actually there is no need for the whole files
record; the only data needed is the files ids.

The idea is to avoid fetching data from the files table and to get the ids
from the index instead (probably it is a little bit tricky, but this is a
performance area...).


Unfortunately, the index does not contain enough information to accomplish
this. This is due to Postgres' advanced concurrency control system.
Postgres needs to fetch the actual rows from the files table in order to
check whether each row is visible in the current transaction, and a Bitmap
Heap Scan is the fastest way to do this.


You can speed this up in Postgres 8.4 by having a RAID array and setting
the effective_io_concurrency configuration parameter to the number of
spindles in the RAID array, or by having gobs of RAM and keeping everything
in cache.
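
For example, something like this in postgresql.conf (the value is just an
assumption, to be tuned to the hardware):

    effective_io_concurrency = 4    # roughly the number of spindles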


Matthew

--
A good programmer is one who looks both ways before crossing a one-way street.
Considering the quality and quantity of one-way streets in Cambridge, it
should be no surprise that there are so many good programmers there.



Re: [PERFORM] Issues with \copy from file

2009-12-18 Thread Sigurgeir Gunnarsson
The intention was never to talk down PostgreSQL, but rather to try to get
some explanation of this difference so that I could make the proper changes.

After having read the link from Euler's post, which I had overlooked, I have
managed to shorten the import time. My problem was with the indexes. I was
able to shorten the import time of a 26 million line import from 2+ hours
(I gave up after that time) down to 12 minutes by dropping the indexes after
the truncate and before the copy, as sketched below.
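
In outline, the load now looks something like this (the index and table
names here are made up for illustration):

    BEGIN;
    TRUNCATE TABLE points;
    DROP INDEX idx_points_x_y;                       -- the (x, y) index
    ALTER TABLE points DROP CONSTRAINT points_pkey;  -- the id primary key
    \copy points from 'data.txt'
    ALTER TABLE points ADD PRIMARY KEY (id);
    CREATE INDEX idx_points_x_y ON points (x, y);
    COMMIT;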

This is what I was expecting and I'm therefore satisfied with the result.

Regards, Sigurgeir

2009/12/18 Robert Haas robertmh...@gmail.com

 On Fri, Dec 18, 2009 at 7:46 AM, Sigurgeir Gunnarsson
 sgunn...@gmail.com wrote:
  I hope the issue is still open though I haven't replied to it before.
 
  Euler mentioned that I did not provide any details about my system. I'm
  using version 8.3, with most settings at their defaults, on an old
  machine with 2 GB of memory. The table definition is simple, four
  columns: id, value, x, y, where id is the primary key and x and y are
  combined into an index.
 
  I'm not sure if it matters, but unlike Euler's suggestion I'm using
  \copy instead of COPY. Regarding my comparison to MySQL, it is
  completely valid: it is done on the same computer, using the same disk,
  on the same platform. From that I would conclude that I/O is not my
  problem, unless PostgreSQL is doing I/O twice while MySQL does it only
  once.
 
  I guess my tables are InnoDB, since that is the default type (or so I
  think). Wrapping the load in BEGIN/COMMIT did not change much. Are there
  any other suggestions?

 Did you read Matthew Wakeling's reply?  Arranging to skip WAL will
 help a lot here.  To do that, you need to either create or truncate
 the table in the same transaction that does the COPY.

 The problem with the MySQL comparison is that it's not really
 relevant.   It isn't that the PostgreSQL code just sucks and if we
 wrote it properly it would be as fast as MySQL.  If that were the
 case, everyone would be up in arms, and it would have been fixed long
 ago.  Rather, the problem is almost certainly that it's not an
 apples-to-apples comparison.  MySQL is probably doing something
 different, such as perhaps not properly arranging for recovery if the
 system goes down in the middle of the copy, or just after it
 completes.  But I don't know MySQL well enough to know exactly what
 the difference is, and I'm not particularly interested in spending a
 lot of time figuring it out.  I think you'll get that reaction from
 others on this list as well, but of course that's up to them.
 Everybody here is a volunteer, of course, and generally our interest
 is principally PostgreSQL.

 On the other hand, we can certainly give you lots of information about
 what PostgreSQL is doing and why that takes the amount of time that it
 does, or give you information on how you can find out more about what
 it's doing.

 ...Robert



Re: [PERFORM] Idea how to get rid of Bitmap Heap Scan

2009-12-18 Thread Michael N. Mikhulya
Thank you very much. I now see why it is done that way.

But I'm curious whether it is still possible not to fetch data from the
files table, given that inappropriate ids (e.g. ones from removed rows)
would not produce any wrong effect, because they are indirectly checked
against the downloads table. Here I mean that if we get an id (from the
index) for a file which was actually removed, then we will not find
anything for it in the downloads table.
Probably my knowledge about MVCC is too little to see the whole picture,
so if it is not too hard for you, please point out the failure scenario
(where we would get a wrong result) or the locking issue, ...

Michael Mikhulya

 Unfortunately, the index does not contain enough information to accomplish
 this. This is due to Postgres' advanced concurrency control system. Postgres
 needs to fetch the actual rows from the files table in order to check
 whether each row is visible in the current transaction, and a Bitmap Heap
 Scan is the fastest way to do this.

 You can speed this up in Postgres 8.4 by having a RAID array and setting the
 effective_io_concurrency configuration parameter to the number of spindles
 in the RAID array, or by having gobs of RAM and keeping everything in cache.

 Matthew

 --
 A good programmer is one who looks both ways before crossing a one-way
 street.
 Considering the quality and quantity of one-way streets in Cambridge, it
 should be no surprise that there are so many good programmers there.




Re: [PERFORM] Idea how to get rid of Bitmap Heap Scan

2009-12-18 Thread Greg Stark
On Fri, Dec 18, 2009 at 4:18 PM, Michael N. Mikhulya
m.mikhu...@gmail.com wrote:
 Thank you very much. I now see why it is done that way.

 But I'm curious whether it is still possible not to fetch data from the
 files table, given that inappropriate ids (e.g. ones from removed rows)
 would not produce any wrong effect, because they are indirectly checked
 against the downloads table. Here I mean that if we get an id (from the
 index) for a file which was actually removed, then we will not find
 anything for it in the downloads table.
 Probably my knowledge about MVCC is too little to see the whole picture,
 so if it is not too hard for you, please point out the failure scenario
 (where we would get a wrong result) or the locking issue, ...


Yup, this ought to be possible and fruitful; I believe Heikki already
produced a partial patch to this end. If you're interested in working
on it you could skim back in the list archives and start with that. I
don't recall any special keywords to search on, but it might be in one
of the threads for the visibility map, or it might be under index-only
scans.

A word of warning, in my experience the hardest part for changes like
this isn't the executor changes (which in this case wouldn't be far
from easy) but the planner changes to detect when this new plan would
be better.

-- 
greg



Re: [PERFORM] Automatic optimization of IN clauses via INNER JOIN

2009-12-18 Thread Robert Haas
2009/12/18 Grzegorz Jaśkiewicz gryz...@gmail.com:
 2009/12/18 Robert Haas robertmh...@gmail.com:
 2009/12/18 Grzegorz Jaśkiewicz gryz...@gmail.com:
 On Fri, Dec 18, 2009 at 2:18 PM, Robert Haas robertmh...@gmail.com wrote:

 NOT IN is the only one that really kills you as far as optimization is
 concerned.  IN can be transformed to a join.  NOT IN forces a NOT
 (subplan)-type plan, which bites - hard.

 in a well designed database (read: not abusing NULLs) - it can be done
 with joins too.

 But not by PostgreSQL, or so I believe.

 using a left join?

If at least one column in the subselect is strict, you can rewrite it
that way yourself, but the optimizer won't do it. I wish it did, but I
don't wish it badly enough to have written the code myself, and
apparently neither does anyone else.
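
That is, something along these lines (made-up names; the rewrite is only
equivalent if b.a_id is NOT NULL):

    -- original form; forces a NOT (subplan):
    SELECT a.* FROM a WHERE a.id NOT IN (SELECT b.a_id FROM b);

    -- hand-rewritten form the planner handles well, equivalent only
    -- when b.a_id can never be NULL:
    SELECT a.* FROM a LEFT JOIN b ON b.a_id = a.id WHERE b.a_id IS NULL;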

...Robert



Re: [PERFORM] Issues with \copy from file

2009-12-18 Thread Robert Haas
On Fri, Dec 18, 2009 at 10:51 AM, Sigurgeir Gunnarsson
sgunn...@gmail.com wrote:
 The intention was never to talk down PostgreSQL, but rather to try to get
 some explanation of this difference so that I could make the proper changes.

 After having read the link from Euler's post, which I had overlooked, I have
 managed to shorten the import time. My problem was with the indexes. I was
 able to shorten the import time of a 26 million line import from 2+ hours
 (I gave up after that time) down to 12 minutes by dropping the indexes after
 the truncate and before the copy.

 This is what I was expecting and I'm therefore satisfied with the result.

Ah ha!  Well, it sounds like perhaps you have the answer to what was
causing the difference too, then.  I'm not trying to be unhelpful,
just trying to explain honestly why you might not get exactly the
response you expect to MySQL comparisons - we only understand half of
it.

...Robert



Re: [PERFORM] Automatic optimization of IN clauses via INNER JOIN

2009-12-18 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 If at least one column in the subselect is strict, you can rewrite it
 that way yourself, but the optimizer won't do it. I wish it did, but I
 don't wish it badly enough to have written the code myself, and
 apparently neither does anyone else.

I was thinking about this earlier today.  It's a bit of a PITA because
we need the information very early in the planner, before it's done much
analysis.  So for example we might find ourselves duplicating the work
that will happen later to determine which tables are nullable by outer
joins.  I think this would be all right as long as we ensure that it's
only done when there's a chance for a win (ie, no extra cycles if
there's not actually a NOT IN present).  It could still be an
unpleasantly large amount of new code though.

Wouldn't we need to enforce that *all* columns of the subselect are
non-null, rather than *any*?
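
(To illustrate the hazard: a single NULL in the subselect's output makes
NOT IN return no rows at all.)

    -- returns no rows: 2 <> NULL is unknown, so the NOT IN is never true
    SELECT 1 WHERE 2 NOT IN (SELECT x FROM (VALUES (1), (NULL)) t(x));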

regards, tom lane



Re: [PERFORM] Automatic optimization of IN clauses via INNER JOIN

2009-12-18 Thread Robert Haas
On Fri, Dec 18, 2009 at 7:32 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 If at least one column in the subselect is strict, you can rewrite it
 that way yourself, but the optimizer won't do it. I wish it did, but I
 don't wish it badly enough to have written the code myself, and
 apparently neither does anyone else.

 I was thinking about this earlier today.  It's a bit of a PITA because
 we need the information very early in the planner, before it's done much
 analysis.  So for example we might find ourselves duplicating the work
 that will happen later to determine which tables are nullable by outer
 joins.  I think this would be all right as long as we ensure that it's
 only done when there's a chance for a win (ie, no extra cycles if
 there's not actually a NOT IN present).  It could still be an
 unpleasantly large amount of new code though.

I haven't looked at the code (I'm not even sure where you're thinking
this would need to happen) but is there any way that we can do this
and usefully hold onto the results for future use?

 Wouldn't we need to enforce that *all* columns of the subselect are
 non-null, rather than *any*?

[ thinks about it ]

Yes.

...Robert
