Re: [PERFORM] Issues with \copy from file

2009-12-18 Thread Sigurgeir Gunnarsson
I hope the issue is still open though I haven't replied to it before.

Euler mentioned that I did not provide any details about my system. I'm
using version 8.3 and with most settings default on an old machine with 2 GB
of mem. The table definition is simple, four columns; id, value, x, y where
id is primary key and x, y are combined into an index.
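
For reference, the table described above might look something like this (the
table name and column types are assumptions, they were not given in the
thread):

```sql
-- Hypothetical reconstruction of the table described above;
-- name and column types are assumed, not taken from the thread.
CREATE TABLE points (
    id    integer PRIMARY KEY,
    value double precision,
    x     double precision,
    y     double precision
);
CREATE INDEX points_x_y_idx ON points (x, y);  -- the combined (x, y) index
```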

I'm not sure if it matters but unlike Euler's suggestion I'm using \copy
instead of COPY. Regarding my comparison to MySQL, it is completely valid.
This is done on the same computer, using the same disk on the same platform.
From that I would derive that IO is not my problem, unless postgresql is
doing IO twice while MySQL does it only once.

I guess my MySQL tables are InnoDB since that is the default type (or so I
think). Wrapping the load in BEGIN/COMMIT did not change much. Are there any
other suggestions?

My postgres.conf:
#--
# RESOURCE USAGE (except WAL)
#--

# - Memory -
shared_buffers = 16MB   # min 128kB or max_connections*16kB
temp_buffers = 16MB # min 800kB
#max_prepared_transactions = 5  # can be 0 or more
# Note:  Increasing max_prepared_transactions costs ~600 bytes of shared memory
# per transaction slot, plus lock space (see max_locks_per_transaction).
work_mem = 128MB                        # min 64kB
maintenance_work_mem = 128MB            # min 1MB
#max_stack_depth = 2MB  # min 100kB

# - Free Space Map -
max_fsm_pages = 2097152         # min max_fsm_relations*16, 6 bytes each
max_fsm_relations = 500 # min 100, ~70 bytes each

# - Kernel Resource Usage -
#max_files_per_process = 1000   # min 25
#shared_preload_libraries = ''  # (change requires restart)
# - Cost-Based Vacuum Delay -
#vacuum_cost_delay = 0  # 0-1000 milliseconds
#vacuum_cost_page_hit = 1       # 0-10000 credits
#vacuum_cost_page_miss = 10     # 0-10000 credits
#vacuum_cost_page_dirty = 20    # 0-10000 credits
#vacuum_cost_limit = 200        # 1-10000 credits

#--
# WRITE AHEAD LOG
#--

# - Settings -
#fsync = on                     # turns forced synchronization on or off
#synchronous_commit = on        # immediate fsync at commit
#wal_sync_method = fsync        # the default is the first option
#full_page_writes = on  # recover from partial page writes
#wal_buffers = 64kB # min 32kB
#wal_writer_delay = 200ms       # 1-10000 milliseconds
#commit_delay = 0               # range 0-100000, in microseconds
#commit_siblings = 5            # range 1-1000

# - Checkpoints -
checkpoint_segments = 64        # in logfile segments, min 1, 16MB each
#checkpoint_timeout = 5min  # range 30s-1h
#checkpoint_completion_target = 0.9     # checkpoint target duration, 0.0 - 1.0
#checkpoint_warning = 30s   # 0 is off

# - Archiving -
#archive_mode = off # allows archiving to be done
#archive_command = ''           # command to use to archive a logfile segment
#archive_timeout = 0            # force a logfile segment switch after this

#--
# AUTOVACUUM PARAMETERS
#--

autovacuum = on                 # Enable autovacuum subprocess?  'on'


2009/10/19 Matthew Wakeling 

> On Sun, 18 Oct 2009, Scott Marlowe wrote:
>
>> You can only write data then commit it so fast to one drive, and that
>> speed is usually somewhere in the megabyte per second range.  450+150
>> in 5 minutes is 120 Megs per second, that's pretty fast, but is likely
>> the max speed of a modern super fast 15k rpm drive.  If it's taking 20
>> minutes then it's 30 Megs per second which is still really good if
>> you're in the middle of a busy afternoon and the db has other things
>> to do.
>>
>
> You're out by a factor of 60. That's minutes, not seconds.
>
> More relevant is the fact that Postgres will normally log changes in the
> WAL, effectively writing the data twice. As Euler said, the trick is to tell
> Postgres that noone else will need to see the data, so it can skip the WAL
> step:
>
>
>  BEGIN;
>> TRUNCATE TABLE foo;
>> COPY foo FROM ...;
>> COMMIT;
>>
>
> I see upward of 100MB/s over here when I do this.
>
> Matthew
>
> --
> Patron: "I am looking for a globe of the earth."
> Librarian: "We have a table-top model over here."
> Patron: "No, that's not good enough. Don't you have a life-size?"
> Librarian: (pause) "Yes, but it's in use right now."

Re: [PERFORM] Automatic optimization of IN clauses via INNER JOIN

2009-12-18 Thread Robert Haas
On Thu, Dec 17, 2009 at 9:20 PM, Craig Ringer
 wrote:
> On 17/12/2009 11:57 PM, Tom Lane wrote:
>>
>> Thomas Hamilton  writes:
>>>
>>> But in our testing under the same optimization and conditions INNER JOIN
>>> is significantly outperforming IN.
>>
>> [ shrug... ]  You haven't provided any details, so it's impossible to
>> offer any useful advice.
>
> In other words: can we discuss this with reference to a specific case?
> Please provide your queries, your EXPLAIN ANALYZE output, and other relevant
> details as per:
>
>  http://wiki.postgresql.org/wiki/SlowQueryQuestions
>
> I'd be interested in knowing whether the planner can perform such
> transformations and if so why it doesn't myself. I have the vague feeling
> there may be semantic differences in the handling of NULL but I can't
> currently seem to puzzle them out.

NOT IN is the only one that really kills you as far as optimization is
concerned.  IN can be transformed to a join.  NOT IN forces a NOT
(subplan)-type plan, which bites - hard.
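
A sketch of the difference, with made-up table names:

```sql
-- IN: the planner can transform this into a (hash) semi-join.
SELECT o.*
FROM   orders o
WHERE  o.customer_id IN (SELECT c.id FROM customers c);

-- NOT IN: planned as a NOT (subplan) filter, re-evaluated per row.
SELECT o.*
FROM   orders o
WHERE  o.customer_id NOT IN (SELECT c.id FROM customers c);
```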

...Robert

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Automatic optimization of IN clauses via INNER JOIN

2009-12-18 Thread Grzegorz Jaśkiewicz
On Fri, Dec 18, 2009 at 2:18 PM, Robert Haas  wrote:

> NOT IN is the only that really kills you as far as optimization is
> concerned.  IN can be transformed to a join.  NOT IN forces a NOT
> (subplan)-type plan, which bites - hard.

In a well-designed database (read: one not abusing NULLs) it can be done
with joins too.



-- 
GJ



Re: [PERFORM] Issues with \copy from file

2009-12-18 Thread Robert Haas
On Fri, Dec 18, 2009 at 7:46 AM, Sigurgeir Gunnarsson
 wrote:
> I hope the issue is still open though I haven't replied to it before.
>
> Euler mentioned that I did not provide any details about my system. I'm
> using version 8.3 and with most settings default on an old machine with 2 GB
> of mem. The table definition is simple, four columns; id, value, x, y where
> id is primary key and x, y are combined into an index.
>
> I'm not sure if it matters but unlike Euler's suggestion I'm using \copy
> instead of COPY. Regarding my comparison to MySQL, it is completely valid.
> This is done on the same computer, using the same disk on the same platform.
> From that I would derive that IO is not my problem, unless postgresql is
> doing IO twice while MySQL only once.
>
> I guess my tables are InnoDB since that is the default type (or so I think).
> BEGIN/COMMIT I did not find change much. Are there any other suggestions ?

Did you read Matthew Wakeling's reply?  Arranging to skip WAL will
help a lot here.  To do that, you need to either create or truncate
the table in the same transaction that does the COPY.
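
In psql, where \copy is being used, that could look like the following sketch
(table and file names are assumed; note that this WAL-skip optimization only
applies when WAL archiving is not enabled):

```sql
BEGIN;
TRUNCATE TABLE points;           -- truncate in the same transaction ...
\copy points FROM 'data.txt'     -- ... so this COPY can skip WAL logging
COMMIT;
```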

The problem with the MySQL comparison is that it's not really
relevant.   It isn't that the PostgreSQL code just sucks and if we
wrote it properly it would be as fast as MySQL.  If that were the
case, everyone would be up in arms, and it would have been fixed long
ago.  Rather, the problem is almost certainly that it's not an
apples-to-apples comparison.  MySQL is probably doing something
different, such as perhaps not properly arranging for recovery if the
system goes down in the middle of the copy, or just after it
completes.  But I don't know MySQL well enough to know exactly what
the difference is, and I'm not particularly interested in spending a
lot of time figuring it out.  I think you'll get that reaction from
others on this list as well, but of course that's up to them.
Everybody here is a volunteer, of course, and generally our interest
is principally PostgreSQL.

On the other hand, we can certainly give you lots of information about
what PostgreSQL is doing and why that takes the amount of time that it
does, or give you information on how you can find out more about what
it's doing.

...Robert



Re: [PERFORM] Automatic optimization of IN clauses via INNER JOIN

2009-12-18 Thread Robert Haas
2009/12/18 Grzegorz Jaśkiewicz :
> On Fri, Dec 18, 2009 at 2:18 PM, Robert Haas  wrote:
>
>> NOT IN is the only that really kills you as far as optimization is
>> concerned.  IN can be transformed to a join.  NOT IN forces a NOT
>> (subplan)-type plan, which bites - hard.
>
> in a well designed database (read: not abusing NULLs) - it can be done
> with joins too.

But not by PostgreSQL, or so I believe.

...Robert



Re: [PERFORM] Automatic optimization of IN clauses via INNER JOIN

2009-12-18 Thread Grzegorz Jaśkiewicz
2009/12/18 Robert Haas :
> 2009/12/18 Grzegorz Jaśkiewicz :
>> On Fri, Dec 18, 2009 at 2:18 PM, Robert Haas  wrote:
>>
>>> NOT IN is the only that really kills you as far as optimization is
>>> concerned.  IN can be transformed to a join.  NOT IN forces a NOT
>>> (subplan)-type plan, which bites - hard.
>>
>> in a well designed database (read: not abusing NULLs) - it can be done
>> with joins too.
>
> But not by PostgreSQL, or so I believe.

using left join ?



-- 
GJ



[PERFORM] Idea how to get rid of Bitmap Heap Scan

2009-12-18 Thread Michael N. Mikhulya
Hello!

There are many unanswered questions on the internet about whether it is
possible to somehow optimize a "Bitmap Heap Scan", so the problem seems
rather important.

The query I need to optimize is:

EXPLAIN SELECT date_trunc('day', d.created_at) AS day, COUNT(*) AS
download FROM downloads d WHERE d.file_id in (select id from files
where owner_id = 443) AND d.download_status != 0 AND d.created_at >=
'2009-12-05' AND d.created_at < '2009-12-16' GROUP BY 1;

                               QUERY PLAN
---------------------------------------------------------------------------
 HashAggregate  (cost=15809.49..17126.20 rows=87781 width=8)
   ->  Hash Semi Join  (cost=5809.51..15368.11 rows=88276 width=8)
         Hash Cond: (d.file_id = files.id)
         ->  Index Scan using idx_downloads_created_at on downloads d  (cost=0.00..7682.73 rows=88276 width=16)
               Index Cond: ((created_at >= '2009-12-05 00:00:00'::timestamp without time zone) AND (created_at < '2009-12-16 00:00:00'::timestamp without time zone))
         ->  Hash  (cost=5741.51..5741.51 rows=5440 width=8)
               ->  Bitmap Heap Scan on files  (cost=106.42..5741.51 rows=5440 width=8)
                     Recheck Cond: (owner_id = 443)
                     ->  Bitmap Index Scan on idx_files_owner  (cost=0.00..105.06 rows=5440 width=0)
                           Index Cond: (owner_id = 443)

The problem here is that we are forced to fetch "files" rows in the Bitmap
Heap Scan, but there is actually no need for the whole "files" record; the
only data needed is the "files" ids.

The idea is to avoid fetching data from the "files" table and to get the ids
from the index instead (probably a little tricky, but this is a performance
area...).

I created an index with the following command:
create index idx_files_owner_id ON files (owner_id, id);
and even tried removing the old index to force PostgreSQL to use the newly
created one, but PostgreSQL still does a Bitmap Heap Scan.

(The other idea is to use raw_id as the primary key of the "files" table so
as not to widen the index, but I don't know whether that is possible at all
or whether it has drawbacks.)

I think it would be worth teaching PostgreSQL to do this trick, especially
considering how many questions there are about whether such queries can be
optimized.

If there is an known solution to this problem please provide a link to it.

With best regards,
Michael Mikhulya.



Re: [PERFORM] Idea how to get rid of Bitmap Heap Scan

2009-12-18 Thread Matthew Wakeling

On Fri, 18 Dec 2009, Michael N. Mikhulya wrote:

> The problem here is that we are forced to fetch "files" in Bitmap Heap Scan.
> But actually there is no need for the whole "files" record. The
> necessary data is only "files" ids.
>
> The idea is to avoid fetching data from "files" table, and get the ids
> from index! (probably it is a little bit tricky, but it is a
> performance area...)


Unfortunately, the index does not contain enough information to accomplish 
this. This is due to Postgres' advanced concurrency control system. 
Postgres needs to fetch the actual rows from the files table in order to 
check whether that row is visible in the current transaction, and a Bitmap 
Index Scan is the fastest way to do this.


You can speed this up in Postgres 8.4 by having a RAID array and setting 
the effective_io_concurrency configuration parameter to the number of 
spindles in the RAID array, or by having gobs of RAM and keeping everything 
in cache.


Matthew

--
A good programmer is one who looks both ways before crossing a one-way street.
Considering the quality and quantity of one-way streets in Cambridge, it
should be no surprise that there are so many good programmers there.



Re: [PERFORM] Issues with \copy from file

2009-12-18 Thread Sigurgeir Gunnarsson
The intention was never to talk down PostgreSQL, but rather to get some
explanation of this difference so that I could make the proper changes.

After reading the link from Euler's post, which I had overlooked, I have
managed to shorten the import time. My problem was with the indexes. I was
able to shorten the import time of a 26 million line import from over 2 hours
(I gave up after that) down to 12 minutes by dropping the indexes after the
truncate and before the copy.
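
A sketch of that load procedure (table, index, and file names are assumed):

```sql
BEGIN;
TRUNCATE TABLE points;
DROP INDEX points_x_y_idx;                       -- drop secondary indexes
ALTER TABLE points DROP CONSTRAINT points_pkey;  -- and the primary key index
\copy points FROM 'data.txt'
ALTER TABLE points ADD PRIMARY KEY (id);         -- rebuild in bulk afterwards
CREATE INDEX points_x_y_idx ON points (x, y);
COMMIT;
```

Rebuilding an index over all rows at once is much cheaper than maintaining it
incrementally for every inserted row.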

This is what I was expecting and I'm therefore satisfied with the result.

Regards, Sigurgeir

2009/12/18 Robert Haas 

> On Fri, Dec 18, 2009 at 7:46 AM, Sigurgeir Gunnarsson
>  wrote:
> > I hope the issue is still open though I haven't replied to it before.
> >
> > Euler mentioned that I did not provide any details about my system. I'm
> > using version 8.3 and with most settings default on an old machine with 2 GB
> > of mem. The table definition is simple, four columns; id, value, x, y where
> > id is primary key and x, y are combined into an index.
> >
> > I'm not sure if it matters but unlike Euler's suggestion I'm using \copy
> > instead of COPY. Regarding my comparison to MySQL, it is completely valid.
> > This is done on the same computer, using the same disk on the same platform.
> > From that I would derive that IO is not my problem, unless postgresql is
> > doing IO twice while MySQL only once.
> >
> > I guess my tables are InnoDB since that is the default type (or so I think).
> > BEGIN/COMMIT I did not find change much. Are there any other suggestions ?
>
> Did you read Matthew Wakeling's reply?  Arranging to skip WAL will
> help a lot here.  To do that, you need to either create or truncate
> the table in the same transaction that does the COPY.
>
> The problem with the MySQL comparison is that it's not really
> relevant.   It isn't that the PostgreSQL code just sucks and if we
> wrote it properly it would be as fast as MySQL.  If that were the
> case, everyone would be up in arms, and it would have been fixed long
> ago.  Rather, the problem is almost certainly that it's not an
> apples-to-apples comparison.  MySQL is probably doing something
> different, such as perhaps not properly arranging for recovery if the
> system goes down in the middle of the copy, or just after it
> completes.  But I don't know MySQL well enough to know exactly what
> the difference is, and I'm not particularly interested in spending a
> lot of time figuring it out.  I think you'll get that reaction from
> others on this list as well, but of course that's up to them.
> Everybody here is a volunteer, of course, and generally our interest
> is principally PostgreSQL.
>
> On the other hand, we can certainly give you lots of information about
> what PostgreSQL is doing and why that takes the amount of time that it
> does, or give you information on how you can find out more about what
> it's doing.
>
> ...Robert
>


Re: [PERFORM] Idea how to get rid of Bitmap Heap Scan

2009-12-18 Thread Michael N. Mikhulya
Thank you very much. I now see why it is done that way.

But I'm still curious whether it is possible to avoid fetching data from the
"files" table, since inappropriate ids (e.g. those of removed files) would
not produce any wrong result: they are indirectly "checked" against the
"downloads" table. That is, if we get an id (from the index) for a file that
has actually been removed, we simply won't find anything for it in the
"downloads" table.
My knowledge of MVCC is probably too limited to see the whole picture, so if
it is not too much trouble, please point out the "failure" scenario (where we
would get a wrong result) or the locking issue.

Michael Mikhulya

> Unfortunately, the index does not contain enough information to accomplish
> this. This is due to Postgres' advanced concurrency control system. Postgres
> needs to fetch the actual rows from the files table in order to check
> whether that row is visible in the current transaction, and a Bitmap Index
> Scan is the fastest way to do this.
>
> You can speed this up in Postgres 8.4 by having a RAID array and setting the
> effective_io_concurrency configuration parameter to the number of spindles
> in the RAID array, or by having gobs of RAM and keeping everything in cache.
>
> Matthew
>
> --
> A good programmer is one who looks both ways before crossing a one-way
> street.
> Considering the quality and quantity of one-way streets in Cambridge, it
> should be no surprise that there are so many good programmers there.
>



Re: [PERFORM] Idea how to get rid of Bitmap Heap Scan

2009-12-18 Thread Greg Stark
On Fri, Dec 18, 2009 at 4:18 PM, Michael N. Mikhulya
 wrote:
> Thank you very much. I catch the point why it is done so.
>
> But I'm curious whether it is still possible to don't fetch data from
> files table just because inappropriate ids (e.g. removed ones) will
> not produce any wrong effect just because them indirectly "checked" on
> downloads table?
> Here I mean that if we get id (from index) for file which is actually
> removed, then we will not find anything in downloads table.
> Probably my knowledge about MVCC is too little to see whole picture,
> so if it is not hard to you please point the "failure" scenario (when
> we get wrong result) or locking issue, ...


Yup this ought to be possible and fruitful, I believe Heikki already
produced a partial patch to this end. If you're interested in working
on it you could skim back in the logs and start with that. I don't
recall any special keywords to search on but it might be in one of the
threads for the "visibility map" or it might be under "index-only
scans".

A word of warning, in my experience the hardest part for changes like
this isn't the executor changes (which in this case wouldn't be far
from easy) but the planner changes to detect when this new plan would
be better.

-- 
greg



Re: [PERFORM] Automatic optimization of IN clauses via INNER JOIN

2009-12-18 Thread Robert Haas
2009/12/18 Grzegorz Jaśkiewicz :
> 2009/12/18 Robert Haas :
>> 2009/12/18 Grzegorz Jaśkiewicz :
>>> On Fri, Dec 18, 2009 at 2:18 PM, Robert Haas  wrote:
>>>
>>>> NOT IN is the only that really kills you as far as optimization is
>>>> concerned.  IN can be transformed to a join.  NOT IN forces a NOT
>>>> (subplan)-type plan, which bites - hard.
>>>
>>> in a well designed database (read: not abusing NULLs) - it can be done
>>> with joins too.
>>
>> But not by PostgreSQL, or so I believe.
>
> using left join ?

If at least one column in the subselect is strict, you can rewrite it
that way yourself, but the optimizer won't do it. I wish it did, but I
don't wish it badly enough to have written the code myself, and
apparently neither does anyone else.
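
The manual rewrite, sketched with made-up names (valid only when the
subselect column can never be NULL; note also that rows where a.x is NULL
behave differently in the two forms):

```sql
-- Original form:
SELECT a.* FROM a WHERE a.x NOT IN (SELECT b.x FROM b);

-- Anti-join rewrite, assuming b.x is declared NOT NULL:
SELECT a.*
FROM   a LEFT JOIN b ON a.x = b.x
WHERE  b.x IS NULL;
```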

...Robert



Re: [PERFORM] Issues with \copy from file

2009-12-18 Thread Robert Haas
On Fri, Dec 18, 2009 at 10:51 AM, Sigurgeir Gunnarsson
 wrote:
> The intention was never to talk down postgresql but rather trying to get
> some explanation of this difference so that I could do the proper changes.
>
> After having read the link from Euler's post, which I oversaw, I have
> managed to shorten the import time. My problem was with the indexes. I was
> able to shorten the import time, of a 26 million line import, from 2 hours +
> (I gave up after that time) downto 12 minutes by dropping the indexes after
> truncate and before copy.
>
> This is what I was expecting and I'm therefore satisfied with the result.

Ah ha!  Well, it sounds like perhaps you have the answer to what was
causing the difference too, then.  I'm not trying to be unhelpful,
just trying to explain honestly why you might not get exactly the
response you expect to MySQL comparisons - we only understand half of
it.

...Robert



Re: [PERFORM] Automatic optimization of IN clauses via INNER JOIN

2009-12-18 Thread Tom Lane
Robert Haas  writes:
> If at least one column in the subselect is strict, you can rewrite it
> that way yourself, but the optimizer won't do it. I wish it did, but I
> don't wish it badly enough to have written the code myself, and
> apparently neither does anyone else.

I was thinking about this earlier today.  It's a bit of a PITA because
we need the information very early in the planner, before it's done much
analysis.  So for example we might find ourselves duplicating the work
that will happen later to determine which tables are nullable by outer
joins.  I think this would be all right as long as we ensure that it's
only done when there's a chance for a win (ie, no extra cycles if
there's not actually a NOT IN present).  It could still be an
unpleasantly large amount of new code though.

Wouldn't we need to enforce that *all* columns of the subselect are
non-null, rather than *any*?
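
A single NULL in the subselect output is enough to change the result: 2 NOT
IN (1, NULL) evaluates to UNKNOWN rather than TRUE, because 2 <> NULL is
unknown. For example (made-up values):

```sql
-- Returns no rows: the NULL in b makes every NOT IN test UNKNOWN.
SELECT x
FROM   (VALUES (2)) AS a(x)
WHERE  x NOT IN (SELECT y FROM (VALUES (1), (NULL::integer)) AS b(y));
```

An anti-join rewrite would instead return the row with x = 2, so the
transformation is only safe when no subselect column can produce a NULL.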

regards, tom lane



Re: [PERFORM] Automatic optimization of IN clauses via INNER JOIN

2009-12-18 Thread Robert Haas
On Fri, Dec 18, 2009 at 7:32 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> If at least one column in the subselect is strict, you can rewrite it
>> that way yourself, but the optimizer won't do it. I wish it did, but I
>> don't wish it badly enough to have written the code myself, and
>> apparently neither does anyone else.
>
> I was thinking about this earlier today.  It's a bit of a PITA because
> we need the information very early in the planner, before it's done much
> analysis.  So for example we might find ourselves duplicating the work
> that will happen later to determine which tables are nullable by outer
> joins.  I think this would be all right as long as we ensure that it's
> only done when there's a chance for a win (ie, no extra cycles if
> there's not actually a NOT IN present).  It could still be an
> unpleasantly large amount of new code though.

I haven't looked at the code (I'm not even sure where you're thinking
this would need to happen) but is there any way that we can do this
and usefully hold onto the results for future use?

> Wouldn't we need to enforce that *all* columns of the subselect are
> non-null, rather than *any*?

[ thinks about it ]

Yes.

...Robert
