Re: [HACKERS] Using indexes for partial index builds

2013-05-26 Thread Jim Nasby

On 3/13/13 7:10 PM, Ants Aasma wrote:

On Thu, Mar 14, 2013 at 1:51 AM, Jim Nasby j...@nasby.net wrote:

On 3/12/13 9:10 AM, Ants Aasma wrote:

I have a feeling this is an increasingly widespread pattern with a
proliferation of mobile devices that need syncing.


If you're doing that with timestamps you're asking for a slew of problems,
not all of which can be solved by just adding some random amount of fluff to
your criteria. A queue-based solution is often a more robust solution, even
if it is harder to implement.


Do you know of anything else besides the obvious issues with having to
use one clocksource and ensure that it produces monotonic timestamps?


Those issues aren't enough? :)


My first reaction was also that this is what queues are meant for, but
the proposed solution seems to work surprisingly well. Unless you can
point at some glaring hole that I'm missing I would say that it is
good enough for a rather wide range of syncing problems.


It depends on how critical it is not to miss events. Timestamps in tables are 
always taken before the transaction commits, so a row can become visible well 
after the timestamp it carries. You have to make certain the timestamp can't be 
changed and that rows can't be deleted. It's also tricky to make certain you 
don't see any events twice.
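
To make the commit-delay problem concrete, here is a minimal sketch (table and 
column names are made up for illustration). now() returns the transaction start 
time, but the row only becomes visible at commit, so a poller keyed off the 
timestamp can skip it for good:

  -- session A: a transaction that takes a while to commit
  BEGIN;
  UPDATE items SET payload = 'new', updated_at = now() WHERE id = 1;
  -- now() here is the transaction start time, say 2013-05-26 12:00:00
  -- ... the transaction stays open for a few minutes ...
  COMMIT;  -- the row only becomes visible around 12:05

  -- session B: a syncer polling by timestamp
  SELECT * FROM items WHERE updated_at > '2013-05-26 12:02:00';
  -- runs at 12:03 and sees nothing (session A hasn't committed yet), then
  -- advances its cutoff past 12:00 and never picks the row up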

Given all that, and how easy PgQ is to use, I don't understand why anyone would 
go with timestamps...
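
For anyone who wants the general shape of the queue approach without pulling in 
PgQ, here is a hand-rolled sketch (all names invented for illustration; PgQ adds 
batching, multiple consumers and cleanup on top of the same idea):

  CREATE TABLE item_changes (
      change_id  bigserial PRIMARY KEY,
      item_id    bigint NOT NULL,
      changed_at timestamptz NOT NULL DEFAULT clock_timestamp()
  );

  CREATE FUNCTION queue_item_change() RETURNS trigger AS $$
  BEGIN
      INSERT INTO item_changes (item_id) VALUES (NEW.id);
      RETURN NEW;
  END;
  $$ LANGUAGE plpgsql;

  CREATE TRIGGER items_queue_change
      AFTER INSERT OR UPDATE ON items
      FOR EACH ROW EXECUTE PROCEDURE queue_item_change();

  -- the consumer drains the queue transactionally; because it deletes exactly
  -- what it processes, producer commit order no longer matters and nothing is
  -- seen twice:
  BEGIN;
  DELETE FROM item_changes RETURNING item_id, changed_at;
  -- ... hand the returned rows to the client ...
  COMMIT;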
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net




Re: [HACKERS] Using indexes for partial index builds

2013-03-13 Thread Jim Nasby

On 3/12/13 9:10 AM, Ants Aasma wrote:

I have a feeling this is an increasingly widespread pattern with a
proliferation of mobile devices that need syncing.


If you're doing that with timestamps you're asking for a slew of problems, not 
all of which can be solved by just adding some random amount of fluff to your 
criteria. A queue-based solution is often a more robust solution, even if it is 
harder to implement.




Re: [HACKERS] Using indexes for partial index builds

2013-03-13 Thread Ants Aasma
On Thu, Mar 14, 2013 at 1:51 AM, Jim Nasby j...@nasby.net wrote:
 On 3/12/13 9:10 AM, Ants Aasma wrote:
 I have a feeling this is an increasingly widespread pattern with a
 proliferation of mobile devices that need syncing.

 If you're doing that with timestamps you're asking for a slew of problems,
 not all of which can be solved by just adding some random amount of fluff to
 your criteria. A queue-based solution is often a more robust solution, even
 if it is harder to implement.

Do you know of anything else besides the obvious issues with having to
use one clocksource and ensure that it produces monotonic timestamps?
My first reaction was also that this is what queues are meant for, but
the proposed solution seems to work surprisingly well. Unless you can
point at some glaring hole that I'm missing I would say that it is
good enough for a rather wide range of syncing problems.

Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] Using indexes for partial index builds

2013-03-12 Thread Ants Aasma
On Mon, Mar 11, 2013 at 9:13 PM, Greg Stark st...@mit.edu wrote:
 On Thu, Mar 7, 2013 at 12:51 AM, Jim Nasby j...@nasby.net wrote:
 Something worth considering on this... I suspect it's possible to use an
 index-only scan to do this, regardless of whether the heap page is all
 visible. The reason is that the newly created index would just use the same
 access methodology as the original index, so any dead rows would be ignored.

 This is actually quite clever. I wonder how many other cases can use
 similar logic.

I actually just dealt with a case where this would have been helpful.
The case is finding rows in a huge table where a foreign key
reference matches one of a list of IDs and the last change date is
later than some specific value. The best plan for this is to build
bitmaps for both conditions. The performance issue is that the bitmap
for the IDs can get pretty large and expensive to construct. The
solution is to keep a partial index on the foreign key, predicated on
a recent timestamp that covers most queries, removing 99% of tuples
from consideration. This index needs to be periodically replaced with
one that has a newer timestamp in its predicate. A full table scan
could be avoided if the new index could be built using the previous
partial index. This was on 9.1, so I didn't consider index-only scans,
but on 9.2+ I would add the timestamp to the foreign key index, so the
new index could be constructed using an index-only scan.
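
Roughly what that looks like, with invented names (events, customer_id,
changed_at); the timestamp in the predicate is what has to be bumped
periodically by building a replacement index:

  CREATE INDEX CONCURRENTLY events_customer_recent_idx
      ON events (customer_id)
      WHERE changed_at > '2013-01-01';

  -- the query pattern it serves; the planner can BitmapAnd this with an
  -- index on changed_at:
  SELECT *
    FROM events
   WHERE customer_id = ANY ('{101,102,103}'::bigint[])
     AND changed_at > '2013-03-01';

  -- on 9.2+, adding the timestamp as a second key column means a future
  -- replacement index could, in principle, be built from an index-only scan
  -- of the old one instead of a full table scan:
  CREATE INDEX CONCURRENTLY events_customer_recent_idx2
      ON events (customer_id, changed_at)
      WHERE changed_at > '2013-03-01';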

I have a feeling this is an increasingly widespread pattern with a
proliferation of mobile devices that need syncing.

Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] Using indexes for partial index builds

2013-03-11 Thread Greg Stark
On Thu, Mar 7, 2013 at 12:51 AM, Jim Nasby j...@nasby.net wrote:
 Something worth considering on this... I suspect it's possible to use an
 index-only scan to do this, regardless of whether the heap page is all
 visible. The reason is that the newly created index would just use the same
 access methodology as the original index, so any dead rows would be ignored.

This is actually quite clever. I wonder how many other cases can use
similar logic.

 We'd almost certainly need to block vacuums during the build however.
 Obviously not an issue for regular index builds, but it would be for
 concurrent ones.

Concurrent index builds block vacuums already.



-- 
greg




Re: [HACKERS] Using indexes for partial index builds

2013-03-06 Thread Jim Nasby

On 2/2/13 4:05 AM, Paul Norman wrote:

Hello,
After a discussion on IRC in #postgresql, I had a feature suggestion and it
was suggested I write it up here.

I have a large (200GB, 1.7b rows) table with a number of columns, but the
two of interest here are an hstore column, tags, and a PostGIS geometry
column, geom.

There is a GIN index on tags and a GiST index on geom. These took about
36-48 hours to build in total. Obviously index building on a table this size
is not trivial.

Periodically I want to do a number of specialized queries on objects with a
particular tag or in a particular area. To do this I often want to create a
partial index. For example, I created the index btree ((tags ->
'name_1'::text) text_pattern_ops) WHERE tags ? 'name_1'::text. My
understanding is that to create this index PostgreSQL does a scan of the entire
table, even though the GIN index on tags could be used to identify which
rows could belong in the index. When the WHERE condition selects only a
small portion of the table, this scans a lot more data than is
necessary.

Another case where it would be useful is when I am conducting a detailed
analysis of some aspect of the rows in a particular city. This leads to all
the queries being of the form SELECT ... FROM ... WHERE
is_in_my_area(geom)[1].

My current project is doing analysis involving addresses. The ability to
create an index like btree((tags -> 'addr:housenumber'), (tags ->
'addr:street'), (tags -> 'addr:city')) WHERE is_in_my_area(geom) in a
reasonable time would allow me to use a view instead of copying the local
area to a temporary table and indexing that table. The local area is about
350k rows, or about 0.02% of the database.

[1] The actual function for determining if it's in my area is long and not
really essential to the point here.


Something worth considering on this... I suspect it's possible to use an 
index-only scan to do this, regardless of whether the heap page is all visible. 
The reason is that the newly created index would just use the same access 
methodology as the original index, so any dead rows would be ignored.

We'd almost certainly need to block vacuums during the build however. Obviously 
not an issue for regular index builds, but it would be for concurrent ones.




[HACKERS] Using indexes for partial index builds

2013-02-02 Thread Paul Norman
Hello,
After a discussion on IRC in #postgresql, I had a feature suggestion and it
was suggested I write it up here.

I have a large (200GB, 1.7b rows) table with a number of columns, but the
two of interest here are an hstore column, tags, and a PostGIS geometry
column, geom.

There is a GIN index on tags and a GiST index on geom. These took about
36-48 hours to build in total. Obviously index building on a table this size
is not trivial.

Periodically I want to do a number of specialized queries on objects with a
particular tag or in a particular area. To do this I often want to create a
partial index. For example, I created the index btree ((tags ->
'name_1'::text) text_pattern_ops) WHERE tags ? 'name_1'::text. My
understanding is that to create this index PostgreSQL does a scan of the entire
table, even though the GIN index on tags could be used to identify which
rows could belong in the index. When the WHERE condition selects only a
small portion of the table, this scans a lot more data than is
necessary.
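
Spelled out, with an invented table name (-> and ? being the hstore fetch and
key-exists operators), the index in question is:

  CREATE INDEX osm_name_1_idx
      ON planet_osm
   USING btree ((tags -> 'name_1'::text) text_pattern_ops)
   WHERE tags ? 'name_1';

The feature request is essentially to let this CREATE INDEX drive its scan off
the existing GIN index on tags (which can answer tags ? 'name_1') instead of
reading all 1.7b rows.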

Another case where it would be useful is when I am conducting a detailed
analysis of some aspect of the rows in a particular city. This leads to all
the queries being of the form SELECT ... FROM ... WHERE
is_in_my_area(geom)[1]. 

My current project is doing analysis involving addresses. The ability to
create an index like btree((tags -> 'addr:housenumber'), (tags ->
'addr:street'), (tags -> 'addr:city')) WHERE is_in_my_area(geom) in a
reasonable time would allow me to use a view instead of copying the local
area to a temporary table and indexing that table. The local area is about
350k rows, or about 0.02% of the database.

[1] The actual function for determining if it's in my area is long and not
really essential to the point here.
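
And the address case, again with invented object names; is_in_my_area() would
have to be marked IMMUTABLE to be usable in an index predicate:

  CREATE INDEX my_area_addr_idx
      ON planet_osm
   USING btree ((tags -> 'addr:housenumber'),
                (tags -> 'addr:street'),
                (tags -> 'addr:city'))
   WHERE is_in_my_area(geom);

  -- with that index, a simple view over the same predicate can replace the
  -- temporary copy of the local area:
  CREATE VIEW my_area AS
      SELECT *
        FROM planet_osm
       WHERE is_in_my_area(geom);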



