Re: [GENERAL] ltree + gist index performance degrades significantly over a night

2006-02-28 Thread CG


--- Bernhard Weisshuhn [EMAIL PROTECTED] wrote:

 On Mon, Feb 27, 2006 at 10:27:20AM -0800, CG [EMAIL PROTECTED] wrote:
 
  [...] I'd need to see if the space required for the varchar+btree tables
 are
  comparible, better, or worse than the ltree+gist tables with regards to
 size.
 
 Please test this, I'm guessing (hoping actually) that having bazillions of
 combinations of 26 (or so) characters (ltree labels) might be consuming
 less space than having bazillions of substings in the database.
 
 Or maybe some clever combination of both approaches?
 
 If you find out something interesting, please let me know.

Performance using varchar+btree, breaking up the string into distinct letter
groups = 3 chars is slightly better. Size of the varchar search vector table
table is much bigger.. Most of my fields are about 15-25 characters in length.
Expect even bigger tables for longer fields. The size of the btree index is
less. The time to bootstrap the data into the tables was significantly longer.
I used two triggers, one that normalized the search field before insert, and
another that inserted a breakdown row after the insert row. There's a recursive
effect built-in to get down to the smallest unique element.

I'm sticking with ltree and setting up a vacuum analyze on a cron to keep the
searches snappy. Hope that helps you with your project!

CG

__
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [GENERAL] ltree + gist index performance degrades significantly over a night

2006-02-27 Thread CG
Tsearch2 searches for whole words, and is designed with language in mind, yes? 

I'm looking for consecutive characters in words or serial numbers, etc. 

As for support, the same guys who wrote Tsearch2 wrote ltree. Can't go wrong
there!

Here's the solution to this problem: As usual, operator error. :(

For some reason I thought it would be a good idea to cluster the table on the
item_id index... What in the world was I thinking? When I clustered the search
table on the search_vector index (which makes the most sense, yes?) it seemed
to bring actual performance in-line with the type of performance I imagined
that I would receive.

I could probably get even better performance out of the table, at the cost of a
significant increase in table and index size, by chopping up the columns into
smaller chunks.

Hello World would yield

'h.e.l.l.o.w.o.r.l.d'
'e.l.l.o.w.o.r.l.d'
'l.l.o.w.o.r.l.d'
'l.o.w.o.r.l.d'
'o.w.o.r.l.d'
'w.o.r.l.d'
'o.r.l.d'
'r.l.d'

and using a wildcard search search_vector ~ 'o.r.l.*' would jump right to the
vectors which start with o.r.l ... 

Thanks for all the responses! They did get my head pointed in the right
direction.

CG

--- Jim C. Nasby [EMAIL PROTECTED] wrote:

 On Fri, Feb 24, 2006 at 09:02:04AM -0800, CG wrote:
  I have a search table which I use for partial-match text searches:
 snip
  For a string Hello World the ltree is created like 'h.e.l.l.o.w.o.r.l.d'
 ...
  If I wanted to find all rows with orl in them i would construct an lquery
  like '*.o.r.l.*' and use the ~ operator in the where clause. I would link
 to
  the table items by the item_id ... 
 
 Is there some reason you can't use tsearch2? I suspect it would probably
 work better; if nothing else you'd probably get better support since a
 lot more people use it.
 -- 
 Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
 Pervasive Software  http://pervasive.comwork: 512-231-6117
 vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461
 



__
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [GENERAL] ltree + gist index performance degrades significantly over a night

2006-02-27 Thread Bernhard Weisshuhn
On Mon, Feb 27, 2006 at 09:14:40AM -0800, CG [EMAIL PROTECTED] wrote:

 I could probably get even better performance out of the table, at the cost of 
 a
 significant increase in table and index size, by chopping up the columns into
 smaller chunks.
 
 Hello World would yield
 
 'h.e.l.l.o.w.o.r.l.d'
 'e.l.l.o.w.o.r.l.d'
 'l.l.o.w.o.r.l.d'
 'l.o.w.o.r.l.d'
 'o.w.o.r.l.d'
 'w.o.r.l.d'
 'o.r.l.d'
 'r.l.d'
 
 and using a wildcard search search_vector ~ 'o.r.l.*' would jump right to 
 the
 vectors which start with o.r.l ... 

But with this approch you'd be fine with a normal varchar_ops btree index
for textfields and searching using like 'world%', wouldn't you?
Or is the ltree approch more efficient?

I'm not trying to be smart-assed, it's a naive question, since I'm
looking for an efficient substring search solution in postgresql myself.

regards,
bkw


---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [GENERAL] ltree + gist index performance degrades significantly over a night

2006-02-27 Thread CG

That would do the job, wouldn't it? :)

I don't think it's a naive question at all. Its quite a good question, and the
solution you suggest is a good option to have, and would probably work better
than the single-vector ltree index for simple substring matching. In my case,
the ltree+gist index table actually contains more pages than the table of data
itself. I'd need to see if the space required for the varchar+btree tables are
comparible, better, or worse than the ltree+gist tables with regards to size.
Now that I think about it, building substrings out of ltree nodes would be
incredible overkill comapred to the effetiveness of the varchar+btree.

The extra advantages of ltree are the ability to match and extract nodes in a
path based not only on contents but also proximity, and aggregate on those
characteristics. 

In my case this might be good for serial numbers where each digit or grouping
of digits have special values which would be used to aggregate on.

The ltree method was suggested to me a while back when I was frustrated with
the performance of like '%something%' ... 



--- Bernhard Weisshuhn [EMAIL PROTECTED] wrote:

 On Mon, Feb 27, 2006 at 09:14:40AM -0800, CG [EMAIL PROTECTED] wrote:
 
  I could probably get even better performance out of the table, at the cost
 of a
  significant increase in table and index size, by chopping up the columns
 into
  smaller chunks.
  
  Hello World would yield
  
  'h.e.l.l.o.w.o.r.l.d'
  'e.l.l.o.w.o.r.l.d'
  'l.l.o.w.o.r.l.d'
  'l.o.w.o.r.l.d'
  'o.w.o.r.l.d'
  'w.o.r.l.d'
  'o.r.l.d'
  'r.l.d'
  
  and using a wildcard search search_vector ~ 'o.r.l.*' would jump right to
 the
  vectors which start with o.r.l ... 
 
 But with this approch you'd be fine with a normal varchar_ops btree index
 for textfields and searching using like 'world%', wouldn't you?
 Or is the ltree approch more efficient?
 
 I'm not trying to be smart-assed, it's a naive question, since I'm
 looking for an efficient substring search solution in postgresql myself.
 
 regards,
 bkw
 
 


__
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [GENERAL] ltree + gist index performance degrades significantly over a night

2006-02-27 Thread Bernhard Weisshuhn
On Mon, Feb 27, 2006 at 10:27:20AM -0800, CG [EMAIL PROTECTED] wrote:

 [...] I'd need to see if the space required for the varchar+btree tables are
 comparible, better, or worse than the ltree+gist tables with regards to size.

Please test this, I'm guessing (hoping actually) that having bazillions of
combinations of 26 (or so) characters (ltree labels) might be consuming
less space than having bazillions of substings in the database.

Or maybe some clever combination of both approaches?

If you find out something interesting, please let me know.

regards,
bkw


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [GENERAL] ltree + gist index performance degrades significantly over a night

2006-02-25 Thread Jim C. Nasby
On Fri, Feb 24, 2006 at 09:02:04AM -0800, CG wrote:
 I have a search table which I use for partial-match text searches:
snip
 For a string Hello World the ltree is created like 'h.e.l.l.o.w.o.r.l.d' ...
 If I wanted to find all rows with orl in them i would construct an lquery
 like '*.o.r.l.*' and use the ~ operator in the where clause. I would link to
 the table items by the item_id ... 

Is there some reason you can't use tsearch2? I suspect it would probably
work better; if nothing else you'd probably get better support since a
lot more people use it.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


[GENERAL] ltree + gist index performance degrades significantly over a night

2006-02-24 Thread CG
PostgreSQL 8.1.3

I'm trying to collect some hard numbers to show just how much it degrades and
over how long a time interval.

All I have now is anecdotal evidence, and I was hoping to save myself some
downtime by seeking advice early.

I have a search table which I use for partial-match text searches:

CREATE TABLE search
(
  id int4 NOT NULL DEFAULT nextval('search_id_seq'::regclass),
  item_id int8 NOT NULL,
  search_vector ltree NOT NULL,
  CONSTRAINT search_id_pkey PRIMARY KEY (id),
  CONSTRAINT search_item_id_fkey FOREIGN KEY (item_id)
  REFERENCES items (id) MATCH SIMPLE
  ON UPDATE CASCADE ON DELETE CASCADE
) 
WITH OIDS;

CREATE INDEX lsearch_vector_idx
  ON search
  USING gist
  (search_vector);

I have some triggers that insert rows into the search table as rows are
inserted into items.

I implimented this yesterday, and the immediate effect was a fantastic return
time for partial text searches in the sub-second range. By today, these queries
take 10 minutes sometimes... There are about 134000 rows in the table.

The table gets analyzed nightly. Should the frequency be more? There are about
1000 rows added a day, only about 30 or so rows removed, and nothing is ever
updated. There's not that much turnover.

The search vectors are built like this:

For a string Hello World the ltree is created like 'h.e.l.l.o.w.o.r.l.d' ...
If I wanted to find all rows with orl in them i would construct an lquery
like '*.o.r.l.*' and use the ~ operator in the where clause. I would link to
the table items by the item_id ... 

What could be making this go so wrong? Is there a better way to accomplish my
task?

CG

__
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [GENERAL] ltree + gist index performance degrades significantly over a night

2006-02-24 Thread Martijn van Oosterhout
On Fri, Feb 24, 2006 at 09:02:04AM -0800, CG wrote:
 PostgreSQL 8.1.3
 
 I'm trying to collect some hard numbers to show just how much it degrades and
 over how long a time interval.
 
 All I have now is anecdotal evidence, and I was hoping to save myself some
 downtime by seeking advice early.

snip

 I implimented this yesterday, and the immediate effect was a fantastic return
 time for partial text searches in the sub-second range. By today, these 
 queries
 take 10 minutes sometimes... There are about 134000 rows in the table.
 
 The table gets analyzed nightly. Should the frequency be more? There are about
 1000 rows added a day, only about 30 or so rows removed, and nothing is ever
 updated. There's not that much turnover.

That's very odd. Like the other person said, do you vacuum and analyse?
But my question is: is it using the index? What does EXPLAIN / EXPLAIN
ANALYZE tell you?

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
 tool for doing 5% of the work and then sitting around waiting for someone
 else to do the other 95% so you can sue them.


signature.asc
Description: Digital signature


Re: [GENERAL] ltree + gist index performance degrades significantly over a night

2006-02-24 Thread CG
--- Scott Marlowe [EMAIL PROTECTED] wrote:

 Are you vacuuming regularly, are your fsm settings high enough, and what
 does vacuum verbose say?

Autovacuum is running, but I do a nightly vacuum analyze. When I just do a
vacuum analyze on the table I get:

data=# vacuum analyze verbose search;
INFO:  vacuuming search
INFO:  index search_id_pkey now contains 1344661 row versions in 5134 pages
DETAIL:  9 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.09s/0.07u sec elapsed 4.91 sec.
INFO:  index search_vector_idx now contains 1344672 row versions in 47725
pages
DETAIL:  9 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 1.77s/0.37u sec elapsed 407.55 sec.
INFO:  index search_item_id_idx now contains 1344690 row versions in 6652
pages
DETAIL:  9 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.24s/0.08u sec elapsed 45.62 sec.
INFO:  search: removed 9 row versions in 2 pages
DETAIL:  CPU 0.00s/0.00u sec elapsed 0.05 sec.
INFO:  letter_search: found 9 removable, 1344661 nonremovable row versions in
33984 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 141 unused item pointers.
0 pages are entirely empty.
CPU 2.41s/0.62u sec elapsed 483.06 sec.
INFO:  vacuuming pg_toast.pg_toast_174918394
INFO:  index pg_toast_174918394_index now contains 0 row versions in 1 pages
DETAIL:  0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.01 sec.
INFO:  pg_toast_174918394: found 0 removable, 0 nonremovable row versions in
0 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.01 sec.
INFO:  analyzing search
INFO:  search: scanned 3000 of 33984 pages, containing 119035 live rows and 0
dead rows; 3000 rows in sample, 1348428 estimated total rows
VACUUM

max_fsm_pages = 6   # min max_fsm_relations*16, 6 bytes each
max_fsm_relations = 2000# min 100, ~70 bytes each




__
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [GENERAL] ltree + gist index performance degrades significantly over a night

2006-02-24 Thread CG


--- Martijn van Oosterhout kleptog@svana.org wrote:

 That's very odd. Like the other person said, do you vacuum and analyse?
 But my question is: is it using the index? What does EXPLAIN / EXPLAIN
 ANALYZE tell you?

data=# explain select * from search where search_vector ~ '*.o.r.l.*'::lquery;
QUERY PLAN
---
 Bitmap Heap Scan on search  (cost=53.71..4566.65 rows=1345 width=161)
   Recheck Cond: (search_vector ~ '*.o.r.l.*'::lquery)
   -  Bitmap Index Scan on search_vector_idx  (cost=0.00..53.71 rows=1345
width=0)
 Index Cond: (search_vector ~ '*.o.r.l.*'::lquery)
(4 rows)

data=# explain analyze select * from search where search_vector ~
'*.o.r.l.*'::lquery;
QUERY PLAN
--
 Bitmap Heap Scan on search  (cost=53.71..4566.65 rows=1345 width=161) (actual
time=183684.156..196997.278 rows=1655 loops=1)
   Recheck Cond: (search_vector ~ '*.o.r.l.*'::lquery)
   -  Bitmap Index Scan on search_vector_idx  (cost=0.00..53.71 rows=1345
width=0) (actual time=183683.857..183683.857 rows=1655 loops=1)
 Index Cond: (search_vector ~ '*.o.r.l.*'::lquery)
 Total runtime: 197000.061 ms
(5 rows)

I appreciate you taking the time to help me out. Thank you all.


__
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [GENERAL] ltree + gist index performance degrades significantly over a night

2006-02-24 Thread Tom Lane
CG [EMAIL PROTECTED] writes:
 INFO:  index search_vector_idx now contains 1344672 row versions in 47725 
 pages
 INFO:  letter_search: found 9 removable, 1344661 nonremovable row versions 
 in 33984 pages

Yikes ... the index is bigger than the table!  Seems like you've not
been vacuuming enough, or else gist has got a major bloat problem.
If you REINDEX, does the index get materially smaller?

regards, tom lane

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [GENERAL] ltree + gist index performance degrades significantly over a night

2006-02-24 Thread Martijn van Oosterhout
On Fri, Feb 24, 2006 at 09:44:37AM -0800, CG wrote:
 
 
 --- Martijn van Oosterhout kleptog@svana.org wrote:
 
  That's very odd. Like the other person said, do you vacuum and analyse?
  But my question is: is it using the index? What does EXPLAIN / EXPLAIN
  ANALYZE tell you?
 
 data=# explain select * from search where search_vector ~ '*.o.r.l.*'::lquery;

snip

 data=# explain analyze select * from search where search_vector ~
 '*.o.r.l.*'::lquery;

snip
  Total runtime: 197000.061 ms

Ouch! The index is obviously not very good in this case. Unfortunatly
it's not clear where the slowdown is. You'd probably need to recompile
postgresql with profiling to find exactly where it's going wrong.

Quick test though, if you disable the index (set enable_indexscan=off)
so it does a seqscan, is it faster or slower? By how much? i.e. is it
helping much.

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
 tool for doing 5% of the work and then sitting around waiting for someone
 else to do the other 95% so you can sue them.


signature.asc
Description: Digital signature