Re: [GENERAL] ltree + gist index performance degrades significantly over a night
--- Bernhard Weisshuhn [EMAIL PROTECTED] wrote: On Mon, Feb 27, 2006 at 10:27:20AM -0800, CG [EMAIL PROTECTED] wrote: [...] I'd need to see if the space required for the varchar+btree tables are comparible, better, or worse than the ltree+gist tables with regards to size. Please test this, I'm guessing (hoping actually) that having bazillions of combinations of 26 (or so) characters (ltree labels) might be consuming less space than having bazillions of substings in the database. Or maybe some clever combination of both approaches? If you find out something interesting, please let me know. Performance using varchar+btree, breaking up the string into distinct letter groups = 3 chars is slightly better. Size of the varchar search vector table table is much bigger.. Most of my fields are about 15-25 characters in length. Expect even bigger tables for longer fields. The size of the btree index is less. The time to bootstrap the data into the tables was significantly longer. I used two triggers, one that normalized the search field before insert, and another that inserted a breakdown row after the insert row. There's a recursive effect built-in to get down to the smallest unique element. I'm sticking with ltree and setting up a vacuum analyze on a cron to keep the searches snappy. Hope that helps you with your project! CG __ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [GENERAL] ltree + gist index performance degrades significantly over a night
Tsearch2 searches for whole words, and is designed with language in mind, yes? I'm looking for consecutive characters in words or serial numbers, etc. As for support, the same guys who wrote Tsearch2 wrote ltree. Can't go wrong there! Here's the solution to this problem: As usual, operator error. :( For some reason I thought it would be a good idea to cluster the table on the item_id index... What in the world was I thinking? When I clustered the search table on the search_vector index (which makes the most sense, yes?) it seemed to bring actual performance in-line with the type of performance I imagined that I would receive. I could probably get even better performance out of the table, at the cost of a significant increase in table and index size, by chopping up the columns into smaller chunks. Hello World would yield 'h.e.l.l.o.w.o.r.l.d' 'e.l.l.o.w.o.r.l.d' 'l.l.o.w.o.r.l.d' 'l.o.w.o.r.l.d' 'o.w.o.r.l.d' 'w.o.r.l.d' 'o.r.l.d' 'r.l.d' and using a wildcard search search_vector ~ 'o.r.l.*' would jump right to the vectors which start with o.r.l ... Thanks for all the responses! They did get my head pointed in the right direction. CG --- Jim C. Nasby [EMAIL PROTECTED] wrote: On Fri, Feb 24, 2006 at 09:02:04AM -0800, CG wrote: I have a search table which I use for partial-match text searches: snip For a string Hello World the ltree is created like 'h.e.l.l.o.w.o.r.l.d' ... If I wanted to find all rows with orl in them i would construct an lquery like '*.o.r.l.*' and use the ~ operator in the where clause. I would link to the table items by the item_id ... Is there some reason you can't use tsearch2? I suspect it would probably work better; if nothing else you'd probably get better support since a lot more people use it. -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.comwork: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461 __ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [GENERAL] ltree + gist index performance degrades significantly over a night
On Mon, Feb 27, 2006 at 09:14:40AM -0800, CG [EMAIL PROTECTED] wrote: I could probably get even better performance out of the table, at the cost of a significant increase in table and index size, by chopping up the columns into smaller chunks. Hello World would yield 'h.e.l.l.o.w.o.r.l.d' 'e.l.l.o.w.o.r.l.d' 'l.l.o.w.o.r.l.d' 'l.o.w.o.r.l.d' 'o.w.o.r.l.d' 'w.o.r.l.d' 'o.r.l.d' 'r.l.d' and using a wildcard search search_vector ~ 'o.r.l.*' would jump right to the vectors which start with o.r.l ... But with this approch you'd be fine with a normal varchar_ops btree index for textfields and searching using like 'world%', wouldn't you? Or is the ltree approch more efficient? I'm not trying to be smart-assed, it's a naive question, since I'm looking for an efficient substring search solution in postgresql myself. regards, bkw ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [GENERAL] ltree + gist index performance degrades significantly over a night
That would do the job, wouldn't it? :) I don't think it's a naive question at all. Its quite a good question, and the solution you suggest is a good option to have, and would probably work better than the single-vector ltree index for simple substring matching. In my case, the ltree+gist index table actually contains more pages than the table of data itself. I'd need to see if the space required for the varchar+btree tables are comparible, better, or worse than the ltree+gist tables with regards to size. Now that I think about it, building substrings out of ltree nodes would be incredible overkill comapred to the effetiveness of the varchar+btree. The extra advantages of ltree are the ability to match and extract nodes in a path based not only on contents but also proximity, and aggregate on those characteristics. In my case this might be good for serial numbers where each digit or grouping of digits have special values which would be used to aggregate on. The ltree method was suggested to me a while back when I was frustrated with the performance of like '%something%' ... --- Bernhard Weisshuhn [EMAIL PROTECTED] wrote: On Mon, Feb 27, 2006 at 09:14:40AM -0800, CG [EMAIL PROTECTED] wrote: I could probably get even better performance out of the table, at the cost of a significant increase in table and index size, by chopping up the columns into smaller chunks. Hello World would yield 'h.e.l.l.o.w.o.r.l.d' 'e.l.l.o.w.o.r.l.d' 'l.l.o.w.o.r.l.d' 'l.o.w.o.r.l.d' 'o.w.o.r.l.d' 'w.o.r.l.d' 'o.r.l.d' 'r.l.d' and using a wildcard search search_vector ~ 'o.r.l.*' would jump right to the vectors which start with o.r.l ... But with this approch you'd be fine with a normal varchar_ops btree index for textfields and searching using like 'world%', wouldn't you? Or is the ltree approch more efficient? I'm not trying to be smart-assed, it's a naive question, since I'm looking for an efficient substring search solution in postgresql myself. regards, bkw __ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [GENERAL] ltree + gist index performance degrades significantly over a night
On Mon, Feb 27, 2006 at 10:27:20AM -0800, CG [EMAIL PROTECTED] wrote: [...] I'd need to see if the space required for the varchar+btree tables are comparible, better, or worse than the ltree+gist tables with regards to size. Please test this, I'm guessing (hoping actually) that having bazillions of combinations of 26 (or so) characters (ltree labels) might be consuming less space than having bazillions of substings in the database. Or maybe some clever combination of both approaches? If you find out something interesting, please let me know. regards, bkw ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [GENERAL] ltree + gist index performance degrades significantly over a night
On Fri, Feb 24, 2006 at 09:02:04AM -0800, CG wrote: I have a search table which I use for partial-match text searches: snip For a string Hello World the ltree is created like 'h.e.l.l.o.w.o.r.l.d' ... If I wanted to find all rows with orl in them i would construct an lquery like '*.o.r.l.*' and use the ~ operator in the where clause. I would link to the table items by the item_id ... Is there some reason you can't use tsearch2? I suspect it would probably work better; if nothing else you'd probably get better support since a lot more people use it. -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.comwork: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461 ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
[GENERAL] ltree + gist index performance degrades significantly over a night
PostgreSQL 8.1.3 I'm trying to collect some hard numbers to show just how much it degrades and over how long a time interval. All I have now is anecdotal evidence, and I was hoping to save myself some downtime by seeking advice early. I have a search table which I use for partial-match text searches: CREATE TABLE search ( id int4 NOT NULL DEFAULT nextval('search_id_seq'::regclass), item_id int8 NOT NULL, search_vector ltree NOT NULL, CONSTRAINT search_id_pkey PRIMARY KEY (id), CONSTRAINT search_item_id_fkey FOREIGN KEY (item_id) REFERENCES items (id) MATCH SIMPLE ON UPDATE CASCADE ON DELETE CASCADE ) WITH OIDS; CREATE INDEX lsearch_vector_idx ON search USING gist (search_vector); I have some triggers that insert rows into the search table as rows are inserted into items. I implimented this yesterday, and the immediate effect was a fantastic return time for partial text searches in the sub-second range. By today, these queries take 10 minutes sometimes... There are about 134000 rows in the table. The table gets analyzed nightly. Should the frequency be more? There are about 1000 rows added a day, only about 30 or so rows removed, and nothing is ever updated. There's not that much turnover. The search vectors are built like this: For a string Hello World the ltree is created like 'h.e.l.l.o.w.o.r.l.d' ... If I wanted to find all rows with orl in them i would construct an lquery like '*.o.r.l.*' and use the ~ operator in the where clause. I would link to the table items by the item_id ... What could be making this go so wrong? Is there a better way to accomplish my task? CG __ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [GENERAL] ltree + gist index performance degrades significantly over a night
On Fri, Feb 24, 2006 at 09:02:04AM -0800, CG wrote: PostgreSQL 8.1.3 I'm trying to collect some hard numbers to show just how much it degrades and over how long a time interval. All I have now is anecdotal evidence, and I was hoping to save myself some downtime by seeking advice early. snip I implimented this yesterday, and the immediate effect was a fantastic return time for partial text searches in the sub-second range. By today, these queries take 10 minutes sometimes... There are about 134000 rows in the table. The table gets analyzed nightly. Should the frequency be more? There are about 1000 rows added a day, only about 30 or so rows removed, and nothing is ever updated. There's not that much turnover. That's very odd. Like the other person said, do you vacuum and analyse? But my question is: is it using the index? What does EXPLAIN / EXPLAIN ANALYZE tell you? Have a nice day, -- Martijn van Oosterhout kleptog@svana.org http://svana.org/kleptog/ Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a tool for doing 5% of the work and then sitting around waiting for someone else to do the other 95% so you can sue them. signature.asc Description: Digital signature
Re: [GENERAL] ltree + gist index performance degrades significantly over a night
--- Scott Marlowe [EMAIL PROTECTED] wrote: Are you vacuuming regularly, are your fsm settings high enough, and what does vacuum verbose say? Autovacuum is running, but I do a nightly vacuum analyze. When I just do a vacuum analyze on the table I get: data=# vacuum analyze verbose search; INFO: vacuuming search INFO: index search_id_pkey now contains 1344661 row versions in 5134 pages DETAIL: 9 index row versions were removed. 0 index pages have been deleted, 0 are currently reusable. CPU 0.09s/0.07u sec elapsed 4.91 sec. INFO: index search_vector_idx now contains 1344672 row versions in 47725 pages DETAIL: 9 index row versions were removed. 0 index pages have been deleted, 0 are currently reusable. CPU 1.77s/0.37u sec elapsed 407.55 sec. INFO: index search_item_id_idx now contains 1344690 row versions in 6652 pages DETAIL: 9 index row versions were removed. 0 index pages have been deleted, 0 are currently reusable. CPU 0.24s/0.08u sec elapsed 45.62 sec. INFO: search: removed 9 row versions in 2 pages DETAIL: CPU 0.00s/0.00u sec elapsed 0.05 sec. INFO: letter_search: found 9 removable, 1344661 nonremovable row versions in 33984 pages DETAIL: 0 dead row versions cannot be removed yet. There were 141 unused item pointers. 0 pages are entirely empty. CPU 2.41s/0.62u sec elapsed 483.06 sec. INFO: vacuuming pg_toast.pg_toast_174918394 INFO: index pg_toast_174918394_index now contains 0 row versions in 1 pages DETAIL: 0 index pages have been deleted, 0 are currently reusable. CPU 0.00s/0.00u sec elapsed 0.01 sec. INFO: pg_toast_174918394: found 0 removable, 0 nonremovable row versions in 0 pages DETAIL: 0 dead row versions cannot be removed yet. There were 0 unused item pointers. 0 pages are entirely empty. CPU 0.00s/0.00u sec elapsed 0.01 sec. INFO: analyzing search INFO: search: scanned 3000 of 33984 pages, containing 119035 live rows and 0 dead rows; 3000 rows in sample, 1348428 estimated total rows VACUUM max_fsm_pages = 6 # min max_fsm_relations*16, 6 bytes each max_fsm_relations = 2000# min 100, ~70 bytes each __ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [GENERAL] ltree + gist index performance degrades significantly over a night
--- Martijn van Oosterhout kleptog@svana.org wrote: That's very odd. Like the other person said, do you vacuum and analyse? But my question is: is it using the index? What does EXPLAIN / EXPLAIN ANALYZE tell you? data=# explain select * from search where search_vector ~ '*.o.r.l.*'::lquery; QUERY PLAN --- Bitmap Heap Scan on search (cost=53.71..4566.65 rows=1345 width=161) Recheck Cond: (search_vector ~ '*.o.r.l.*'::lquery) - Bitmap Index Scan on search_vector_idx (cost=0.00..53.71 rows=1345 width=0) Index Cond: (search_vector ~ '*.o.r.l.*'::lquery) (4 rows) data=# explain analyze select * from search where search_vector ~ '*.o.r.l.*'::lquery; QUERY PLAN -- Bitmap Heap Scan on search (cost=53.71..4566.65 rows=1345 width=161) (actual time=183684.156..196997.278 rows=1655 loops=1) Recheck Cond: (search_vector ~ '*.o.r.l.*'::lquery) - Bitmap Index Scan on search_vector_idx (cost=0.00..53.71 rows=1345 width=0) (actual time=183683.857..183683.857 rows=1655 loops=1) Index Cond: (search_vector ~ '*.o.r.l.*'::lquery) Total runtime: 197000.061 ms (5 rows) I appreciate you taking the time to help me out. Thank you all. __ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [GENERAL] ltree + gist index performance degrades significantly over a night
CG [EMAIL PROTECTED] writes: INFO: index search_vector_idx now contains 1344672 row versions in 47725 pages INFO: letter_search: found 9 removable, 1344661 nonremovable row versions in 33984 pages Yikes ... the index is bigger than the table! Seems like you've not been vacuuming enough, or else gist has got a major bloat problem. If you REINDEX, does the index get materially smaller? regards, tom lane ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [GENERAL] ltree + gist index performance degrades significantly over a night
On Fri, Feb 24, 2006 at 09:44:37AM -0800, CG wrote: --- Martijn van Oosterhout kleptog@svana.org wrote: That's very odd. Like the other person said, do you vacuum and analyse? But my question is: is it using the index? What does EXPLAIN / EXPLAIN ANALYZE tell you? data=# explain select * from search where search_vector ~ '*.o.r.l.*'::lquery; snip data=# explain analyze select * from search where search_vector ~ '*.o.r.l.*'::lquery; snip Total runtime: 197000.061 ms Ouch! The index is obviously not very good in this case. Unfortunatly it's not clear where the slowdown is. You'd probably need to recompile postgresql with profiling to find exactly where it's going wrong. Quick test though, if you disable the index (set enable_indexscan=off) so it does a seqscan, is it faster or slower? By how much? i.e. is it helping much. Have a nice day, -- Martijn van Oosterhout kleptog@svana.org http://svana.org/kleptog/ Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a tool for doing 5% of the work and then sitting around waiting for someone else to do the other 95% so you can sue them. signature.asc Description: Digital signature