Re: [PERFORM] What is the best way to do attribute/values?
Josh Berkus wrote: Things we've already tried to avoid going over old ground: 1) increasing statistics; 2) increasing sort_mem (to 256MB, which is overkill) 3) testing on 8.0 beta, which does not affect the issue. At this point I'm looking for ideas. Suggestions, anyone? with respect to query design: consider instead of: select pav1.person_id from person_attributes_vertical pav1, person_attributes_vertical pav2 where pav1.attribute_id = 1 and pav1.value_id in (2,3) and pav2.attribute_id = 2 and pav2.value_id in (2,3) and pav1.person_id = pav2.person_id try: select pav1.person_id from person_attributes_vertical pav1 where (pav1.attribute_id = 1 and pav1.value_id in (2,3)) or (pav1.attribute_id = 2 and pav1.value_id in (2,3)) I am gambling that the 'or's' might be less expensive than the multiple self joins (particularly in the more general cases!). To make access work well you might want to have *several* concatenated indexes of 2 - 4 attributes - to work around Pg inability to use more than 1 in a given query. For this query indexing (attribute_id, value_id) is probably good. Consider playing with 'random_page_cost' and maybe 'effective_cache_size' to encourage the planner to use 'em. regards Mark ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [PERFORM] What is the best way to do attribute/values?
On Aug 25, 2004, at 4:22 AM, Mark Kirkwood wrote: select pav1.person_id from person_attributes_vertical pav1 where (pav1.attribute_id = 1 and pav1.value_id in (2,3)) or (pav1.attribute_id = 2 and pav1.value_id in (2,3)) You know.. It may help if you toss in a group by ie select pav1.person_id, count(*) from person_attributes_vertical pav1 where (pav1.attribute_id = 1 and pav1.value_id in (2,3)) or ( ... ) or (...) group by pav1.person_id order by count(*) desc that should give you the person_id's that matched the most criteria I've used similar things before now that I've thought about it. If you want an exact match you could put having count(*) = $myNumAttributes in there too.. By definition an exact match would match that definition.. it has an added side effect of producing closest matches when an exact match cannot be found... granted you may not want that for a dating site : ) You asked for a blond female, blue eyes.. but I couldn't find any... but I *DID* find a brown haired male with brown eyes! Is that good enough? -- Jeff Trout [EMAIL PROTECTED] http://www.jefftrout.com/ http://www.stuarthamm.net/ ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [PERFORM] What is the best way to do attribute/values?
Hi, On Aug 25, 2004, at 4:22 AM, Mark Kirkwood wrote: select pav1.person_id from person_attributes_vertical pav1 where (pav1.attribute_id = 1 and pav1.value_id in (2,3)) or (pav1.attribute_id = 2 and pav1.value_id in (2,3)) [...] Why not combine attribute_id and value_id? Then you have nothing but an OR (or IN). It should, AFAICS, give you much better selectivity on your indexes: There will be a lot of attributes with the same ID; there will also be a lot of attributes with the same value. However, there should be much less attributes with a specific combination of (ID/Value). Right now I think it will be very hard to determine which field has a better selectivity: attribute_id or value_id. The combined attribute/value field could be an int8 or so, where the upper 4 bytes are for attribute_id and the lower 4 bytes for value_id. Depending on the number of attributes and possible values a smaller datatype and / or a different split can be made. A smaller datatype will result in faster access. What difference does that make? regards, --Tim ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [PERFORM] What is the best way to do attribute/values?
Mark, Tim, select pav1.person_id from person_attributes_vertical pav1 where (pav1.attribute_id = 1 and pav1.value_id in (2,3)) or (pav1.attribute_id = 2 and pav1.value_id in (2,3)) Not the same query, sorry. Daniel's query yields all the person_id's which have criteria A AND criteria B. Yours gives all the person_id's which have criteria A OR criteria B. There will be a lot of attributes with the same ID; there will also be a lot of attributes with the same value. However, there should be much less attributes with a specific combination of (ID/Value). Right now I think it will be very hard to determine which field has a better selectivity: attribute_id or value_id. Given that there is already an index on ( attribute_id, value_id ) I don't quite see what difference this makes. Unless you're suggesting this as a workaround for the PG Planner's poor use of the index? -- Josh Berkus Aglio Database Solutions San Francisco ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [PERFORM] What is the best way to do attribute/values?
Josh Berkus wrote: Mark, Tim, select pav1.person_id from person_attributes_vertical pav1 where (pav1.attribute_id = 1 and pav1.value_id in (2,3)) or (pav1.attribute_id = 2 and pav1.value_id in (2,3)) Not the same query, sorry. Daniel's query yields all the person_id's which have criteria A AND criteria B. Yours gives all the person_id's which have criteria A OR criteria B. Apologies, not thinking clearly enough there... Maybe try out intersection : select pav1.person_id from person_attributes_vertical pav1 where (pav1.attribute_id = 1 and pav1.value_id in (2,3)) intersect select pav1.person_id from person_attributes_vertical pav1 where (pav1.attribute_id = 2 and pav1.value_id in (2,3)) In the advent that is unhelpful, I wonder about simplifying the situation and investigating how select pav1.person_id from person_attributes_vertical pav1 where pav1.attribute_id = 1 performs, compared to select pav1.person_id from person_attributes_vertical pav1 where (pav1.attribute_id = 1 and pav1.value_id in (2,3)) If the first performs ok and the second does not, It may be possible to get better times by doing some horrible re-writes :e.g: select pav1.person_id from person_attributes_vertical pav1 where (pav1.attribute_id = 1 and pav1.value_id||null in (2,3)) etc. regards Mark ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
[PERFORM] What is the best way to do attribute/values?
Hi list, I have a database with 1M people in it. Each person has about 20 attributes, such as height, weight, eye color, etc. I need to be able to search for people based on these attributes. A search can be conducted on one attribute, all attributes, or any number in between. How would _you_ do this? I have already attempted to answer this. My attempts are detailed here: http://sh.nu/email.txt This is the email I was originally going to send to this list. Since it's so large, I decided to link to it instead. If you feel that it belongs in a post to the list, let me know, and I'll post again. I've discussed these attempts with people in #postgresql on irc.freenode.net. Agliodbs (I presume you know who this is) was very helpful, but in end was at a loss. I find myself in the same postition at this time. He suggested I contact this list. My ultimate goal is performance. This _must_ be fast. And by fast, I mean, 1 second, for every permutation of the number of attributes searched for. Flexibility would be a bonus, but at this point I'll settle for something that's harder to maintain if I can get the speed gain I need. Thanks, Daniel Ceregatti ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [PERFORM] What is the best way to do attribute/values?
Folks, I've discussed these attempts with people in #postgresql on irc.freenode.net. Agliodbs (I presume you know who this is) was very helpful, but in end was at a loss. I find myself in the same postition at this time. He suggested I contact this list. There's a couple of issues here to attack: 1) PostgreSQL is not using the most optimal plan.First, it's ignoring the fact that all referenced columns are indexed and only using the first column, then filtering based on the other criteria. Second, testing has shown that a hash join would actually be faster. We've tried upping the statistics, but it doesn't seem to have an effect on the planner's erroneous estimates. 2) Even were it using the most optimal plan, it's still to slow. As you can see from the plan, each merge join takes about 1.5 to 2 seconds.(hash joins are only about 0.5 seconds slower). Mysteriously, a big chunk of this time is spent *in bewtween* planner steps, as if there was some hold-up in retrieving the index or table pages. There may be, but Daniel and I have not been able to diagnose the cause. It's particularly mysterious since a filter-and-sort on a *single* criteria set, without join, takes 400ms. Things we've already tried to avoid going over old ground: 1) increasing statistics; 2) increasing sort_mem (to 256MB, which is overkill) 3) testing on 8.0 beta, which does not affect the issue. At this point I'm looking for ideas. Suggestions, anyone? -- Josh Berkus Aglio Database Solutions San Francisco ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [PERFORM] What is the best way to do attribute/values?
Daniel Ceregatti wrote: Hi list, I have a database with 1M people in it. Each person has about 20 attributes, such as height, weight, eye color, etc. I need to be able to search for people based on these attributes. A search can be conducted on one attribute, all attributes, or any number in between. How would _you_ do this? I have already attempted to answer this. My attempts are detailed here: http://sh.nu/email.txt Hmm... interesting. Shot in the dark - try a tsearch2 full-text index. Your problem could be translated into searching strings of the form hair=black eyes=blue age=117 Not pretty, but might give you the speed you want. -- Richard Huxton Archonet Ltd ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly