Re: [Zope-dev] catalog performance: query plan
On Sun, Nov 9, 2008 at 19:58, Roché Compaan [EMAIL PROTECTED] wrote: Since I'm in full agreement that we need to fix indexes that are problematic, I started doing some benchmarks on the large data set that gave us so many headaches. It is probably not surprising that the more complex indexes are performing badly. DateRangeIndex, KeywordIndex and Plone's ExtendedPathIndex performed the worst. Below are some stats showing timings around the apply_index call in Catalog.py that was done while testing the application with real data: ExtendedPathIndex doesn't need fixing, but we need to stop using it. It's done to support navigation trees from the catalog, but navigation should not be done via the same catalog as you do other things, but a dedicated tool. That would simplify and speed things up a lot. But OK, that's off-topic. -- Lennart Regebro: Zope and Plone consulting. http://www.colliberty.com/ +33 661 58 14 64 ___ Zope-Dev maillist - Zope-Dev@zope.org http://mail.zope.org/mailman/listinfo/zope-dev ** No cross posts or HTML encoding! ** (Related lists - http://mail.zope.org/mailman/listinfo/zope-announce http://mail.zope.org/mailman/listinfo/zope )
Re: [Zope-dev] catalog performance: query plan
On Sun, Nov 9, 2008 at 22:29, Matt Hamilton [EMAIL PROTECTED] wrote: Lennart Regebro regebro at gmail.com writes: I would be interested in seeing a bunch of Gurus sit down at some sprint and trying to come up with a catalog engine that is incremental and uses query plans. There is no reason that would not be stupidly fast. :) We can then make a new catalog that uses this engine but has the same API as the old one, to ship with some future version of Zope, say 2.12. There is the Plone Performance sprint we are hosting in Bristol, UK on the 11th - 14th Dec. http://plone.org/events/sprints/bristol-performance-sprint Whilst it is billed as a Plone sprint, of course much of the speedups can be done at the Zope level, so Zope-only developers are more than welcome :) This is exactly the kind of thing that I like hacking on personally, so would love to see it worked on at the sprint. Cool. I do not have time in December though, so some other time. And if we could get Dieter Maurer and Helge Tesdal in on this, as they has experience and understanding of the issues that would be great. That's probably going to take even more planning, so maybe for a future performance sprint somewhere? -- Lennart Regebro: Zope and Plone consulting. http://www.colliberty.com/ +33 661 58 14 64 ___ Zope-Dev maillist - Zope-Dev@zope.org http://mail.zope.org/mailman/listinfo/zope-dev ** No cross posts or HTML encoding! ** (Related lists - http://mail.zope.org/mailman/listinfo/zope-announce http://mail.zope.org/mailman/listinfo/zope )
Re: [Zope-dev] catalog performance: query plan
Hi Tres, Index Name |Type |Avg Time |Calls/second == object_implements|KeywordIndex |0.2172234| 4.6 This is clearly not the same issue as the other KeywordIndexes: in fact, I am astonished that anybody would be using a KeywordIndex for this at all. I would suspect that the real problem here is in the appliation, rather than the index itself. Why? object_implements indexes a list of interface dotted names. Would another type of index be more appropriate? UID |FieldIndex |0.0003070| 3257.1 Note that this is the worst-case scenario for a FieldIndex: there is exactly one value for every key. This shouldn't be indexed at all, in fact, beyond a simple BTree (UID - rid). Good point. I wonder how many places we use a UID index. UID *metadata* is quite important, of course. targetUID|FieldIndex |0.0002287| 4372.12 I don't know what this one is used for, but it should probably be scrapped as well. Me neither ... sounds bogus. Title|ZCTextIndex |0.128|77809.46 This should be removed: there is no valid use case for doing a full-text search restricted only to the title. I'm pretty amazed that this is a ZCTextIndex as well. I always thought it was a FieldIndex. Description |ZCTextIndex |0.116|86241.39 Again, should be removed. Right. getEmail |ZCTextIndex |0.113|87849.05 Should *definitely* be removed: how can you do full-text search on an e-mail address? Surely this is application specific too? I don't think Plone has such an index. SearchableText |TextIndex|0.113|88466.69 Where did this one come from? The 'SearchableText' above is a ZCTextIndex. It certainly is in vanilla Plone. Martin -- Author of `Professional Plone Development`, a book for developers who want to work with Plone. See http://martinaspeli.net/plone-book ___ Zope-Dev maillist - Zope-Dev@zope.org http://mail.zope.org/mailman/listinfo/zope-dev ** No cross posts or HTML encoding! ** (Related lists - http://mail.zope.org/mailman/listinfo/zope-announce http://mail.zope.org/mailman/listinfo/zope )
Re: [Zope-dev] catalog performance: query plan
On Mon, 2008-11-10 at 18:38 +0200, Hedley Roos wrote: Kinda pointless for me to continue since this is turning into a Plone-specific discussion on zope-dev. But at least the whole exercise has forced us to look in detail into how all these indexes affect performance with a zodb with many many objects. Roche investigated Tesdal's queryplan today end it seems to solve nearly all our performance problems. He'll have to elaborate. Well that is not really true. What solved our performance problems is not querying on object_implements and getEffective_or_created. I have previously done benchmarks with query plan and it didn't make any noticeable difference. What might be true or is becoming more likely is that indexes are used where they don't fit the use case rather than that the indexes themselves need optimisation. -- Roché Compaan Upfront Systems http://www.upfrontsystems.co.za ___ Zope-Dev maillist - Zope-Dev@zope.org http://mail.zope.org/mailman/listinfo/zope-dev ** No cross posts or HTML encoding! ** (Related lists - http://mail.zope.org/mailman/listinfo/zope-announce http://mail.zope.org/mailman/listinfo/zope )
Re: [Zope-dev] catalog performance: query plan
On Mon, 2008-11-10 at 11:08 -0500, Tres Seaver wrote: Index Name |Type |Avg Time |Calls/second == object_implements|KeywordIndex |0.2172234| 4.6 This is clearly not the same issue as the other KeywordIndexes: in fact, I am astonished that anybody would be using a KeywordIndex for this at all. I would suspect that the real problem here is in the appliation, rather than the index itself. getEffective_or_creat|DateIndex|0.1941770|5.15 effectiveRange |DateRangeIndex |0.0086295| 115.88 allowedRolesAndUsers |KeywordIndex |0.0069754| 143.36 Hmm, I'm surprised there: what query is being passed to 'apply_index' for this call? Well it is not really performing badly at 6ms? path |ExtendedPathIndex|0.0040614| 246.22 I don't trust the EPI implementation at all. portal_type |FieldIndex |0.0025984| 384.84 This one is surprising: its performance should be pretty similar to the other FieldIndexes (e.g., 'review_state') which map a controlled vocabulary onto the entire corpus. Was the query different than 'review_state' (e.g., multi-valued vs. single-valued)? It's still not bad at 2ms. It has a lot more keys than review_state though. SearchableText |ZCTextIndex |0.0007645| 1308.04 sourceUID|FieldIndex |0.0004886| 2046.31 Probably bogus, but I don't know how it is used. I'm not really worried about indexes beyond this point - they're all returning results in less than a millisecond. Can you provide information on the corpus / configuration / test plan you used to generate these results? It's basically a Plone site with 300,000 remember based users and roughly 150,000 documents and images indexed. -- Roché Compaan Upfront Systems http://www.upfrontsystems.co.za ___ Zope-Dev maillist - Zope-Dev@zope.org http://mail.zope.org/mailman/listinfo/zope-dev ** No cross posts or HTML encoding! ** (Related lists - http://mail.zope.org/mailman/listinfo/zope-announce http://mail.zope.org/mailman/listinfo/zope )
Re: [Zope-dev] catalog performance: query plan
Lennart Regebro wrote: On Sun, Nov 9, 2008 at 19:58, Roché Compaan [EMAIL PROTECTED] wrote: Since I'm in full agreement that we need to fix indexes that are problematic, I started doing some benchmarks on the large data set that gave us so many headaches. It is probably not surprising that the more complex indexes are performing badly. DateRangeIndex, KeywordIndex and Plone's ExtendedPathIndex performed the worst. Below are some stats showing timings around the apply_index call in Catalog.py that was done while testing the application with real data: ExtendedPathIndex doesn't need fixing, but we need to stop using it. It's done to support navigation trees from the catalog, but navigation should not be done via the same catalog as you do other things, but a dedicated tool. That would simplify and speed things up a lot. But OK, that's off-topic. I wander if this could be replaced by zc.relationship / plone.relations? There is potential for removing the five.intid / zope.app.keyreference layer of indirection if the actual oid was stored instead, with an index to a list of database names packed into the first byte. There would even be room to store a reference to the objects class (using the pickle protocol 2 registry to convert this to an integer) in the next two or three bytes if creating ghosts were useful. This would still leave at least 32 bits of space (4 billion) for the actual object id. Without storing the aq_chain explicitly we would need to ensure that __parent__ pointers were pickled for all content objects. The objects themselves could be used instead of metadata rows (without a security check it would be as simple as loading the oid from the relevant db connection). So long as all the required metadata was stored on the object itself only one load would be required for each object. If this same keyreference were used in the indexes of the catalog instead of rowids then result sets could be merged. The downside is that the set intersections would require double the memory of the current 32 bit ids. Laurence ___ Zope-Dev maillist - Zope-Dev@zope.org http://mail.zope.org/mailman/listinfo/zope-dev ** No cross posts or HTML encoding! ** (Related lists - http://mail.zope.org/mailman/listinfo/zope-announce http://mail.zope.org/mailman/listinfo/zope )
Re: [Zope-dev] catalog performance: query plan
Lennart Regebro regebro at gmail.com writes: I would be interested in seeing a bunch of Gurus sit down at some sprint and trying to come up with a catalog engine that is incremental and uses query plans. There is no reason that would not be stupidly fast. :) We can then make a new catalog that uses this engine but has the same API as the old one, to ship with some future version of Zope, say 2.12. There is the Plone Performance sprint we are hosting in Bristol, UK on the 11th - 14th Dec. http://plone.org/events/sprints/bristol-performance-sprint Whilst it is billed as a Plone sprint, of course much of the speedups can be done at the Zope level, so Zope-only developers are more than welcome :) This is exactly the kind of thing that I like hacking on personally, so would love to see it worked on at the sprint. -Matt -- Matt Hamilton [EMAIL PROTECTED] Netsight Internet Solutions, Ltd. Understand. Develop. Deliver http://www.netsight.co.uk +44 (0)117 9090901 Web Design | Zope/Plone Development Consulting | Co-location | Hosting ___ Zope-Dev maillist - Zope-Dev@zope.org http://mail.zope.org/mailman/listinfo/zope-dev ** No cross posts or HTML encoding! ** (Related lists - http://mail.zope.org/mailman/listinfo/zope-announce http://mail.zope.org/mailman/listinfo/zope )
Re: [Zope-dev] catalog performance: query plan
On Mon, 2008-10-27 at 11:32 -0500, Alan Runyan wrote: I agree with Tres. A lot more can be done with Indexes and Catalog without caching. The most exiciting development in Catalog optimizations comes out Jarn. Helge Tesdal (iirc) did a buncha work at a RDBMS company when he was in college. He has a protoype of a query plan for ZCatalog. http://www.jarn.com/blog/catalog-query-plan I would like to ask Roche and others to look at the Query Plan. We looked at query plan but it didn't help us in any way. Some catalog indexes are performing very badly and most of our content is in a published state which doesn't help the query plan much. Caching is a total PITA because invalidation machinery becomes overwhelming complex and unwieldly quickly in production. I agree but this was the only thing that we could do to even go into production. Since I'm in full agreement that we need to fix indexes that are problematic, I started doing some benchmarks on the large data set that gave us so many headaches. It is probably not surprising that the more complex indexes are performing badly. DateRangeIndex, KeywordIndex and Plone's ExtendedPathIndex performed the worst. Below are some stats showing timings around the apply_index call in Catalog.py that was done while testing the application with real data: Index Name |Type |Avg Time |Calls/second == object_implements|KeywordIndex |0.2172234| 4.6 getEffective_or_creat|DateIndex|0.1941770|5.15 effectiveRange |DateRangeIndex |0.0086295| 115.88 allowedRolesAndUsers |KeywordIndex |0.0069754| 143.36 path |ExtendedPathIndex|0.0040614| 246.22 portal_type |FieldIndex |0.0025984| 384.84 SearchableText |ZCTextIndex |0.0007645| 1308.04 sourceUID|FieldIndex |0.0004886| 2046.31 UID |FieldIndex |0.0003070| 3257.1 targetUID|FieldIndex |0.0002287| 4372.12 exact_getUserId |FieldIndex |0.0001931| 5177.79 exact_getUserName|FieldIndex |0.0001816| 5504.39 relationship |FieldIndex |0.822| 12153.1 id |FieldIndex |0.822|12161.81 end |DateIndex|0.623|16027.48 getGroups|FieldIndex |0.278|35973.45 getArtistTitle |FieldIndex |0.259|38495.53 review_state |FieldIndex |0.259|38582.22 Subject |KeywordIndex |0.253|39413.57 getDaysOfTheWeek |KeywordIndex |0.247|40465.98 meta_type|FieldIndex |0.199|50116.64 exact_getGroupId |FieldIndex |0.162|61417.51 getVideoURL |FieldIndex |0.155| 64447.5 year |FieldIndex |0.155|64460.43 Title|FieldIndex |0.136|73381.01 getId|FieldIndex |0.131|76056.97 Title|ZCTextIndex |0.128|77809.46 startendrange|DateRangeIndex |0.127|78485.82 expires |DateIndex|0.126|79001.59 getObjPositionInParen|FieldIndex |0.124| 80675.9 targetId |FieldIndex |0.122|81418.68 effective|DateIndex|0.121| 82651.7 getProvince |FieldIndex |0.117|85198.54 month|FieldIndex |0.116|85762.56 Description |ZCTextIndex |0.116|86241.39 Type |FieldIndex |0.115|86345.17 getLast_login_time |DateIndex|0.115|86698.98 Creator |FieldIndex |0.113|87840.03 getEmail |ZCTextIndex |0.113|87849.05 cmf_uid |FieldIndex |0.113|88352.13 getDuration |FieldIndex |0.113|88454.29 SearchableText |TextIndex|0.113|88466.69 sortable_title |FieldIndex |0.112|88698.49 getRating|FieldIndex |0.112| 88747.5 getGenres|KeywordIndex |0.112|88796.55 object_provides |KeywordIndex |0.112|88919.43 getEventType |KeywordIndex |0.112| 88953.9 in_reply_to |FieldIndex |0.112|89057.46 getReview_state |FieldIndex |0.112|89124.63 is_folderish |FieldIndex |0.112|89240.51 getRawRelatedItems |KeywordIndex |0.111|89568.91 getThumbSize |FieldIndex |0.111|89653.89 getStudioCamURL |FieldIndex |0.111|89678.92 Date |DateIndex|0.111|89799.23 getHash |FieldIndex |0.111|90111.54 getNumberOfComments |FieldIndex |0.110|90141.88 start
Re: [Zope-dev] catalog performance: query plan
On Mon, Oct 27, 2008 at 17:32, Alan Runyan [EMAIL PROTECTED] wrote: I agree with Tres. A lot more can be done with Indexes and Catalog without caching. The most exiciting development in Catalog optimizations comes out Jarn. Helge Tesdal (iirc) did a buncha work at a RDBMS company when he was in college. He has a protoype of a query plan for ZCatalog. http://www.jarn.com/blog/catalog-query-plan I would like to ask Roche and others to look at the Query Plan. Caching is a total PITA because invalidation machinery becomes overwhelming complex and unwieldly quickly in production. I don't know very much about searching, but this definitely sounds like a good idea. Also, especially when doing free text searching that has large result sets, incremental searching is very beneficial. I know Dieter Maurer has made a Zope2 implementation of this. I would be interested in seeing a bunch of Gurus sit down at some sprint and trying to come up with a catalog engine that is incremental and uses query plans. There is no reason that would not be stupidly fast. :) We can then make a new catalog that uses this engine but has the same API as the old one, to ship with some future version of Zope, say 2.12. -- Lennart Regebro: Zope and Plone consulting. http://www.colliberty.com/ +33 661 58 14 64 ___ Zope-Dev maillist - Zope-Dev@zope.org http://mail.zope.org/mailman/listinfo/zope-dev ** No cross posts or HTML encoding! ** (Related lists - http://mail.zope.org/mailman/listinfo/zope-announce http://mail.zope.org/mailman/listinfo/zope )
[Zope-dev] catalog performance: query plan
I agree with Tres. A lot more can be done with Indexes and Catalog without caching. The most exiciting development in Catalog optimizations comes out Jarn. Helge Tesdal (iirc) did a buncha work at a RDBMS company when he was in college. He has a protoype of a query plan for ZCatalog. http://www.jarn.com/blog/catalog-query-plan I would like to ask Roche and others to look at the Query Plan. Caching is a total PITA because invalidation machinery becomes overwhelming complex and unwieldly quickly in production. -- Alan Runyan Enfold Systems, Inc. http://www.enfoldsystems.com/ phone: +1.713.942.2377x111 fax: +1.832.201.8856 ___ Zope-Dev maillist - Zope-Dev@zope.org http://mail.zope.org/mailman/listinfo/zope-dev ** No cross posts or HTML encoding! ** (Related lists - http://mail.zope.org/mailman/listinfo/zope-announce http://mail.zope.org/mailman/listinfo/zope )