Re: [ZODB-Dev] Re: [Zope3-dev] Re: Community opinion about search+filter
Dieter Maurer wrote: Jim Washington wrote at 2007-3-27 16:28 -0400: ... Yes, I think so, at least in the implementation/algorithm I am using. There may be other implementations that do not need this. Note, however, that the canonical list does not have to be complex objects. The canonical list is just a representation of the unsorted state. It can, for example, be a proxy list of iids, list indexes, or OOBTree keys. The algorithm does not care what it is reordering. If you need to keep the canonical list around, then sort them and then keep the sorted result around (i.e. cache the sorted list). This way, you could avoid the factoradic index. Yes, if there is only one ordering that will be used. Factoradic indices may be obtained and used for any arbitrary ordering of the canonical list. It may be useful, for example, to keep an index that has books sorted by title, another one sorted by author and title, and another one in publication date order. - Jim Washington ___ Zope3-dev mailing list Zope3-dev@zope.org Unsub: http://mail.zope.org/mailman/options/zope3-dev/archive%40mail-archive.com
Re: [ZODB-Dev] Re: [Zope3-dev] Re: Community opinion about search+filter
Lennart Regebro wrote at 2007-3-28 18:25 +0200: On 3/27/07, Dieter Maurer [EMAIL PROTECTED] wrote: However, this approach is only efficient when the sort index size is small compared to the result size. Sure. But with incremental searching, the result size is always one, right? ;-) No. You want to have one (the best one) hit from a (potentially) large set of hits. The size of this set is essential whether or not a sort index is efficient. The principle is like this: Let S be the hit set. Assume you sort index consists of values i1 i2 and use D(i) for the set of documents indexed under i, Then D(i1) intersect S preceeds D(i2) intersects S preceeds D(i3) intersects S, etc in the result ordered by the index i. If the index size is small compared to S (and if we assume the hits are uniformly distributed over the indexed values), then each intersection can determine (on average) a significant amount of sorted hits. You can efficiently determine the first hits. Assume on the other hand that S contains a single element and that the index is large, then almost all intersections are a vaste of time (as the result is the empty set). -- Dieter ___ Zope3-dev mailing list Zope3-dev@zope.org Unsub: http://mail.zope.org/mailman/options/zope3-dev/archive%40mail-archive.com
Re: [ZODB-Dev] Re: [Zope3-dev] Re: Community opinion about search+filter
On 3/26/07, Dieter Maurer [EMAIL PROTECTED] wrote: When IncrementalSearch does it, it works roughly as Jim described it under 1 above (although there is no need that it works the the primary key directly). Right. Unless the indexes are able to return sorted (partial results), we need to determine the results first and then sort them. And that takes time at least linear in the number of hits (to determine the sort values for the documents). OK, at least this avoids the big intermediate results when searching over several indexes. But you still have to get all of the results, and sort them before you can return the X first. I have the impression that Lucene somehow solves this with their sorting indexes, but I'm not sure, and I haven't tried to understand the code. -- Lennart Regebro: Zope and Plone consulting. http://www.colliberty.com/ +33 661 58 14 64 ___ Zope3-dev mailing list Zope3-dev@zope.org Unsub: http://mail.zope.org/mailman/options/zope3-dev/archive%40mail-archive.com
Re: [ZODB-Dev] Re: [Zope3-dev] Re: Community opinion about search+filter
On Mar 26, 2007, at 8:45 PM, Martijn Faassen wrote: ... I think we can safely conclude that: * there is no silver bullet in all this (your point) Yes * there is probably room for improvement. Yes ... My hopes: * is that there is some low-hanging fruit in improving things. Possibly, including through education. ... * that we may be able to provide some more infrastructure to help developers in scaling particular catalog usage scenarios (batching with sorting). Possibly, although if this is really a goal, it will require: - Educating people on how hard this is and - Providing infrastructure that let's people record use alternate document ids or supplementary document ids. (Not even thinking about relevance ranks.) I suspect that such an infrastructure will be too hard to use for many people. I still get the impression that you think that batching is more of an opportunity than it really is. Yes, it does reduce time significantly, but still not enough to achieve scalability. (It really only lets you deal with somewhat larger small result sets.) I argue these points as you initially gave me the strong impression saying that there's no point in even talking about all this. I never said any such thing, In fact, I spent quite a bit of effort talking about it. After that I thought we were actually going somewhere with this discussion, but you now strengthen this impression by apparently giving up in exasparation. That is what is making *me* slightly exasparated. :) Fine. I gave up because you dismiss literature and theory. You seem to imply that in practice there is some kind of magic that will somehow make sorts go fast despite any theoretical basis. This convinced me that further discussion was a waste of time. Jim -- Jim Fulton mailto:[EMAIL PROTECTED]Python Powered! CTO (540) 361-1714 http://www.python.org Zope Corporationhttp://www.zope.com http://www.zope.org ___ Zope3-dev mailing list Zope3-dev@zope.org Unsub: http://mail.zope.org/mailman/options/zope3-dev/archive%40mail-archive.com
Re: [ZODB-Dev] Re: [Zope3-dev] Re: Community opinion about search+filter
Hi, Martijn I have a suggestion, only because I have played around with the idea a bit. Google for python factoradics, and you will get my blog entry about factoradics. I see the problem statement as How to obtain batching without re-sorting multiple times. If you see a sort order as one permutation of a list, the factoradic technique provides a key to that permutation. So, in theory, one would sort the list, and store the factoradic index for that permutation. The next time the particular sort order is requested, it's a simple matter of unpacking the factoradic. I have not done any tests to see whether unpacking a factoradic is significantly less expensive than re-sorting. Intuitively, it should be. In practice, I am not so sure. Anyway, this is FWIW. :) -Jim Washington ___ Zope3-dev mailing list Zope3-dev@zope.org Unsub: http://mail.zope.org/mailman/options/zope3-dev/archive%40mail-archive.com
Re: [ZODB-Dev] Re: [Zope3-dev] Re: Community opinion about search+filter
Hey Jim, On 3/27/07, Jim Fulton [EMAIL PROTECTED] wrote: [snip] After that I thought we were actually going somewhere with this discussion, but you now strengthen this impression by apparently giving up in exasparation. That is what is making *me* slightly exasparated. :) Fine. I gave up because you dismiss literature and theory. You seem to imply that in practice there is some kind of magic that will somehow make sorts go fast despite any theoretical basis. This convinced me that further discussion was a waste of time. I can see how you interpreted my statements that way. My apologies; I did not intend to dismiss literature and theory nor did I wish to give the impression that I do. All I tried to express is that algorithmic scalability is just one aspect of scalability issues, and lower-level design choices and constant factors do matter in the final execution. I don't know how much - getting a rough idea requires the comparisons we talked about, and I intend to do this at some point. :) And theory clearly puts upper limits on what is possible, and the upper limits (which are not as high as people would naively expect). I was assuming you understood that I understood the fact that theory provides a limit as a given. :) I realize that there is no way to improve things fundamentally in all cases, and I realize that writing an easy-enough API that helps matters in at least some cases will be a challenge. Regards, Martijn ___ Zope3-dev mailing list Zope3-dev@zope.org Unsub: http://mail.zope.org/mailman/options/zope3-dev/archive%40mail-archive.com
Re: [ZODB-Dev] Re: [Zope3-dev] Re: Community opinion about search+filter
Jim Washington wrote at 2007-3-27 08:24 -0400: ... If you see a sort order as one permutation of a list, the factoradic technique provides a key to that permutation. So, in theory, one would sort the list, and store the factoradic index for that permutation. The next time the particular sort order is requested, it's a simple matter of unpacking the factoradic. I have not done any tests to see whether unpacking a factoradic is significantly less expensive than re-sorting. Intuitively, it should be. In practice, I am not so sure. In what way is it different from caching? The packed factoradic seems like a cache to me. -- Dieter ___ Zope3-dev mailing list Zope3-dev@zope.org Unsub: http://mail.zope.org/mailman/options/zope3-dev/archive%40mail-archive.com
Re: [ZODB-Dev] Re: [Zope3-dev] Re: Community opinion about search+filter
Lennart Regebro wrote at 2007-3-27 11:59 +0200: ... OK, at least this avoids the big intermediate results when searching over several indexes. But you still have to get all of the results, and sort them before you can return the X first. I have the impression that Lucene somehow solves this with their sorting indexes, but I'm not sure, and I haven't tried to understand the code. The ZCatalog, too, can use sorting indexes (and AdvancedQuery which stole the idea from ZCatalog). However, this approach is only efficient when the sort index size is small compared to the result size. If the result size is large, the sorting index can only help you to get the sorting values quite fast as they are stored compactly together. -- Dieter ___ Zope3-dev mailing list Zope3-dev@zope.org Unsub: http://mail.zope.org/mailman/options/zope3-dev/archive%40mail-archive.com
Re: [ZODB-Dev] Re: [Zope3-dev] Re: Community opinion about search+filter
Jim Fulton wrote at 2007-3-26 15:55 -0400: ... On Mar 26, 2007, at 3:28 PM, Dieter Maurer wrote: Jim Fulton wrote at 2007-3-25 09:53 -0400: On Mar 25, 2007, at 3:01 AM, Adam Groszer wrote: MF I think one of the main limitations of the current catalog (and MF hurry.query) is efficient support for sorting and batching the query MF results. The Zope 3 catalog returns all matching results, which can then MF be sorted and batched. This will stop being scalable for large MF collections. A relational database is able to do this internally, and is MF potentially able to use optimizations there. What evidence to you have to support this assertion? We did some literature search on this a few years ago and found no special trick to avoid sorting costs. I know of 2 approaches to reducing sort cost: 1. Sort your results based on the primary key and therefore, pick your primary key to match your sort results. In terms of the Zope catalog framework, the primary keys are the document IDs, which are traditionally chosen randomly. You can pick your primary keys based on a desired sort order instead. A variation on this theme is to use multiple sets of document ids, storing multiple sets of ids in each index. Of course, this approach doesn't help with something like relevance ranks. 2. Use an N-best algorithm. If N is the size of the batch and M is the corpus size, then this is O(M*ln(N)) rather than O(M*ln(M)) which is a significant improvement if N M, but still quite expensive. The major costs in sorting are usually not the log(n) but the very high linear costs fetching the sort keys (although for huge n, we will reach the asymptotic limits). Right. The problem is the N not the log(N). :) Under normal conditions, a relational database can be far more efficient to fetch values either from index structures or the data records than Zope -- as * its data representation is much more compact * it often supports direct access * the server itself can access and process all data. With the ZODB, the data is hidden in pickles (less compact), there is no direct access (instead the complete pickle need to be decoded) The catalog sort index mechanism uses the un-index data structure in the sort index to get sort keys. This is a pretty compact data structure. The data usually is in IOBuckets which contain 45 values on the average. In a corresponding relational index structure, you could have several hundreds of values. and all operations are done in the client (rather than in the server). Which is often fine if the desired data are in the client cache. It avoids making the storage a bottleneck. Yes, but the if is important. Quite often, some operations flush almost all objects from the cache (partly because the cache is controlled by the number of objects and not by their size) and after that, filling the cache again takes ages. Moreover, a relational database can (and usually does) use caching as well. It is not restricted to a cliend side only technique. I know that most relational database backends have a different architecture than ZEO: use one process (or at least one thread) per connection such that several activities can interleave the high IO wait times. The one thread ZEO architecture must take more care not to become the bottleneck. -- Dieter ___ Zope3-dev mailing list Zope3-dev@zope.org Unsub: http://mail.zope.org/mailman/options/zope3-dev/archive%40mail-archive.com
Re: [ZODB-Dev] Re: [Zope3-dev] Re: Community opinion about search+filter
Dieter Maurer wrote: Jim Washington wrote at 2007-3-27 08:24 -0400: ... If you see a sort order as one permutation of a list, the factoradic technique provides a key to that permutation. So, in theory, one would sort the list, and store the factoradic index for that permutation. The next time the particular sort order is requested, it's a simple matter of unpacking the factoradic. I have not done any tests to see whether unpacking a factoradic is significantly less expensive than re-sorting. Intuitively, it should be. In practice, I am not so sure. In what way is it different from caching? The packed factoradic seems like a cache to me. Hi, Dieter A factoradic index is representable as a long integer. Given that integer and the canonical list, you can regenerate the permutation represented by that integer. So, instead of caching the sorted list itself, you find and keep this integer, which is all the information needed to algorithmically re-obtain the sorted list. So, this would be slower than caching, but (I think) faster than re-sorting. -Jim Washington ___ Zope3-dev mailing list Zope3-dev@zope.org Unsub: http://mail.zope.org/mailman/options/zope3-dev/archive%40mail-archive.com
Re: [ZODB-Dev] Re: [Zope3-dev] Re: Community opinion about search+filter
On Mar 26, 2007, at 3:28 PM, Dieter Maurer wrote: Jim Fulton wrote at 2007-3-25 09:53 -0400: On Mar 25, 2007, at 3:01 AM, Adam Groszer wrote: MF I think one of the main limitations of the current catalog (and MF hurry.query) is efficient support for sorting and batching the query MF results. The Zope 3 catalog returns all matching results, which can then MF be sorted and batched. This will stop being scalable for large MF collections. A relational database is able to do this internally, and is MF potentially able to use optimizations there. What evidence to you have to support this assertion? We did some literature search on this a few years ago and found no special trick to avoid sorting costs. I know of 2 approaches to reducing sort cost: 1. Sort your results based on the primary key and therefore, pick your primary key to match your sort results. In terms of the Zope catalog framework, the primary keys are the document IDs, which are traditionally chosen randomly. You can pick your primary keys based on a desired sort order instead. A variation on this theme is to use multiple sets of document ids, storing multiple sets of ids in each index. Of course, this approach doesn't help with something like relevance ranks. 2. Use an N-best algorithm. If N is the size of the batch and M is the corpus size, then this is O(M*ln(N)) rather than O(M*ln(M)) which is a significant improvement if N M, but still quite expensive. The major costs in sorting are usually not the log(n) but the very high linear costs fetching the sort keys (although for huge n, we will reach the asymptotic limits). Right. The problem is the N not the log(N). :) Under normal conditions, a relational database can be far more efficient to fetch values either from index structures or the data records than Zope -- as * its data representation is much more compact * it often supports direct access * the server itself can access and process all data. With the ZODB, the data is hidden in pickles (less compact), there is no direct access (instead the complete pickle need to be decoded) The catalog sort index mechanism uses the un-index data structure in the sort index to get sort keys. This is a pretty compact data structure. and all operations are done in the client (rather than in the server). Which is often fine if the desired data are in the client cache. It avoids making the storage a bottleneck. Jim -- Jim Fulton mailto:[EMAIL PROTECTED]Python Powered! CTO (540) 361-1714 http://www.python.org Zope Corporationhttp://www.zope.com http://www.zope.org ___ Zope3-dev mailing list Zope3-dev@zope.org Unsub: http://mail.zope.org/mailman/options/zope3-dev/archive%40mail-archive.com
Re: [ZODB-Dev] Re: [Zope3-dev] Re: Community opinion about search+filter
On Mar 25, 2007, at 3:01 AM, Adam Groszer wrote: MF I think one of the main limitations of the current catalog (and MF hurry.query) is efficient support for sorting and batching the query MF results. The Zope 3 catalog returns all matching results, which can then MF be sorted and batched. This will stop being scalable for large MF collections. A relational database is able to do this internally, and is MF potentially able to use optimizations there. What evidence to you have to support this assertion? We did some literature search on this a few years ago and found no special trick to avoid sorting costs. I know of 2 approaches to reducing sort cost: 1. Sort your results based on the primary key and therefore, pick your primary key to match your sort results. In terms of the Zope catalog framework, the primary keys are the document IDs, which are traditionally chosen randomly. You can pick your primary keys based on a desired sort order instead. A variation on this theme is to use multiple sets of document ids, storing multiple sets of ids in each index. Of course, this approach doesn't help with something like relevance ranks. 2. Use an N-best algorithm. If N is the size of the batch and M is the corpus size, then this is O(M*ln(N)) rather than O(M*ln(M)) which is a significant improvement if N M, but still quite expensive. I don't think relational databases have any magic bullet to get around sorting costs. Sorting is expensive. In many ways, I think the sorting support in the catalog gave people a false sense of security. Jim -- Jim Fulton mailto:[EMAIL PROTECTED]Python Powered! CTO (540) 361-1714 http://www.python.org Zope Corporationhttp://www.zope.com http://www.zope.org ___ Zope3-dev mailing list Zope3-dev@zope.org Unsub: http://mail.zope.org/mailman/options/zope3-dev/archive%40mail-archive.com
Re: [ZODB-Dev] Re: [Zope3-dev] Re: Community opinion about search+filter
On 3/25/07, Jim Fulton [EMAIL PROTECTED] wrote: On Mar 25, 2007, at 3:01 AM, Adam Groszer wrote: MF I think one of the main limitations of the current catalog (and MF hurry.query) is efficient support for sorting and batching the query MF results. The Zope 3 catalog returns all matching results, which can then MF be sorted and batched. This will stop being scalable for large MF collections. A relational database is able to do this internally, and is MF potentially able to use optimizations there. What evidence to you have to support this assertion? We did some literature search on this a few years ago and found no special trick to avoid sorting costs. I know of 2 approaches to reducing sort cost: 1. Sort your results based on the primary key and therefore, pick your primary key to match your sort results. In terms of the Zope catalog framework, the primary keys are the document IDs, which are traditionally chosen randomly. You can pick your primary keys based on a desired sort order instead. A variation on this theme is to use multiple sets of document ids, storing multiple sets of ids in each index. Of course, this approach doesn't help with something like relevance ranks. 2. Use an N-best algorithm. If N is the size of the batch and M is the corpus size, then this is O(M*ln(N)) rather than O(M*ln(M)) which is a significant improvement if N M, but still quite expensive. I don't think relational databases have any magic bullet to get around sorting costs. Sorting is expensive. In many ways, I think the sorting support in the catalog gave people a false sense of security. I don't know if relational databases typically does this internally (I don't think so). However, some search engines do it, like Lucene. And supposedly also Dieters IncrementalSearch (haven't used it yet). -- Lennart Regebro: Zope and Plone consulting. http://www.colliberty.com/ +33 661 58 14 64 ___ Zope3-dev mailing list Zope3-dev@zope.org Unsub: http://mail.zope.org/mailman/options/zope3-dev/archive%40mail-archive.com
Re: [ZODB-Dev] Re: [Zope3-dev] Re: Community opinion about search+filter
On Mar 25, 2007, at 11:08 AM, Lennart Regebro wrote: ... 2. Use an N-best algorithm. If N is the size of the batch and M is the corpus size, then this is O(M*ln(N)) rather than O(M*ln(M)) which is a significant improvement if N M, but still quite expensive. I don't think relational databases have any magic bullet to get around sorting costs. Sorting is expensive. In many ways, I think the sorting support in the catalog gave people a false sense of security. I don't know if relational databases typically does this internally (I don't think so). However, some search engines do it, like Lucene. And supposedly also Dieters IncrementalSearch (haven't used it yet). Our catalog framework also has N-best support. JIm -- Jim Fulton mailto:[EMAIL PROTECTED]Python Powered! CTO (540) 361-1714 http://www.python.org Zope Corporationhttp://www.zope.com http://www.zope.org ___ Zope3-dev mailing list Zope3-dev@zope.org Unsub: http://mail.zope.org/mailman/options/zope3-dev/archive%40mail-archive.com