(and pass the result of that into a collection-query.)

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Geert Josten
Sent: Wednesday, March 2, 2011 22:23
To: General MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] efficient storage/retrieval scheme

Hi Mike,

How about passing a wildcard search for the partial cases to cts:collection-match?

Kind regards,
Geert

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Mike Sokolov
Sent: Wednesday, March 2, 2011 22:15
To: General MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] efficient storage/retrieval scheme

Thank you to everybody who responded. I ran some tests on 100000 docs with some random data. The upshot is that collection() is about the same speed as estimating an element-value-query. Doing element-query(and-query(attribute-query(), ...)) was about 5 times slower (and estimates are wrong in this case: you have to run filtered).

So I think I would concur with Mike Blakeley: collection() (or possibly value-query) for the fully-specified case, and a query based on attribute values for a single-dimension query. I'm still up in the air about what to do for intermediate cases (i.e. querying two attributes only). We'll see if that's an important use case...

Thanks again!

-Mike

On 03/02/2011 09:30 AM, Mike Sokolov wrote:
> I need to design a data element for our platform with an eye to the most
> efficient possible retrieval of documents in a collection defined by
> this data element. Assume there could be millions of documents. It
> will have at least three dimensions: site, content-set, and status;
> these are all completely independent. None of these is likely to have
> more than a few tens or hundreds of different values: status will have 2
> or 3, definitely fewer than 10.
>
> I need to be able to retrieve documents based on the values of each
> dimension independently (i.e. all documents in content set X), as well as
> (and this could be more typical) a fully-specified vector (content-set,
> site and status).
>
> I can think of several possibilities:
>
> 1. An element whose text includes all three values as words in some
> predefined order:
>
> <collection>cs100 site50 status1</collection>
>
> with word queries for single-dimension queries and value (or maybe
> phrase queries?) for joins.
>
> 2. A ML collection whose name is all three values concatenated in some
> order:
>
> collection("cs100-site50-status1")
>
> joins of all three dimensions become a simple collection lookup, and
> cts:collection-match() for single- or dual-dimension queries.
>
> 3. An element with three attributes:
>
> <collection cs="100" site="50" status="1" />
>
> This is attractive from the perspective of XML modeling and will expose
> the values neatly for xpath (perhaps we could combine it with one of the
> above), but I'm concerned that
> cts:element-query(collection, ...) might not be as efficient for retrieval.
> Also: would we need to enable element-position indexes to make this
> accurate as an unfiltered query?
>
> Would anyone care to comment on the "best" design? Other ideas?
>
> Thanks!
>
> _______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general
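
For readers of the archive, the cts:collection-match suggestion from this thread could be sketched roughly as below. This is only an illustration under stated assumptions, not code anyone in the thread posted: it assumes the collection lexicon is enabled on the database (cts:collection-match needs it), that every document sits in one collection named in the "cs100-site50-status1" style from option 2, and the variable names are made up for the example.

```xquery
xquery version "1.0-ml";

(: Sketch only. Assumes the collection lexicon is enabled and that
   collection names follow the thread's "cs<N>-site<N>-status<N>" pattern.
   $names and $query are illustrative names, not from the thread. :)

(: Partial case: content set 100 with status 1, any site.
   The wildcard pattern is resolved against the collection lexicon. :)
let $names := cts:collection-match("cs100-*-status1")

(: cts:collection-query matches documents in any of the listed collections. :)
let $query := cts:collection-query($names)

return (
  (: Count of matching documents, resolved from the indexes. :)
  xdmp:estimate(cts:search(fn:doc(), $query)),

  (: First ten matching documents, run unfiltered. :)
  cts:search(fn:doc(), $query, "unfiltered")[1 to 10]
)
```

With the same naming assumption, a single-dimension lookup would use a pattern such as "*-site50-*", and the fully-specified case needs no lexicon lookup at all: collection("cs100-site50-status1") as in option 2.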
