On Mon, Nov 29, 2010 at 2:19 PM, Alexander Klimetschek <[email protected]> wrote:
> On 29.11.10 13:21, "Ard Schrijvers" <[email protected]> wrote:
>>And this is a big burden! I think, we could have a single big index
>>for the JCR spec implementation. But, I wouldn't solve this by having
>>more small indexes, as collections. I would like to have an option, in
>>case of XPath, like 'simpleXPath=true' where we limit some of the
>>options: In other words, not all the jcr spec queries are available,
>>but it is efficient and fast (we at Hippo limit ourselves to only
>>efficient xpath queries). If you do not by default store all
>>properties, and do not have to support complex path constraint (only
>>simple ones), then, you wouldn't have to bother that much about one
>>single Lucene index.
>
> As written in my other mail, there are good reasons to allow for separate
> indexes, to resolve conflicts of different indexing needs for different
> applications. Maybe this is only true for the (node-scoped) full text
> index, where you can't exclude certain properties at query time.
>
> And the big advantage of those collections is that you solve the path
> constraint issue, at least for those queries like:
>
> /content/siteA//*[jcr:contains(., 'term') and @myProp='foo']
>
> because you would have a collection for /content/siteA, /content/siteB,
> etc. with just the right full text / property index.
We achieved this in a much easier and more flexible way, because we also have the *demand* for an instant path constraint on any path. A little background first: Jackrabbit has a very nice feature, namely that JCR nodes are not aware of their actual location. Only the parent and the children are known. This also holds for the index. It means that moving a tree with thousands of nodes is a single node change, both in the database and in the index. However, this comes at the price of slow path constraint queries, which was unacceptable for us. Hence, for each node, we also index all of its ancestors in a multivalued Lucene field. Suppose my location is /content/document/news/2009/12/foo. My Lucene field will then have the terms:

/content
/content/document
/content/document/news
/content/document/news/2009
/content/document/news/2009/12

So *any* simple path constraint in our repository is just a match on a single Lucene term, which is instant. "Give me all nodes below '/content/document/news'" is simply all the nodes that have the term '/content/document/news' in our predefined Lucene field (note that we actually use node ids for it, but for the picture, this is easier to understand).

>
>>Lucene 4.0 will be so blistering fast and efficient...
>
> Cool.
>
>>the figures we
>>need to index with Jackrabbit is peanuts for Lucene. *If* we improve
>>indexing, a couple of hundreds of millions of nodes is a no-brainer!
>
> With the exception of the path constrained, as this is not indexed. Maybe
> it will be easier with Lucene 4.0 to index the path, especially allow for
> fast updates of the path property when something is moved?

Lucene will hardly have improvements for hierarchical structures. Note that this is exactly what makes JCR indexing so complex: the hierarchy! For small hierarchies, more at the Document level, a NestedDocumentQuery might be added, which avoids cross matching; see [1].
But this is very simple compared to what Jackrabbit can do with XPath, and it is still in development.

>>We should not be thinking about problems that are a result of the
>>current implementation and its short comings (they are a result that
>>it needed to work against Lucene 1.4, this is no critics to be sure!).
>
> Ok.
>
>>asynchronous indexing is already part of the jcr 283 afaik and is
>>allowed, certainly for binary content
>
> Sure, but still indexing takes a major part of a save() call, AFAIK.

True... which makes it all the more important that just one node in a cluster does the actual indexing (or the extraction, for example from PDF, which is even more important!).

Regards, Ard

[1] https://issues.apache.org/jira/browse/LUCENE-2454

> Regards,
> Alex
>
> --
> Alexander Klimetschek
> Developer // Adobe (Day) // Berlin - Basel

--
Hippo
Europe • Amsterdam Oosteinde 11 • 1017 WT Amsterdam • +31 (0)20 522 4466
USA • San Francisco 185 H Street Suite B • Petaluma CA 94952-5100 • +1 (707) 773 4646
Canada • Montréal 5369 Boulevard St-Laurent • Montréal QC H2T 1S5 • +1 (514) 316 8966
www.onehippo.com • www.onehippo.org • [email protected]
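
Appendix: to make the ancestor field described in this mail concrete, here is a minimal sketch in plain Python. It is not the Hippo or Jackrabbit code and does not use Lucene; `ancestor_paths` and `AncestorIndex` are hypothetical names, and the dictionary merely stands in for a multivalued Lucene field. It also uses path strings for readability, whereas the mail notes that the real field stores node ids.

```python
from collections import defaultdict


def ancestor_paths(path):
    """Return all ancestor paths of `path`, top down, excluding the path itself."""
    segments = [s for s in path.strip().split("/") if s]
    return ["/" + "/".join(segments[:i]) for i in range(1, len(segments))]


class AncestorIndex:
    """Toy inverted index (assumption: stands in for one multivalued Lucene field).

    Maps each ancestor path term to the set of node paths that carry it.
    """

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, path):
        # Index one term per ancestor, as in the example from the mail.
        for term in ancestor_paths(path):
            self._postings[term].add(path)

    def descendants(self, ancestor):
        # A path constraint becomes a single term lookup.
        return sorted(self._postings.get(ancestor, ()))


index = AncestorIndex()
index.add("/content/document/news/2009/12/foo")
index.add("/content/document/news/2010/01/bar")
index.add("/content/document/events/baz")

print(ancestor_paths("/content/document/news/2009/12/foo"))
print(index.descendants("/content/document/news"))
```

The design choice is a trade of write cost for read speed: every node carries one term per ancestor, so a lookup below any path is a single term match, but the stored terms have to be kept in step with the hierarchy.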
