I started writing http://dev.xwiki.org/xwiki/bin/view/Design/SolrSchema . I need help with two things:
* test cases http://dev.xwiki.org/xwiki/bin/view/Design/SolrSchema#HTestCases
* if time permits, review the proposal, especially
  http://dev.xwiki.org/xwiki/bin/view/Design/SolrSchema#HAMixedApproach .

Thanks,
Marius

On Fri, Oct 11, 2013 at 12:55 PM, Marius Dumitru Florea <[email protected]> wrote:
> Hi devs,
>
> This is a very important question so think carefully. Let me explain:
>
> In XWiki (model) we have a few entity types. There are *wikis* which
> have *spaces* which have *documents*. A document can have *objects*
> and *attachments*. A document can also define a *class*.
>
> At the same time we like to say that in XWiki "everything is a
> document" because everything revolves around documents. The document
> is the central notion.
>
> We can query the database (using HQL or XWQL) for any of the
> previously mentioned entities but what should a Solr query return
> (semantically)? In other words:
>
> * are you searching for an object without caring about the document
> that holds the object? Same for an object property.
> * how often are you searching for an attachment without caring about
> the document that holds the attachment?
> * are you searching for a class or for the document that defines that class?
> * are you searching for a wiki without caring about the documents it
> contains? Same for a space.
>
> IMO the result of a Solr query should be, semantically, a list of
> documents. But maybe I'm wrong.
>
> -----------------------
> Technical Details
> -----------------------
>
> Unlike a relational database, a Solr/Lucene index has a single 'table'.
> So normally you index a single entity type. Each row in the index
> represents an entity of that type. As a consequence the result of a
> Solr query is semantically a list of entities of that type. In our
> case the entity type is (naturally) *document*.
>
> If you want to index more entity types (e.g. index attachments and
> objects _separately_, not as part of a document) then, since there is
> only one 'table' in the index, you need to add a 'type' column that
> specifies the type of entity you have on each row (e.g. type=document,
> type=attachment, type=object etc.). The result of a Solr query is now,
> semantically, a list of different entity types, unless you filter by a
> specific type. It smells like a hack to me.
>
> Let's imagine what happens if we want to search for blog posts that
> have a specific tag. With the first approach this is easy because all
> the (indexed) information is on a single row. With the second approach
> this is considerably more complex because the information is spread
> across multiple rows:
>
> * one row with type=document for the blog post document
> * one row with type=object for the blog post object
> * one row with type=object for the tag object
>
> In a relational database when you have the information spread in
> multiple places (tables) you do joins. Fortunately (you would say)
> Solr supports joins. In this particular case we would have to perform
> 2 joins which means:
>
> index X index X index
>
> where X represents the cartesian product. The document name would be
> the join key. Pretty complex even before trying to write this in Solr
> query syntax.
>
> So basically the question becomes: is it worth indexing more entities
> _separately_ instead of indexing just documents (with info about their
> objects and attachments) considering the complexity that it brings in
> writing Solr queries? Do we search for objects and attachments alone
> as separate entities often enough to justify this complexity? My
> answer is no.
>
> Thanks,
> Marius
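
To make the two index layouts discussed above concrete, here is a minimal in-memory sketch in Python. It does not talk to Solr and is not taken from the XWiki code base: the rows are plain dictionaries, and every field name (docname, type, class, tags, object_classes) as well as the sample data is an assumption made up for illustration. It only shows how the "blog posts with a given tag" search looks when everything is on one row versus when it has to be reassembled with two self-joins on the document name.

```python
import itertools

# Approach 1 (hypothetical layout): one row per document, with the data of
# its objects flattened into the same row.
DOCUMENT_INDEX = [
    {"docname": "Blog.Post1",
     "object_classes": ["Blog.BlogPostClass", "XWiki.TagClass"],
     "tags": ["solr", "search"]},
    {"docname": "Blog.Post2",
     "object_classes": ["Blog.BlogPostClass", "XWiki.TagClass"],
     "tags": ["wiki"]},
    {"docname": "Main.WebHome",
     "object_classes": ["XWiki.TagClass"],
     "tags": ["solr"]},
]

# Approach 2 (hypothetical layout): one row per entity, distinguished by a
# 'type' column. The same information is now spread over several rows.
MIXED_INDEX = [
    {"type": "document", "docname": "Blog.Post1"},
    {"type": "object", "docname": "Blog.Post1", "class": "Blog.BlogPostClass"},
    {"type": "object", "docname": "Blog.Post1", "class": "XWiki.TagClass",
     "tags": ["solr", "search"]},
    {"type": "document", "docname": "Blog.Post2"},
    {"type": "object", "docname": "Blog.Post2", "class": "Blog.BlogPostClass"},
    {"type": "object", "docname": "Blog.Post2", "class": "XWiki.TagClass",
     "tags": ["wiki"]},
    {"type": "document", "docname": "Main.WebHome"},
    {"type": "object", "docname": "Main.WebHome", "class": "XWiki.TagClass",
     "tags": ["solr"]},
]


def blog_posts_with_tag_single(index, tag):
    """Approach 1: a plain filter, because each row is a whole document."""
    return sorted(
        row["docname"]
        for row in index
        if "Blog.BlogPostClass" in row["object_classes"] and tag in row["tags"]
    )


def blog_posts_with_tag_mixed(index, tag):
    """Approach 2: index X index X index, joined on the document name."""
    matches = set()
    for doc, post, tag_obj in itertools.product(index, repeat=3):
        if doc["type"] != "document":
            continue
        if post["type"] != "object" or post.get("class") != "Blog.BlogPostClass":
            continue
        if tag_obj["type"] != "object" or tag_obj.get("class") != "XWiki.TagClass":
            continue
        if tag not in tag_obj.get("tags", []):
            continue
        # The document name is the join key across the three rows.
        if doc["docname"] == post["docname"] == tag_obj["docname"]:
            matches.add(doc["docname"])
    return sorted(matches)


single_result = blog_posts_with_tag_single(DOCUMENT_INDEX, "solr")
mixed_result = blog_posts_with_tag_mixed(MIXED_INDEX, "solr")
print(single_result, mixed_result)
```

Both functions return the same list of document names for the sample data. The difference is that the second one has to combine three rows per candidate before it can decide anything, which is the extra query complexity the message is worried about.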

