Re: [xwiki-devs] [Solr] What do we search for?

Eduard Moraru Sun, 13 Oct 2013 14:28:36 -0700

Hi,

The initial idea behind modelling things this way in the Solr index was to
let devs search for entities other than documents (objects and their
properties) using Solr instead of having to use (HS/XW)QL.


An interesting use case from my POV is a dev who wants to search for a
piece of code (or just a word/string) that is stored in an XWiki object. We
keep assuming that everybody stores one object per document, but that is
obviously not always the case: some app devs might want to store their
application's entities as multiple objects inside a single page (we always
face that dilemma when starting to dev an application). Ludovic mentioned
another example with XWikiComments.

The problem with the previous mapping was that we could find the document
where the object is stored, but not the actual object in which the match
occurred, since we indexed property values in the form
ClassName.propertyName:value. Telling the dev that the string/code/etc.
he's looking for is somewhere in that document does not help much.
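
To make the limitation concrete, here is a minimal Python sketch of that
per-document mapping. It is only an illustration under assumptions: the page
name, the comment data and the helper names (flatten_document,
matching_fields) are hypothetical, and this is not the actual XWiki indexing
code.

```python
def flatten_document(doc_name, objects):
    """Build one index row per document, with fields named ClassName.propertyName."""
    row = {"name": doc_name}
    for obj in objects:
        for prop, value in obj["properties"].items():
            # Values of every object of the same class land in the same field.
            row.setdefault(f'{obj["class"]}.{prop}', []).append(value)
    return row


def matching_fields(row, word):
    """Return the names of the multi-valued fields in which `word` occurs."""
    return [
        field
        for field, values in row.items()
        if isinstance(values, list) and any(word in v for v in values)
    ]


# Hypothetical page holding two comment objects (numbers 0 and 1).
objects = [
    {"class": "XWikiComments", "number": 0,
     "properties": {"comment": "nice page"}},
    {"class": "XWikiComments", "number": 1,
     "properties": {"comment": "see the velocity macro"}},
]
row = flatten_document("Main.WebHome", objects)
hits = matching_fields(row, "velocity")
print(hits)
```

The hit names the class and the property, but the object number is gone:
both comments were merged into the same field of the same row.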

Using Solr, we might now know which field (propertyName) matched (with the
highlighting component), but we would probably need to add some object
ID/number information in there (like ClassName.0.propertyName:value) and
that might mess up the query syntax. We would have to write queries like
ClassName.*.propertyName:searchedWord and I don't remember if Solr supports
wildcards in field names (AFAIR, only in field values, except for the
catch-all *:* construction). AFAIR, this was the main reason why we had to
use dynamic fields and the whole multilingual work.
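
If field names cannot contain wildcards, one possible workaround is to
expand the pattern on the client side into an explicit OR over the object
numbers. The sketch below only builds the query string. The helper name and
the assumption that the object numbers are known in advance are made up for
illustration, and the searched word is assumed to need no escaping.

```python
def expand_object_query(class_name, property_name, word, object_numbers):
    """Build one ClassName.N.propertyName:word clause per object number, joined with OR."""
    clauses = [
        f"{class_name}.{number}.{property_name}:{word}"
        for number in object_numbers
    ]
    return " OR ".join(clauses)


# Hypothetical case: a document known to hold objects 0, 1 and 2.
query = expand_object_query("ClassName", "propertyName", "searchedWord", range(3))
print(query)
```

The query grows with the number of objects, which illustrates how putting
the object number in the field name complicates the query syntax.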

I also had my doubts about mapping properties as first-class Lucene
documents (or "tables", as previously referred to), but, now that I think
of it, it provides a solution for the example above. Maybe indexing just
objects would have sufficed as well. I don't know; more examples may or may
not exist, but it's good that we've started talking about it.
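
For comparison, the property-as-document mapping can be sketched as follows:
every property gets its own row, tagged with a type and with the reference
of the object that holds it, so a hit points straight at the object. The
field names (type, document, class, number, property, value) and the data
are hypothetical, not the actual XWiki Solr schema.

```python
def property_rows(doc_name, objects):
    """Emit one index row per object property instead of one row per document."""
    return [
        {
            "type": "object_property",
            "document": doc_name,
            "class": obj["class"],
            "number": obj["number"],
            "property": prop,
            "value": value,
        }
        for obj in objects
        for prop, value in obj["properties"].items()
    ]


def search(rows, word):
    """Return (document, class, object number, property) for each matching row."""
    return [
        (r["document"], r["class"], r["number"], r["property"])
        for r in rows
        if r["type"] == "object_property" and word in r["value"]
    ]


# Hypothetical page holding two comment objects (numbers 0 and 1).
objects = [
    {"class": "XWikiComments", "number": 0,
     "properties": {"comment": "nice page"}},
    {"class": "XWikiComments", "number": 1,
     "properties": {"comment": "see the velocity macro"}},
]
rows = property_rows("Main.WebHome", objects)
results = search(rows, "velocity")
print(results)
```

Each result carries the object number, at the cost of one index row per
property.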

Marius, if you're interested, I'm available this coming week for some
brainstorming on the subject, if that would help. Just let me know.

Thanks,
Eduard


On Fri, Oct 11, 2013 at 3:48 PM, Marius Dumitru Florea <
[email protected]> wrote:

> On Fri, Oct 11, 2013 at 1:26 PM, Ludovic Dubost <[email protected]> wrote:
> > Hi,
> >
> > From my point of view we usually search mostly for two types of things:
> >
> > - documents
> > - attachments
> >
> > But we should be able to filter these results on multiple property values
> > of any object. This is true also for documents and for attachments.
> > It is also interesting to be able to present results differently
> > depending on the document we get (if it's a meeting document or a user
> > document we display things differently)
>
> > Being able to search for attachments separately is very important.
>
> This was possible with the old Lucene index because we were indexing
> attachments in separate rows _but_ we were duplicating all document
> fields on the attachment row. So if you had a document with 2
> attachments then you had 3 rows associated in the Lucene index: one
> for the document itself and 2 for the attachments but the document
> fields were duplicated twice. Of course we can say we don't care about
> the index size (do we? :) ) but we must be careful to not get
> duplicated results because of the duplicated information in the index.
>
> Thanks,
> Marius
>
> >
> > As for objects most of the time we search for documents that have this
> > specific object.
> > There is however a use case I see where it could be interesting to search
> > in individual objects.
> > For example this is the case for comments. It could be interesting to
> > make a search in all comments.
> >
> > Another example could be tasks. Suppose you add tasks inside documents
> > associated to some content of the document (like annotations).
> > You might want to be able to make some nice search on all the tasks and
> > then display a link to the document in which the task is but not the
> > other way around.
> >
> > Now I think this use case could be optional, so we don't necessarily need
> > to index all objects of all classes. We could have some config which says
> > to build an index for all comment objects or all task objects. I think we
> > already had an object index in Lucene and I don't remember if we have
> > ever used it.
> >
> > I don't think we need an index on all properties.
> >
> > Ludovic
> >
> >
> >
> > 2013/10/11 Marius Dumitru Florea <[email protected]>
> >
> >> Hi devs,
> >>
> >> This is a very important question so think carefully. Let me explain:
> >>
> >> In XWiki (model) we have a few entity types. There are *wikis* which
> >> have *spaces* which have *documents*. A document can have *objects*
> >> and *attachments*. A document can also define a *class*.
> >>
> >> At the same time we like to say that in XWiki "everything is a
> >> document" because everything revolves around documents. The document
> >> is the central notion.
> >>
> >> We can query the database (using HQL or XWQL) for any of the
> >> previously mentioned entities but what should a Solr query return
> >> (semantically)? In other words:
> >>
> >> * are you searching for an object without caring about the document
> >> that holds the object? Same for an object property.
> >> * how often are you searching for an attachment without caring about
> >> the document that holds the attachment?
> >> * are you searching for a class or for the document that defines that
> >> class?
> >> * are you searching for a wiki without caring about the documents it
> >> contains? Same for a space.
> >>
> >> IMO the result of a Solr query should be, semantically, a list of
> >> documents. But maybe I'm wrong.
> >>
> >> -----------------------
> >> Technical Details
> >> -----------------------
> >>
> >> Unlike a relational database, a Solr/Lucene index has a single 'table'.
> >> So normally you index a single entity type. Each row in the index
> >> represents an entity of that type. As a consequence the result of a
> >> Solr query is semantically a list of entities of that type. In our
> >> case the entity type is (naturally) *document*.
> >>
> >> If you want to index more entity types (e.g. index attachments and
> >> objects _separately_, not as part of a document) then, since there is
> >> only one 'table' in the index, you need to add a 'type' column that
> >> specifies the type of entity you have on each row (e.g. type=document,
> >> type=attachment, type=object etc.). The result of a Solr query is now,
> >> semantically, a list of different entity types, unless you filter by a
> >> specific type. It smells like a hack to me.
> >>
> >> Let's imagine what happens if we want to search for blog posts that
> >> have a specific tag. With the first approach this is easy because all
> >> the (indexed) information is on a single row. With the second approach
> >> this is considerably more complex because the information is spread on
> >> multiple rows:
> >>
> >> * one row with type=document for the blog post document
> >> * one row with type=object for the blog post object
> >> * one row with type=object for the tag object
> >>
> >> In a relational database when you have the information spread in
> >> multiple places (tables) you do joins. Fortunately (you would say)
> >> Solr supports joins. In this particular case we would have to perform
> >> 2 joins which means:
> >>
> >> index X index X index
> >>
> >> where X represents the cartesian product. The document name would be
> >> the join key. Pretty complex even before trying to write this in Solr
> >> query syntax.
> >>
> >> So basically the question becomes: is it worth indexing more entities
> >> _separately_ instead of indexing just documents (with info about their
> >> objects and attachments) considering the complexity that it brings in
> >> writing Solr queries? Do we search for objects and attachments alone
> >> as separate entities often enough to justify this complexity? My
> >> answer is no.
> >>
> >> Thanks,
> >> Marius
> >> _______________________________________________
> >> devs mailing list
> >> [email protected]
> >> http://lists.xwiki.org/mailman/listinfo/devs
> >>
> >
> >
> >
> > --
> > Ludovic Dubost
> > Founder and CEO
> > Blog: http://blog.ludovic.org/
> > XWiki: http://www.xwiki.com
> > Skype: ldubost GTalk: ldubost
