Re: Proposal : index store - Lucene

Daniel Florey Mon, 19 Jan 2004 08:37:23 -0800

I'm not familiar with the searching facilities so far, but I think there is a 
big difference between searching and sorting/filtering.
So Lucene seems to be perfect for searching the repository (properties and 
content). What we need (and what is somewhat harder to achieve without a DB) is 
some kind of sorting/filtering of the search results.
I think of the following use cases:
- Search the repository for the last x uploaded documents sorted by date
- Find all documents since xxx containing yyy
- Find the documents containing xxx in the labeled revision yyy
- Find only documents the user is allowed to read
- What about searching within transactions? Should the search be 
transaction-aware (so that the user finds documents uploaded within a 
transaction but other users don't)?
I've no idea how to handle this, but I think these are some things to think 
about...
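
To make the first few use cases concrete, here is a minimal sketch of
sorting/filtering applied after the full-text search has returned its hits.
The Hit class, its fields and post_process() are hypothetical names made up
for illustration; they are not part of Slide or Lucene, and transaction
awareness is not addressed at all.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Hit:
    """One raw search hit (hypothetical shape, for illustration only)."""
    uri: str
    uploaded: date
    readers: set = field(default_factory=set)  # users allowed to read it


def post_process(hits, user, since=None, limit=None):
    """Keep hits the user may read (optionally since a date), newest first."""
    visible = [
        h for h in hits
        if user in h.readers and (since is None or h.uploaded >= since)
    ]
    visible.sort(key=lambda h: h.uploaded, reverse=True)
    return visible if limit is None else visible[:limit]


hits = [
    Hit("/docs/a.txt", date(2004, 1, 10), {"daniel"}),
    Hit("/docs/b.txt", date(2004, 1, 18), {"daniel", "ollie"}),
    Hit("/docs/c.txt", date(2004, 1, 15), {"ollie"}),
]
# "The last x uploaded documents, sorted by date, that daniel may read":
latest = post_process(hits, "daniel", limit=1)
```

Filtering after the search is the simplest approach, but it means the engine
may return many hits that are thrown away again, which is exactly why doing
it without a DB is the harder part.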
Daniel


On Monday, 19 January 2004 15:12, Michael Oliver wrote:
> On Mon, 2004-01-19 at 06:32, Stefano Mazzocchi wrote:
> > On 18 Jan 2004, at 22:12, Christophe wrote:
> > > Stefano Mazzocchi wrote:
> > >>> If you store your properties in one store (e.g. a DB) and use an index
> > >>> store engine for content search, I expect some performance issues
> > >>> when you search on both properties and content.
> > >>
> > >> hmmm, not sure I follow you, can you elaborate on this more? it would
> > >> be very appreciated.
> > >
> > > How do you make a query that uses criteria on properties and full-text
> > > search?
> >
> > eh, good question :-)
> >
> > > If the properties/metadata are in a DB and the content is tokenized into
> > > an index engine like Lucene, you first need to select rows from the DB
> > > tables and then make a second query against the index store to search
> > > the content itself.
> > > For this kind of scenario (search on properties AND full-text search), I
> > > expect a single query via Lucene to be faster. Lucene can store
> > > properties that will not be tokenized. Anyway, it is not an ideal
> > > situation because the properties have to be duplicated in two different
> > > stores. So I don't know what the best solution is!
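
The two-step scenario described above can be sketched roughly as follows.
This is only an illustration: the dict named properties stands in for the DB
tables, content_index stands in for the full-text index, and
query_two_stores() is a made-up helper, not an existing Slide or Lucene API.

```python
# Hypothetical stand-ins: a property table (the "DB") and an inverted index.
properties = {
    "/docs/a.txt": {"author": "christophe"},
    "/docs/b.txt": {"author": "stefano"},
    "/docs/c.txt": {"author": "christophe"},
}
content_index = {
    "lucene": {"/docs/a.txt", "/docs/b.txt"},
    "slide": {"/docs/c.txt"},
}


def query_two_stores(prop_name, prop_value, term):
    """Combine a property criterion with a full-text term across two stores."""
    # Step 1: the equivalent of a SELECT on the property tables.
    by_property = {
        uri for uri, props in properties.items()
        if props.get(prop_name) == prop_value
    }
    # Step 2: a separate lookup in the full-text index.
    by_content = content_index.get(term, set())
    # Step 3: the caller has to join the two result sets itself.
    return by_property & by_content


result = query_two_stores("author", "christophe", "lucene")
```

The join in step 3 is where the expected performance cost comes from: both
intermediate result sets have to be materialized before they can be
intersected.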
> >
> > I think we are attacking the problem from the wrong angle: first we
> > need to collect usecases, then we need to find a way to make the
> > usecase possible.
> >
> > I personally wouldn't know how to make use of a query against full text
> > *and* properties. This is because such a query looks weird to me:
> > full-text is the least structured thing possible (get me everything but I
> > don't know where) while properties tend to be very much structured
> > (last modified time, author, and so on).
> >
> > There is a decades long discussion on what is data and what is metadata
> > and I don't want to touch that with a stick, but I think that if you
> > need to do full-text search on your metadata there is something wrong.
>
> Stefano, with all due respect, there is nothing wrong with a full-text
> search on metadata, because metadata in this case can be any property
> of any resource in the repository, and that metadata can be free-form
> text.
>
> consider a search query like
>
> doctype="memo" and description contains "Fire Stefano" and contents
> contains "January"
>
> doctype and description are properties with string values that would be
> indexed and matched with the same index as the contents.
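
A toy version of that query against a single index might look like the sketch
below. It assumes doctype is indexed as one untokenized term while description
and contents are split into lowercase words; add_document(), contains() and
search() are names invented for this example, and "contains" is simplified to
"all words present", ignoring word order.

```python
from collections import defaultdict

# (field, term) -> set of document URIs; one index for props and contents.
index = defaultdict(set)


def add_document(uri, doctype, description, contents):
    """Index one resource: doctype as a single term, the rest tokenized."""
    index[("doctype", doctype)].add(uri)
    for fld, text in (("description", description), ("contents", contents)):
        for token in text.lower().split():
            index[(fld, token)].add(uri)


def contains(fld, phrase):
    """URIs whose field holds every word of the phrase (order is ignored)."""
    sets = [index.get((fld, word), set()) for word in phrase.lower().split()]
    return set.intersection(*sets) if sets else set()


def search(doctype, description_phrase, contents_phrase):
    """doctype = X and description contains Y and contents contains Z."""
    return (
        index.get(("doctype", doctype), set())
        & contains("description", description_phrase)
        & contains("contents", contents_phrase)
    )


add_document("/memos/1", "memo", "Fire Stefano", "Effective January")
add_document("/memos/2", "memo", "Hire Stefano", "Effective January")
add_document("/reports/1", "report", "Fire Stefano", "Due in January")

matches = search("memo", "Fire Stefano", "January")
```

Because all three criteria are answered by the same structure, the whole
query is a handful of set intersections with no join across stores.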
>
> Not everybody uses the Database Stores; some actually prefer the XML
> Stores, so an index of the XML should be full-text, yes?
>
> > But this is my very personal vision, of course, and I would like to see
> > what other usecases or scenarios others can come up with before stating
> > where to go.
> >
> > >>> Anyway, do you have any ideas for optimizing the current search
> > >>> service?
> > >>
> > >> I haven't looked into this yet (I'm still lagging behind on some other
> > >> issues with my project and I haven't attacked this part yet).
> > >>
> > >> The idea is to use an RDBMS as much as possible on all content that
> > >> can be turned relational without major issues (and normally metadata
> > >> fits this category). As for full-text search, I agree that there is
> > >> no way to beat an engine like Lucene.
> > >
> > > Agreed! I understand your point of view: the best way to query on
> > > properties is certainly the classic select statement, but if you need an
> > > index/search engine for full-text search, I don't know.
> >
> > I personally had this vision before: DASL allows you to select the
> > search language. We already provide the DASL basic-search, nothing
> > stops us from coming up with an entirely new Lucene-influenced
> > full-text language that works only on the files' contents.
> >
> > So, you do different queries depending on how you want to treat the
> > content.
> >
> > > Furthermore, as Erik explained in a previous mail, you can write a
> > > filter to apply security rules. So, with one query against only one
> > > store, you can filter on props, content and security rules.
> > > Can you do that without storing properties in the search engine?
> > > I'm curious :-)
>
> For clarity: indexing properties as they go into the store isn't the same
> as storing properties in the search engine/index. In other words, the
> indexer for properties and content just needs access to the data as it
> is being stored. It doesn't affect the stores beyond that call, and the
> cost of the call can be minimized with an indexing queue that is
> processed asynchronously.
>
> > You could, I think, but it would be tremendously slow by comparison.
> >
> > > It would be interesting to compare both solutions in more detail,
> > > run performance tests, ...
> > >
> > >>> Why not support both situations: either index the props or not?
> > >>
> > >> You mean with a global configuration or more granularly?
> > >
> > > Still thinking about that :-)  The idea is to use the domain.xml file to
> > > define how to run the query on props and which options are used for the
> > > full-text search.
> >
> > I think we need to attack the store/indexing problem from the scenario
> > angle down... or we'll go around in circles for a long time. of course,
> > I'm not talking about Slide 2.0 but something to do after the release
> > is done.
>
> I completely agree, a few scenarios/stories should be the first step and
> I hope the example I gave above fits in that category.
>
> Ollie
>
> > > Thanks for this mail,
> >
> > You are welcome.
> >
> > --
> > Stefano.
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]

