Forwarded: InstantiatedIndex questions

David Causse Tue, 06 Oct 2009 09:55:38 -0700

Hi,

Karl prefers to answer on the mailing list, so here is the information he
asked for about how we use InstantiatedIndex.


----- Forwarded message from David Causse <dcau...@spotter.com> -----

Date: Tue, 6 Oct 2009 15:45:57 +0200
From: David Causse <dcau...@spotter.com>
To: Karl Wettin <ka...@apache.org>
Subject: Re: InstantiatedIndex questions

Hi,

Sorry for the delay.

We upgraded to the new Attribute API, which InstantiatedIndex did not
support at the time; now, with 2.9, InstantiatedIndex works great with it.
We use it because we build volatile indexes on small document sets (1 to
200 documents) and run large, complex queries against them. This happens
in a document flow over a messaging architecture; it is a sort of routing
table.
We use only IndexReader (IR) and IndexWriter (IW). We have built our own
query system, which is a bit similar to SpanQuery because we make
intensive use of term positions.
We do proximity searches based on standard term positions, but also on
payload information such as phrase id and/or paragraph id, plus some
generic data that lets us add relationships inside the index.

So what is important for us is fast indexing and very fast queries.
Querying Lucene, for us, means:
        - TermEnum iteration to do query rewriting/optimization
        - TermDocs/TermPositions iteration

At index time InstantiatedIndex is slower than RAMDirectory, but the time
gained on queries makes it the better choice (from what I see, it can be
2 times faster).

InstantiatedIndex will be our default volatile mini index store for our
next production release.

The need for serialization no longer applies: we prefer to re-index a
pre-analyzed token stream and keep control of the bits with Externalizable.
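
A minimal sketch of what keeping control of the bits with Externalizable
can look like for a pre-analyzed field. The class PreAnalyzedField, its
fields, and the toBytes/fromBytes helpers are hypothetical names used for
illustration only; they are not the code discussed in this message. Only
the java.io calls are standard.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;

// Hypothetical pre-analyzed field: terms and their positions, written by hand
// so that the wire format is fully under our control.
class PreAnalyzedField implements Externalizable {
    String[] terms = new String[0];
    int[] positions = new int[0];

    // Externalizable requires a public no-arg constructor.
    public PreAnalyzedField() {}

    PreAnalyzedField(String[] terms, int[] positions) {
        if (terms.length != positions.length) {
            throw new IllegalArgumentException("terms and positions must have the same length");
        }
        this.terms = terms;
        this.positions = positions;
    }

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        // Layout: token count, then (term, position) pairs.
        out.writeInt(terms.length);
        for (int i = 0; i < terms.length; i++) {
            out.writeUTF(terms[i]);
            out.writeInt(positions[i]);
        }
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException {
        int n = in.readInt();
        terms = new String[n];
        positions = new int[n];
        for (int i = 0; i < n; i++) {
            terms[i] = in.readUTF();
            positions[i] = in.readInt();
        }
    }

    // Serialize to a byte[] that can be shipped to another node.
    static byte[] toBytes(PreAnalyzedField field) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(field);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    // Rebuild the field from bytes; the receiver would then re-index it.
    static PreAnalyzedField fromBytes(byte[] data) {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (PreAnalyzedField) ois.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

In this sketch the receiving side would feed the restored terms and
positions back through its own TokenStream and re-index them, rather than
deserializing a whole index.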

We would have other uses for this index, but the lack of addIndexes
support makes it impossible for us to use it in those situations, so we
continue to use RAMDirectory there.

Do you think we could match RAMDirectory's indexing time by tuning the
initial capacity of the java.util collections you use?
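
To make the question concrete, here is a generic sketch of pre-sizing a
java.util collection so that it does not have to grow while it is being
filled. PresizedCollections and its methods are hypothetical helpers, not
part of Lucene or InstantiatedIndex, and whether this would actually close
the gap with RAMDirectory is exactly what is being asked, not something
this sketch shows.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: pick initial capacities from an expected entry count.
class PresizedCollections {
    // HashMap resizes once size exceeds capacity * loadFactor (0.75 by default),
    // so ask for enough capacity to hold expectedEntries without a resize.
    static int capacityFor(int expectedEntries) {
        return (int) (expectedEntries / 0.75f) + 1;
    }

    static <K, V> Map<K, V> newHashMap(int expectedEntries) {
        return new HashMap<K, V>(capacityFor(expectedEntries));
    }
}
```

For an index of at most 200 documents, a map keyed by document would be
created with PresizedCollections.newHashMap(200).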

Many thanks for your excellent work.

PS. I posted some (not really useful) debug output to the lucene-users
mailing list.

On Fri, Dec 12, 2008 at 04:15:56PM +0100, Karl Wettin wrote:
> Hi David,
>
> the fixes for the problems you reported are now committed to the trunk.
> As InstantiatedIndex is a new module, it would be very interesting to
> hear how InstantiatedIndex works for you and perhaps a little bit about
> how you use it.
>
>
>     karl
>
> On 19 Nov 2008, at 15:02, David Causse wrote:
>
> > Hi Karl,
> >
> > The reset() problem is not a big issue; I can adapt our TokenStreams.
> > As for serialization: since we need to share very small indexes (200
> > docs max) across a cluster, we need to serialize something.
> > I was planning to use Java serialization, maybe with some compression
> > of the resulting byte[], and since InstantiatedIndex is Serializable I
> > was hoping to benefit from the performance gain of your implementation
> > in our context.
> > I will fix my working copy as you suggested.
> >
> > Thank you.
> >
> > David.
> >
> > karl wettin wrote:
> >> Hi David,
> >>
> >> thanks for the report! I suppose you speak of IndexWriter vs
> >> InstantiatedIndexWriter? These are definitely considered discrepancy
> >> problems. I've created a new issue in the tracker:
> >> http://issues.apache.org/jira/browse/LUCENE-1462
> >>
> >> For what reason do you try to serialize the InstantiatedIndex? Could
> >> you perhaps use FSDirectory and IndexWriter instead, and then each
> >> time you update that index you replace your InstantiatedIndex with a
> >> new one built with the InstantiatedIndex constructor that takes an
> >> IndexReader?
> >>
> >> I'm afraid that I'm rather busy at the moment but I'll try to fix it
> >> ASAP. It should however be rather easy to fix if you just want to
> >> solve the specific problem: reset all pre-tokenized streams before
> >> they are tokenized in InstantiatedIndexWriter#addDocument and make
> >> TermVectorOffsetInfo implement Serializable.
> >>
> >>
> >>     karl
> >>
> >> On Wed, Nov 19, 2008 at 11:00 AM, David Causse <dcau...@spotter.com> wrote:
> >>
> >>> Hi,
> >>>
> >>> Here are some differences I noticed between InstantiatedIndex and
> >>> RAMDirectory:
> >>>
> >>> - RAMDirectory seems to call reset() on TokenStreams the first time,
> >>> which allows some objects to be initialised before streaming starts;
> >>> InstantiatedIndex does not.
> >>> - I can serialize a RAMDirectory but not an InstantiatedIndex,
> >>> because of: java.io.NotSerializableException:
> >>> org.apache.lucene.index.TermVectorOffsetInfo
> >>>
> >>> Do you consider these problems or normal behaviour?
> >>>
> >>> Thank you.
> >>>
> >>> David.
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >>> For additional commands, e-mail: java-user-h...@lucene.apache.org

-- 
David Causse
Spotter
http://www.spotter.com/

----- End forwarded message -----

-- 
David Causse
Spotter
http://www.spotter.com/
