Re: Flex indexing: Hybrid index maintenance for faster indexing

Michael McCandless Tue, 05 Oct 2010 16:47:29 -0700

I'm also very curious how your index will be affected by upgrading to flex.


I know the terms index gets a lot smaller :)  But I'm curious about the
query performance...

I haven't seen that paper -- it sounds very interesting!  I'll add it to
my reading list...

Mike

On Tue, Oct 5, 2010 at 11:12 AM, Burton-West, Tom <tburt...@umich.edu> wrote:
> Thanks Mike,
>
> I suspected the approach might require architectural changes beyond flex, but 
> since our indexes are so huge and disk I/O is our main bottleneck both for 
> searching and indexing, I'm always looking for ways to deal with very large 
> postings and positions lists that might reduce I/O.
>
> I haven't looked in detail at PFOR, Simple9, and some of the other new
> encodings, but my understanding is that they trade compression ratio for
> decompression speed, i.e. they take up a bit more space but are more
> efficient to decompress. In our case, where the CPU is underutilized,
> mostly because the processors are waiting on disk I/O, I'll be curious to
> find out whether the slight increase in disk I/O time due to lower
> compression is still outweighed by the increase in decompression speed.
> (Don't know if we'll find the time to try flex for a while though :)
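
To make the space side of that trade-off concrete, here is a minimal Java
sketch that compares the size of one block of doc-ID deltas under vInt coding
and under a single fixed bit width per block, which is the basic idea behind
FOR. The class and method names are hypothetical and are not Lucene APIs, the
one-byte width header is an assumption, and the sketch says nothing about
decode speed or about how PFOR patches exceptions.

```java
final class PostingsSizeSketch {

    /** Bytes needed to store one value as a vInt (7 payload bits per byte). */
    static int vIntBytes(int value) {
        int bytes = 1;
        while ((value & ~0x7F) != 0) {
            value >>>= 7;
            bytes++;
        }
        return bytes;
    }

    /** Total bytes for a block of deltas, each written as its own vInt. */
    static int vIntBlockBytes(int[] deltas) {
        int total = 0;
        for (int d : deltas) {
            total += vIntBytes(d);
        }
        return total;
    }

    /**
     * Total bytes when every delta in the block uses the bit width of the
     * largest delta, plus one (assumed) header byte that records the width.
     */
    static int fixedWidthBlockBytes(int[] deltas) {
        int max = 0;
        for (int d : deltas) {
            max = Math.max(max, d);
        }
        int bits = Math.max(1, 32 - Integer.numberOfLeadingZeros(max));
        return 1 + (deltas.length * bits + 7) / 8;
    }

    public static void main(String[] args) {
        int[] uniform = {3, 1, 7, 2, 5, 1, 4, 6};
        int[] withOutlier = {3, 1, 7, 2, 300, 5, 1, 4};
        System.out.println("uniform:  vInt=" + vIntBlockBytes(uniform)
                + " fixed=" + fixedWidthBlockBytes(uniform));
        System.out.println("outlier:  vInt=" + vIntBlockBytes(withOutlier)
                + " fixed=" + fixedWidthBlockBytes(withOutlier));
    }
}
```

In this toy block, small uniform deltas pack into 4 bytes at a fixed width
versus 8 bytes as vInts, while a single large delta widens every slot and
pushes the fixed-width block to 10 bytes versus 9. Which encoding ends up
smaller on disk therefore depends on the delta distribution of the real
postings.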
>
>
> BTW: have you seen this paper looking at 64-bit words?
>  "Index Compression Using 64-Bit Words", Anh, Moffat. Software -- Practice & 
> Experience, 40(2):131-148, February 2010
>
>
> Tom
> -----Original Message-----
> From: Michael McCandless [mailto:luc...@mikemccandless.com]
> Sent: Tuesday, October 05, 2010 6:21 AM
> To: dev@lucene.apache.org
> Subject: Re: Flex indexing: Hybrid index maintenance for faster indexing
>
> Nice paper!
>
> It's a neat trick to index the large postings as separate files, i.e.
> let the filesystem handle the growth as new postings are appended
> over time.
>
> But, unfortunately, we can't easily do this in Lucene, since Lucene
> assumes index files are write-once, and derives its transactional
> semantics from this approach.  I.e., this would require sizable changes,
> beyond just swapping in a different Codec.
>
> Still, the idea that small/big postings lists should be handled
> differently is something we can take advantage of in a Codec, and I
> think we should.  I think we will likely switch to a default codec
> that uses pulsing (storing a term's postings directly in the terms dict)
> for very low freq terms, maybe vInt for medium freq terms, and FOR/PFOR
> for high freq terms.
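
The tiering described above can be sketched as a simple dispatch on a term's
document frequency. This is only an illustration in Java: the class, the enum,
and both cutoff values are hypothetical, chosen for the example, and do not
come from any Lucene codec.

```java
final class TieredPostingsSketch {

    /** The three encodings named in the message above. */
    enum Encoding { PULSING, VINT, FOR_PFOR }

    // Hypothetical cutoffs, picked only so the example has concrete numbers.
    static final int PULSING_MAX_DOC_FREQ = 1;
    static final int VINT_MAX_DOC_FREQ = 128;

    /** Picks an encoding for a term from its document frequency. */
    static Encoding choose(int docFreq) {
        if (docFreq <= PULSING_MAX_DOC_FREQ) {
            return Encoding.PULSING;   // inline the postings in the terms dict
        }
        if (docFreq <= VINT_MAX_DOC_FREQ) {
            return Encoding.VINT;      // variable-length bytes per delta
        }
        return Encoding.FOR_PFOR;      // block-based bit packing
    }

    public static void main(String[] args) {
        for (int docFreq : new int[] {1, 50, 100000}) {
            System.out.println(docFreq + " -> " + choose(docFreq));
        }
    }
}
```

Where the cutoffs should actually sit is an open tuning question that would
need measurement on a real index.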
>
> Mike
>
> On Mon, Oct 4, 2010 at 6:42 PM, Burton-West, Tom <tburt...@umich.edu> wrote:
>> Hi all,
>>
>> Would it be possible to implement something like this in Flex?
>>
>>
>> Büttcher, S., & Clarke, C. L. A. (2008). Hybrid index maintenance for 
>> contiguous inverted lists. Information Retrieval, 11(3), 175-207. 
>> doi:10.1007/s10791-007-9042-8
>>
>> The approach uses a different policy for flushing the buffer and writing
>> to disk depending on whether a postings list is large (i.e. a frequent
>> term) or small.
>>
>>
>> Tom Burton-West
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org

