I'm also very curious how your index will be affected by upgrading to flex.
I know the terms index gets a lot smaller :) But I'm curious about the
query performance...

I haven't seen that paper -- it sounds very interesting! I'll add it to
the list to read...

Mike

On Tue, Oct 5, 2010 at 11:12 AM, Burton-West, Tom <tburt...@umich.edu> wrote:
> Thanks Mike,
>
> I suspected the approach might require architectural changes beyond flex, but
> since our indexes are so huge and disk I/O is our main bottleneck both for
> searching and indexing, I'm always looking for ways to deal with very large
> postings and positions lists that might reduce I/O.
>
> I haven't looked in detail into PFOR and Simple9 and some of the other new
> encodings, but my understanding is that they trade off compression for
> decompression speed, i.e. they take up a bit more space, but are more
> efficient to decompress. In our case, where we have underutilized CPU,
> mostly because the processors are waiting on disk I/O, I'll be curious to
> find out whether the slight increase in disk I/O time due to lower
> compression is still outweighed by the increase in decompression speed.
> (Don't know if we'll find the time to try flex for a while though:)
>
> BTW: have you seen this paper looking at 64-bit words?
> "Index Compression Using 64-Bit Words", Anh, Moffat. Software -- Practice &
> Experience, 40(2):131-148, February 2010
>
> Tom
>
> -----Original Message-----
> From: Michael McCandless [mailto:luc...@mikemccandless.com]
> Sent: Tuesday, October 05, 2010 6:21 AM
> To: dev@lucene.apache.org
> Subject: Re: Flex indexing : Hybrid index maintnenance for faster indexing
>
> Nice paper!
>
> It's a neat trick to index the large postings as separate files, ie
> let the filesystem handle the growth as new postings are appended
> over time.
>
> But, unfortunately, we can't easily do this in Lucene, since Lucene
> assumes index files are write once, and derives its transactional
> semantics from this approach. Ie, this would require sizable changes,
> beyond just swapping in a different Codec.
>
> Still, the idea that small/big postings lists should be handled
> differently is something we can take advantage of in a Codec, and I
> think we should. I think likely we will switch to a default codec
> that uses pulsing (storing a term's postings directly in the terms dict)
> for very low freq terms, maybe vInt for medium freq terms, and FOR/PFOR
> for high freq terms.
>
> Mike
>
> On Mon, Oct 4, 2010 at 6:42 PM, Burton-West, Tom <tburt...@umich.edu> wrote:
>> Hi all,
>>
>> Would it be possible to implement something like this in Flex?
>>
>> Büttcher, S., & Clarke, C. L. A. (2008). Hybrid index maintenance for
>> contiguous inverted lists. Information Retrieval, 11(3), 175-207.
>> doi:10.1007/s10791-007-9042-8
>>
>> The approach takes advantage of having a different policy for large postings
>> lists (ie frequent terms) versus small postings lists for flushing the
>> buffer and writing to disk.
>>
>> Tom Burton-West

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
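
For anyone following the thread, here is a rough sketch of the "different
encoding per term frequency" idea discussed above (pulsing for very low freq
terms, vInt for medium, FOR/PFOR for high). It is not Lucene code and not the
planned default codec: the thresholds and the names choose_encoding, to_gaps,
vint_encode and vint_decode are all made up for illustration. The vInt part
assumes the usual variable-byte layout (7 data bits per byte, low-order bits
first, high bit set when more bytes follow).

```python
from typing import List

# Hypothetical cutoffs, picked only for illustration; not Lucene's values.
PULSING_MAX_DOC_FREQ = 1
VINT_MAX_DOC_FREQ = 128


def choose_encoding(doc_freq: int) -> str:
    """Pick a postings encoding from a term's document frequency."""
    if doc_freq <= PULSING_MAX_DOC_FREQ:
        return "pulsing"  # inline the postings in the terms dictionary
    if doc_freq <= VINT_MAX_DOC_FREQ:
        return "vint"     # variable-byte encoding of doc ID gaps
    return "for"          # fixed-width blocks (FOR/PFOR family)


def to_gaps(doc_ids: List[int]) -> List[int]:
    """Turn sorted doc IDs into gaps; the first ID is kept as-is."""
    gaps = []
    prev = 0
    for doc_id in doc_ids:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps


def vint_encode(values: List[int]) -> bytes:
    """Encode non-negative ints, 7 bits per byte, low-order bits first."""
    out = bytearray()
    for v in values:
        while v >= 0x80:
            out.append((v & 0x7F) | 0x80)  # high bit: more bytes follow
            v >>= 7
        out.append(v)
    return bytes(out)


def vint_decode(data: bytes) -> List[int]:
    """Inverse of vint_encode."""
    values = []
    cur = 0
    shift = 0
    for b in data:
        cur |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7
        else:
            values.append(cur)
            cur, shift = 0, 0
    return values
```

Small gaps cost one byte each under vInt, which suits medium-length lists. For
very long lists, a fixed-width block scheme gives up some space in exchange
for cheaper decoding, which is the tradeoff Tom raises above.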