Questions about doc store files (.cfx)

2009-11-09 Thread Michael Busch
Hi, I'm wondering about the benefits of having the .cfx files. The main advantage is that you avoid merging (copying) stored fields and TermVectors during segment merge, right? And I think .cfx files are only shared across segments if the same IndexWriter is used to flush multiple segments an

Re: Questions about doc store files (.cfx)

2009-11-09 Thread Michael McCandless
I think you're asking about the benefit of using "shared doc stores" at all? CFX is just the compound format of these shared files; if compound file is off, then they are still shared, just as separate (.fdx/t, .tvx/d/f) files. For building up a single large index, I suspect the win is sizable, i

Re: Questions about doc store files (.cfx)

2009-11-09 Thread Michael Busch
On 11/9/09 2:56 AM, Michael McCandless wrote:
> I think you're asking about the benefit of using "shared doc stores" at all? CFX is just the compound format of these shared files; if compound file is off, then they are still shared, just as separate (.fdx/t, .tvx/d/f) files.

Oh yeah, that's

Re: Questions about doc store files (.cfx)

2009-11-09 Thread Michael McCandless
On Mon, Nov 9, 2009 at 10:10 AM, Michael Busch wrote:
>> I think you're asking about the benefit of using "shared doc stores" at
>> all?
>>
>> CFX is just the compound format of these shared files; if compound
>> file is off, then they are still shared, just as separate (.fdx/t,
>> .tvx/d/f) files

Re: Questions about doc store files (.cfx)

2009-11-09 Thread Michael Busch
On 11/9/09 9:00 AM, Michael McCandless wrote:
> Alas, I don't have any benchmarks offhand... if you want to run one, you should be able to hardwire flushDocStores=true in IndexWriter.doFlushInternal? I think that'd turn off the sharing without breaking things (run the tests to be sure ;) ).

Re: Questions about doc store files (.cfx)

2009-11-09 Thread Michael Busch
On 11/9/09 5:40 PM, Michael Busch wrote: I think that should be ok with parallel indexing, as long as we can always select all corresponding segments from *all* parallel indexes for a merge to keep the docIds in sync. That actually leads me to another question: Let's say you have three segmen

Re: Questions about doc store files (.cfx)

2009-11-10 Thread Michael McCandless
On Tue, Nov 10, 2009 at 12:06 AM, Michael Busch wrote:
> On 11/9/09 5:40 PM, Michael Busch wrote:
>>
>> I think that should be ok with parallel indexing, as long as we can always
>> select all corresponding segments from *all* parallel indexes for a merge to
>> keep the docIds in sync.
>>
>> That

Re: Questions about doc store files (.cfx)

2009-11-10 Thread Michael Busch
On 11/10/09 1:57 AM, Michael McCandless wrote:
> I think this is exactly what happens?

I wrote a small test program that creates a situation like mentioned above in the "expungeDelete" scenario. It ends up with a docstore containing docs from two segments, but after expungeDeletes only one segmen

Re: Questions about doc store files (.cfx)

2009-11-10 Thread Michael McCandless
On Tue, Nov 10, 2009 at 1:18 PM, Michael Busch wrote:
> I talked to Marvin on ApacheCon; in Lucy he wants to have all the compound
> file support in the store package, separately from the indexer. I think that
> would make sense in Lucene too, there's not really the need to have it
> tightly inte