I know of two separate problems in indexing overlapping parts of PDFs:
1) block-structured documents of a) the entire PDF file b) chapters c) sections of chapters d...z)
2) Tracking the set of pages that each document contains.

As I understand this, LUCENE-2324 handles the first case but not the second. True?

On Sat, May 8, 2010 at 10:37 AM, Michael Busch <busch...@gmail.com> wrote:
> On 5/8/10 3:10 AM, Mark Harwood wrote:
>>
>> The downside is the need to maintain sequences of related docs in the same
>> segment - something Lucene currently doesn't make easy with its limited
>> control over when segments are flushed. I suspect we'll need some discussion
>> on how best to support this.
>>
>
> LUCENE-2324 should help to make this work even when you add documents with
> multiple threads. There will be one DocumentsWriter per thread (DWPT), and
> the different DWPTs will write to their own segments. We will also have an
> extension point to control thread binding. Then you can make sure that all
> parts of your compound document end up sequentially in the same segment.
>
> One thing we have to make sure though is that a DWPT doesn't flush "between"
> different parts of your compound doc. Hmm, we might have to add a "flush
> policy" to our growing family of policies.
>
> Michael
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>

--
Lance Norskog
goks...@gmail.com
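
P.S. To make the two cases concrete, here is a rough sketch in plain Python. It does not use any Lucene API; the `Block` class, `flatten`, `blocks_containing`, and the page numbers are all hypothetical, made up only for illustration. `flatten` stands for case 1 (the whole PDF, its chapters, and their sections emitted as one contiguous run of entries, which is the run that would have to stay in one segment), and `blocks_containing` stands for case 2 (each entry knows its own set of pages, and the sets overlap).

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Block:
    """One indexable unit: the whole PDF, a chapter, or a section (hypothetical model)."""
    label: str
    first_page: int
    last_page: int  # inclusive
    children: List["Block"] = field(default_factory=list)

    def pages(self) -> Set[int]:
        # The set of pages this unit covers; units may overlap.
        return set(range(self.first_page, self.last_page + 1))


def flatten(block: Block) -> List[Block]:
    """Case 1: emit a block and all its descendants as one contiguous sequence."""
    out = [block]
    for child in block.children:
        out.extend(flatten(child))
    return out


def blocks_containing(page: int, sequence: List[Block]) -> List[str]:
    """Case 2: find every overlapping unit whose page set contains the page."""
    return [b.label for b in sequence if page in b.pages()]


# Made-up example document: page 5 is shared by two sections.
pdf = Block("pdf", 1, 20, [
    Block("chapter 1", 1, 12, [
        Block("section 1.1", 1, 5),
        Block("section 1.2", 5, 12),
    ]),
    Block("chapter 2", 13, 20),
])

sequence = flatten(pdf)
print([b.label for b in sequence])
print(blocks_containing(5, sequence))
```

In this toy model a search hit on page 5 belongs to four entries at once (the PDF, chapter 1, and both sections), which is the bookkeeping I mean by the second case.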