[ https://issues.apache.org/jira/browse/LUCENE-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12733822#action_12733822 ]
Michael McCandless commented on LUCENE-1076:
--------------------------------------------

bq. if I merge two consecutive segments, how come I don't merge their doc stores

Multiple segments are able to share a single set of doc-store
(= stored fields & term vectors) files, today. This only happens when
multiple segments are written in a single IndexWriter session with
autoCommit=false.

E.g., if I open a writer, index all of wikipedia w/ autoCommit false,
and close it, you'll see a single large set of doc store files
(e.g. _0.fdt, _0.fdx, _0.tvf, _0.tvd, _0.tvx). Whenever RAM is full
(with postings & norms data), a new segment is flushed, but the doc
store files are kept open & shared with further flushed segments.
Each such segment then refers to the shared doc stores, but records
its own "offset" within them.

So, when we merge contiguous segments, because the resulting docs are
also contiguous in the doc stores, we are able to store a single doc
store offset in the merged segment, referencing the original doc
store, and it works fine.

But if we merge non-contiguous segments, we must then pull out & merge
the "slices" from the doc stores into a new [private to the new
segment] set of doc store files. For apps that store term vectors w/
positions & offsets, and have many stored fields, and have
heterogeneous field name -> number assignments (see LUCENE-1737 to fix
that), the merging of doc stores can easily dominate the merge cost.

> Allow MergePolicy to select non-contiguous merges
> -------------------------------------------------
>
>                 Key: LUCENE-1076
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1076
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.3
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-1076.patch
>
>
> I started work on this but with LUCENE-1044 I won't make much progress
> on it for a while, so I want to checkpoint my current state/patch.
> For backwards compatibility we must leave the default MergePolicy as
> selecting contiguous merges. This is necessary because some
> applications rely on "temporal monotonicity" of doc IDs, which means
> even though merges can re-number documents, the renumbering will
> always reflect the order in which the documents were added to the
> index.
> Still, for those apps that do not rely on this, we should offer a
> MergePolicy that is free to select the best merges regardless of
> whether they are contiguous. This requires fixing IndexWriter to
> accept such a merge, and fixing LogMergePolicy to optionally allow
> it the freedom to do so.
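To make the doc-store argument in the comment concrete, here is a rough
sketch of the contiguity check it implies. This is not Lucene's actual
code: the class DocStoreSketch, the Seg type, its fields, and the method
sharedOffsetAfterMerge are all hypothetical names chosen for
illustration, and the sketch ignores deletions and any other conditions
that may force a doc-store merge in practice.

```java
import java.util.List;

/** Hypothetical illustration only; these are not Lucene classes. */
final class DocStoreSketch {

    /** Minimal stand-in for per-segment metadata (all names are assumptions). */
    static final class Seg {
        final String docStoreName; // shared doc-store prefix, e.g. "_0" for _0.fdt, _0.fdx, ...
        final int docStoreOffset;  // position of this segment's first doc in the shared doc store
        final int docCount;        // number of docs in this segment

        Seg(String docStoreName, int docStoreOffset, int docCount) {
            this.docStoreName = docStoreName;
            this.docStoreOffset = docStoreOffset;
            this.docCount = docCount;
        }
    }

    /**
     * Returns the single doc-store offset the merged segment could record
     * if it keeps referencing the original shared doc store, or -1 if the
     * segments are not contiguous there (so the "slices" would have to be
     * copied into new, private doc-store files).
     */
    static int sharedOffsetAfterMerge(List<Seg> toMerge) {
        if (toMerge.isEmpty()) {
            return -1;
        }
        Seg first = toMerge.get(0);
        int expectedOffset = first.docStoreOffset;
        for (Seg s : toMerge) {
            boolean sameStore = s.docStoreName.equals(first.docStoreName);
            // Each segment must start exactly where the previous one ended.
            if (!sameStore || s.docStoreOffset != expectedOffset) {
                return -1;
            }
            expectedOffset += s.docCount;
        }
        return first.docStoreOffset;
    }
}
```

With two segments of 100 and 50 docs at offsets 0 and 100 in the same
doc store, the sketch returns 0, so the merged segment can reuse the
shared files. If the second segment instead starts at offset 150, or
lives in a different doc store, it returns -1, which corresponds to the
expensive case described above.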