The biggest reason for using compound files in Lucene is to lower the
number of open file handles on Linux-based systems.  However, this comes
at the cost of writing the data twice: once for writing the files
normally and once for writing them into the compound file.  Also, once
segment sizes get large, Lucene turns the compound file off anyway
because of that double-write issue.
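
For concreteness, here is a minimal fragment (assuming a Lucene 4.x-era
TieredMergePolicy; the analyzer and Version constant are placeholders)
showing the switch the question below refers to, next to the knob Lucene
itself uses to skip compound files for large segments:

    TieredMergePolicy mergePolicy = new TieredMergePolicy();
    mergePolicy.setUseCompoundFile(false); // what Blur does: no CFS at all
    // Even with CFS enabled, Lucene only compounds small segments, e.g.:
    // mergePolicy.setNoCFSRatio(0.1);     // illustrative value, not Blur's
    IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_43, analyzer);
    conf.setMergePolicy(mergePolicy);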

While writing data in Blur the number of open files is still a concern,
but less so.  Let me explain.

During the MR Bulk Ingest:

The index is built using an output format and then optimized during the
copy from the local indexing reducer to HDFS.  So you end up with either a
fully optimized index or one additional (hopefully large) segment to be
added to the main index.  Compound files here would only slow down the
copy, and if the segments are large enough Lucene wouldn't create them
anyway.
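
As a rough sketch of that shape (the names here are placeholders, not
Blur's actual output format code): the reducer builds the index locally,
optimizes it, and the copy to HDFS then carries already-merged segments:

    IndexWriter local = new IndexWriter(localDir, conf);
    // ... reducer adds its documents ...
    local.forceMerge(1);  // fully optimized before the copy to HDFS
    local.close();
    // copy localDir to the HdfsDirectory; the main index then picks it
    // up as one additional (hopefully large) segment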

During NRT updates:

The normal process for Blur is to use the JoinDirectory, which merges short
term and long term storage into a single directory.  The long term storage
is typically the HdfsDirectory, and segments are only written there once
they have been through the merge scheduler.  This means blocking merges due
to NRT updates or flushes (these are both small merges) are written to
short term storage instead of the slower long term storage.
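
The routing decision could be pictured like this (a hypothetical sketch,
not JoinDirectory's actual code; LARGE_MERGE_BYTES is an invented cutoff):

    // Assumes org.apache.lucene.store.Directory and IOContext (Lucene 4.x).
    Directory pick(IOContext ctx, Directory shortTerm, Directory longTerm) {
        if (ctx.context == IOContext.Context.MERGE
                && ctx.mergeInfo != null
                && ctx.mergeInfo.estimatedMergeBytes > LARGE_MERGE_BYTES) {
            return longTerm;  // big scheduled merge: write straight to HDFS
        }
        return shortTerm;     // NRT flushes and small merges stay fast
    }
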
The short term storage is a directory backed by the HdfsKeyValueStore.
This store writes all the logical files into a single log-style file and
syncs to HDFS when commit is called.  So in a sense the HdfsKeyValueStore
is a compound file writer; it just doesn't have to write the data twice.
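
To make the "compound file writer without the double write" idea concrete,
here is a minimal sketch (not the real HdfsKeyValueStore API; it assumes a
Hadoop 2-style hsync()) of a log-style store where each logical file is
appended to one log exactly once:

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    class TinyLogStore {
        private final FSDataOutputStream log;

        TinyLogStore(FileSystem fs, Path path) throws IOException {
            log = fs.create(path);
        }

        // Each logical file becomes one appended record; the bytes are
        // written exactly once, unlike Lucene's CFS double write.
        void put(String logicalFile, byte[] data) throws IOException {
            log.writeUTF(logicalFile);  // logical file name
            log.writeInt(data.length);  // record length
            log.write(data);            // payload
        }

        // Called at commit time: force the log durable on HDFS.
        void commit() throws IOException {
            log.hsync();
        }
    }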

So that's why the compound file feature of Lucene is disabled in Blur.

Does that answer your question?

Aaron

On Monday, September 22, 2014, Ravikumar Govindarajan <
[email protected]> wrote:

> I came across the code in BlurIndexSimpleWriter that uses
> mergePolicy.setUseCompoundFile(false);
>
> Are there particular reasons for avoiding compound files, since Lucene
> encourages using compound files by default?
>
> --
> Ravi
>
