It is great to see such a detailed analysis of a simple one-line piece of
code. Many thanks...

Can we have a Blur config switch to decide whether background
segment merges in HDFS should go via CFS, with documentation of the
double-write issue? Applications facing open-file-handle issues could
at least temporarily alleviate the problem...

The default can always be non-CFS.
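
Something along these lines is what I have in mind, purely as a rough
sketch. The property name and the helper class below are made up for
illustration and do not exist in Blur today; the only real call is the
mergePolicy.setUseCompoundFile(...) line already in BlurIndexSimpleWriter,
which would take the returned value instead of the hard-coded false.

```java
import java.util.Properties;

// Hypothetical sketch: neither this class nor the property name exists in Blur.
final class MergeCompoundFileSwitch {

    // Assumed property name, invented for this example.
    static final String KEY = "blur.shard.merge.use.compound.file";

    // Returns true only when the switch is explicitly set to "true".
    // A missing or unrecognised value falls back to the non-CFS default.
    static boolean useCompoundFile(Properties conf) {
        return Boolean.parseBoolean(conf.getProperty(KEY, "false").trim());
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println(useCompoundFile(conf)); // switch not set

        conf.setProperty(KEY, "true");
        System.out.println(useCompoundFile(conf)); // switch turned on

        // In BlurIndexSimpleWriter the result would replace the constant:
        //   mergePolicy.setUseCompoundFile(useCompoundFile(conf));
    }
}
```

With no property set the helper returns false, so existing deployments
would keep the current non-CFS behaviour unless they opt in.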

--
Ravi

On Mon, Sep 22, 2014 at 11:54 PM, Aaron McCurry <[email protected]> wrote:

> The biggest reason for using compound files in Lucene is to lower the
> number of open file handles on Linux-based systems.  However, this comes at
> the cost of writing the data twice: once for writing the files normally and
> once for writing them into the compound file.  Also, once the segment sizes
> get large, the compound file is turned off because of the double-write
> issue.
>
> While writing data in Blur, the number of open files is still a concern,
> but less so. Let me explain.
>
> During the MR Bulk Ingest:
>
> The index is built using an output format and then optimized during the
> copy from the local indexing reducer to hdfs.  So you end up with a fully
> optimized index or one additional (hopefully large) segment to be added to
> the main index.  So compound files here will only slow down the copy, and if
> the segments are large enough Lucene wouldn't create them anyway.
>
> During NRT updates:
>
> The normal process for Blur is to use the JoinDirectory, which merges short
> term and long term storage into a single directory.  The long term storage
> is typically the HdfsDirectory, and segments are only written there once they
> have been through the merge scheduler.  This means blocking merges due to NRT
> updates or flushes (these are both small merges) are written to short term
> storage instead of the slower long term storage.  The short term storage
> is a directory backed by the HdfsKeyValueStore.  This store writes all the
> logical files into a single log-style file and syncs to hdfs when commit is
> called.  So in a sense the HdfsKeyValueStore is a compound file writer;
> it just doesn't have to write the data twice.
>
> So that's why the compound file feature of Lucene is disabled in Blur.
>
> Does that answer your question?
>
> Aaron
>
> On Monday, September 22, 2014, Ravikumar Govindarajan <
> [email protected]> wrote:
>
> > I came across the code in BlurIndexSimpleWriter that is using
> > mergePolicy.setUseCompoundFile(false);
> >
> > Are there particular reasons for avoiding compound files, given that
> > Lucene encourages using them by default?
> >
> > --
> > Ravi
> >
>
