On Wed, Sep 24, 2014 at 2:38 AM, Ravikumar Govindarajan <
[email protected]> wrote:

> I think it is great to see such a detailed analysis of a simple one-liner.
> Many thanks...
>
> Can we have a Blur config switch to decide whether background
> segment merges in HDFS should go via CFS, with documentation of the
> double-write issue? Applications facing open-file-handle issues could
> then at least temporarily alleviate the problem...
>

I don't see why not.  However, I would like to understand why this is a
problem if you are using the HDFS directories from Blur, because if you are,
you really shouldn't have an issue with open file handles.

Aaron
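
For what it's worth, a minimal sketch of what such a switch might look like.  The property name `blur.table.compound.file.enabled` is hypothetical, and Lucene's merge policy is stubbed out so the snippet compiles on its own:

```java
import java.util.Properties;

public class CompoundFileSwitch {
    // Stub standing in for Lucene's merge policy; the real class
    // exposes the same setUseCompoundFile(boolean) knob.
    static class MergePolicyStub {
        boolean useCompoundFile = true;
        void setUseCompoundFile(boolean v) { useCompoundFile = v; }
    }

    public static void main(String[] args) {
        Properties tableProps = new Properties();
        // Hypothetical per-table property; absent, the default stays non-CFS.
        tableProps.setProperty("blur.table.compound.file.enabled", "false");

        boolean useCfs = Boolean.parseBoolean(
            tableProps.getProperty("blur.table.compound.file.enabled", "false"));
        MergePolicyStub mergePolicy = new MergePolicyStub();
        mergePolicy.setUseCompoundFile(useCfs);
        System.out.println("useCompoundFile=" + mergePolicy.useCompoundFile);
    }
}
```

Defaulting the property to "false" would keep non-CFS as the out-of-the-box behavior, as Ravi suggests below.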


>
> The default can always be non-CFS
>
> --
> Ravi
>
> On Mon, Sep 22, 2014 at 11:54 PM, Aaron McCurry <[email protected]>
> wrote:
>
> > The biggest reason for using compound files in Lucene is to lower the
> > number of open file handles on Linux-based systems.  However, this comes
> > at the cost of writing the data twice: once when the files are written
> > normally, and once when they are copied into the compound file.  Also,
> > once segment sizes get large, compound files are turned off because of
> > this double-write issue.
> >
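
As an aside, the "turned off once segments get large" behavior can be sketched as a simple size check.  This is modeled loosely on the noCFSRatio / maxCFSSegmentSize settings on Lucene's merge policies; the thresholds below are illustrative, not Lucene's actual defaults:

```java
public class CfsCutoffSketch {
    // Decide whether a merged segment should be packed into a compound file.
    static boolean useCompoundFile(long segmentBytes, long totalIndexBytes,
                                   double noCFSRatio, long maxCFSSegmentBytes) {
        if (segmentBytes > maxCFSSegmentBytes) {
            return false;  // segment is too big outright
        }
        // Otherwise only use CFS when the segment is small relative to the index.
        return segmentBytes <= noCFSRatio * totalIndexBytes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // A 5 GB merged segment in a 10 GB index, ratio 0.1, cap 2 GB:
        System.out.println(useCompoundFile(5000 * mb, 10000 * mb, 0.1, 2048 * mb)); // false
        // A 100 MB flush segment in the same index:
        System.out.println(useCompoundFile(100 * mb, 10000 * mb, 0.1, 2048 * mb));  // true
    }
}
```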
> > While writing data in Blur, the number of open files is still a concern,
> > but less so.  Let me explain.
> >
> > During the MR Bulk Ingest:
> >
> > The index is built using an output format and then optimized during the
> > copy from the local indexing reducer to HDFS.  So you end up with a fully
> > optimized index, or one additional (hopefully large) segment to be added
> > to the main index.  So compound files here would only slow down the copy,
> > and if the segments are large enough Lucene wouldn't create them anyway.
> >
> > During NRT updates:
> >
> > The normal process for Blur is to use the JoinDirectory, which merges
> > short-term and long-term storage into a single directory.  The long-term
> > storage is typically the HdfsDirectory, and segments are only written
> > there once they have been through the merge scheduler.  This means
> > blocking merges due to NRT updates or flushes (these are both small
> > merges) are written to short-term storage instead of the slower long-term
> > storage.  The short-term storage is a directory backed by the
> > HdfsKeyValueStore.  This store writes all the logical files into a single
> > log-style file and syncs to HDFS when commit is called.  So in a sense
> > the HdfsKeyValueStore is a compound file writer; it just doesn't have to
> > write the data twice.
> >
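
To make that last point concrete, here is a toy, self-contained log-style store in the same spirit: many logical files packed into one physical file, each logical file's bytes written exactly once.  The record layout and segment file names are made up for illustration; this is not Blur's actual on-disk format:

```java
import java.io.*;
import java.util.*;

public class LogStoreSketch {

    // Append one logical file as a [name][length][bytes] record.
    static void writeRecord(DataOutputStream out, String name, byte[] data)
            throws IOException {
        out.writeUTF(name);        // logical file name
        out.writeInt(data.length);
        out.write(data);           // bytes go straight into the log, once
    }

    // Write two logical files into one log file, then read the names back.
    static List<String> roundTrip(File log) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(log)))) {
            writeRecord(out, "_0.fdt", new byte[]{1, 2, 3});
            writeRecord(out, "_0.fdx", new byte[]{4, 5});
        }
        List<String> names = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(log)))) {
            while (in.available() > 0) {
                names.add(in.readUTF());
                byte[] data = new byte[in.readInt()];
                in.readFully(data);
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        File log = File.createTempFile("kvstore", ".log");
        // One open handle for any number of logical files, no double write.
        System.out.println(roundTrip(log));
        log.delete();
    }
}
```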
> > So that's why the compound file feature of Lucene is disabled in Blur.
> >
> > Does that answer your question?
> >
> > Aaron
> >
> > On Monday, September 22, 2014, Ravikumar Govindarajan <
> > [email protected]> wrote:
> >
> > > I came across the code in BlurIndexSimpleWriter that calls
> > > mergePolicy.setUseCompoundFile(false);
> > >
> > > Are there particular reasons for avoiding compound files, given that
> > > Lucene enables them by default?
> > >
> > > --
> > > Ravi
> > >
> >
>
