There was a bug on our side that caused open file handles to shoot up. It would eventually show up even if we used CFS, but it led me to some basic questions.
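For reference, a minimal sketch of one way to watch the process's handle count from inside the JVM while reproducing this (it relies on the com.sun.management extension, so it assumes a HotSpot/OpenJDK JVM on a Unix-like system; it is not Blur code):

    import java.lang.management.ManagementFactory;
    import java.lang.management.OperatingSystemMXBean;

    import com.sun.management.UnixOperatingSystemMXBean;

    // Prints the JVM's current and maximum file-descriptor counts.
    // Only works where the platform MXBean is the Unix variant.
    public class FdCheck {
      public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof UnixOperatingSystemMXBean) {
          UnixOperatingSystemMXBean unix = (UnixOperatingSystemMXBean) os;
          System.out.println("open fds: " + unix.getOpenFileDescriptorCount());
          System.out.println("max fds:  " + unix.getMaxFileDescriptorCount());
        } else {
          System.out.println("file-descriptor counts not available on this JVM");
        }
      }
    }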
Also, there is another question I have. Why don't we close the FSDataInputStream when an HdfsIndexInput is closed? I find that we close the FSDataInputStream only during a file delete [HdfsDirectory.delete()]. Is there a reason for doing so... (a rough sketch of what I mean is at the end of this mail)

--
Ravi

On Thu, Sep 25, 2014 at 6:28 AM, Aaron McCurry <[email protected]> wrote:

> On Wed, Sep 24, 2014 at 2:38 AM, Ravikumar Govindarajan <
> [email protected]> wrote:
>
> > It is great to see such a detailed analysis of a simple one-liner of
> > code. Many thanks...
> >
> > Can we have a blur-config switch to decide whether background segment
> > merges in HDFS should go via CFS, with documentation of the
> > double-write issue? Applications facing open-file-handle issues could
> > at least temporarily alleviate the problem...
>
> I don't see why not. However, I would like to understand why it is a
> problem if you are using the HDFS directories from Blur, because if you
> are, you really shouldn't have a problem with open file handles.
>
> Aaron
>
> > The default can always be non-CFS.
> >
> > --
> > Ravi
> >
> > On Mon, Sep 22, 2014 at 11:54 PM, Aaron McCurry <[email protected]>
> > wrote:
> >
> > > The biggest reason for using compound files in Lucene is to lower
> > > the number of open file handles on Linux-based systems. However,
> > > this comes at the cost of double-writing the data: once for writing
> > > the files normally, and once for writing them into the compound
> > > file. Also, once the segment sizes get large, the compound file is
> > > turned off because of the double-write issue.
> > >
> > > When writing data in Blur the number of open files is still a
> > > concern, but less so; let me explain.
> > >
> > > During the MR bulk ingest:
> > >
> > > The index is built using an output format and then optimized during
> > > the copy from the local indexing reducer to HDFS. So you end up with
> > > a fully optimized index, or one additional (hopefully large) segment
> > > to be added to the main index. Compound files here would only slow
> > > down the copy, and if the segments are large enough Lucene wouldn't
> > > create them anyway.
> > >
> > > During NRT updates:
> > >
> > > The normal process for Blur is to use the JoinDirectory, which
> > > merges short-term and long-term storage into a single directory.
> > > The long-term storage is typically the HdfsDirectory, and segments
> > > are only written there once they have been through the merge
> > > scheduler. This means blocking merges due to NRT updates or flushes
> > > (these are both small merges) are written to short-term storage
> > > instead of the slower long-term storage. The short-term storage is a
> > > directory backed by the HdfsKeyValueStore. This store writes all the
> > > logical files into a single log-style file and syncs to HDFS when
> > > commit is called. So in a sense the HdfsKeyValueStore is a
> > > compound-file writer; it just doesn't have to write the data twice.
> > >
> > > So that's why the compound-file feature of Lucene is disabled in
> > > Blur.
> > >
> > > Does that answer your question?
> > >
> > > Aaron
> > >
> > > On Monday, September 22, 2014, Ravikumar Govindarajan <
> > > [email protected]> wrote:
> > >
> > > > I came across the code in BlurIndexSimpleWriter that uses
> > > > mergePolicy.setUseCompoundFile(false);
> > > >
> > > > Are there particular reasons for avoiding compound files, given
> > > > that Lucene encourages using them by default?
> > > >
> > > > --
> > > > Ravi
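Here is the sketch I mentioned above. It is not Blur's actual HdfsIndexInput, just a stripped-down, hypothetical wrapper around an FSDataInputStream (names are made up) to make the question concrete: close the underlying stream when the reader is closed, instead of only when HdfsDirectory.delete() runs.

    import java.io.Closeable;
    import java.io.IOException;

    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Hypothetical reader, only to illustrate the question; not the real
    // HdfsIndexInput.
    public class ClosingHdfsReader implements Closeable {

      private final FSDataInputStream input;

      public ClosingHdfsReader(FileSystem fileSystem, Path path) throws IOException {
        // One HDFS stream (and therefore one handle) per open logical file.
        this.input = fileSystem.open(path);
      }

      public void readFully(long position, byte[] buffer, int offset, int length)
          throws IOException {
        // Positioned read; does not depend on the stream's current offset.
        input.readFully(position, buffer, offset, length);
      }

      @Override
      public void close() throws IOException {
        // The point of the question: release the handle here, when the
        // reader is closed, rather than only when the file is deleted.
        input.close();
      }
    }

If clones or other readers share the same underlying stream, that would explain deferring the close to delete(); that is really what I am trying to understand.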
