I do understand the attack of the clones in Lucene and the expense of
setting up an FSDataInputStream.
But most of Lucene's code has something like this in its close() method:
if (!isClone) {
  // Close the underlying resource...
}
They actually don't seem to worry about any open clones at this point. Can
we not do something like this and close the FSDataInputStream?
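
Something along these lines, for example (just a sketch; "inputStream" is a
hypothetical field on HdfsIndexInput holding the underlying stream):

@Override
public void close() throws IOException {
  if (!isClone) {
    // Only the original IndexInput owns the handle; clones merely share it,
    // so the FSDataInputStream would be closed exactly once.
    inputStream.close();
  }
}
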
Also, do we occasionally close our IndexSearchers in Blur, or will they
always just be refreshed from time to time?
--
Ravi
On Thu, Sep 25, 2014 at 5:20 PM, Aaron McCurry <[email protected]> wrote:
> On Thu, Sep 25, 2014 at 1:26 AM, Ravikumar Govindarajan <
> [email protected]> wrote:
>
> > There was some bug on our side because of which open file handles
> > shot up. It will eventually show up even if we use CFS, but it led me
> > to some basic questions.
> >
> > Also, there is another question I have: why don't we close the
> > FSDataInputStream when an HdfsIndexInput is closed?
> >
>
> The FSDataInputStream is very heavy to create; basically any call that
> requires a trip to the namenode is very heavy. Lucene clones the
> IndexInput objects all over the place to allow for concurrent access to
> the same file (the Lucene framework doesn't actually close these cloned
> objects). In the HdfsDirectory all accesses to the same file go through
> the same file handle (aka FSDataInputStream instance) and use the
> read-with-position calls, which are thread-safe. So when a clone occurs
> we need to be as fast as possible, same with calling open(), and that's
> why we cache the handles centrally in the HdfsDirectory. The current
> assumption in Blur is that you need access to all the files in the
> index, so when a file is newly written it is opened (or reused if it is
> already open), and when the file is removed its handle is closed.
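>
> A rough illustration of the pattern (names here are made up for the
> example; this is not the actual HdfsDirectory code):
>
> import java.io.IOException;
> import org.apache.hadoop.fs.FSDataInputStream;
>
> // One FSDataInputStream per file, shared by the original IndexInput and
> // every clone of it. The positional read below is thread-safe, so clones
> // never seek a shared stream and never own a handle that needs closing.
> class SharedFileAccess {
>   private final FSDataInputStream input; // cached centrally, one per file
>
>   SharedFileAccess(FSDataInputStream input) {
>     this.input = input;
>   }
>
>   void read(long position, byte[] buffer, int offset, int length)
>       throws IOException {
>     input.readFully(position, buffer, offset, length); // no shared seek state
>   }
> }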
>
> We could improve this by adding a timeout feature: if a file has not
> been accessed in some time, we close its handle.
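>
> Something like this, perhaps (purely a sketch; the cache, CachedInput and
> idleTimeoutMillis names are hypothetical, nothing like this exists today):
>
> // Run periodically: close cached streams that have been idle too long.
> Iterator<Map.Entry<String, CachedInput>> it = cache.entrySet().iterator();
> while (it.hasNext()) {
>   CachedInput cached = it.next().getValue();
>   if (System.currentTimeMillis() - cached.lastAccessTime > idleTimeoutMillis) {
>     cached.stream.close(); // reopened lazily on the next read of the file
>     it.remove();
>   }
> }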
>
> Does this explain what is happening and why?
>
> Thanks!
>
> Aaron
>
>
> >
> > I find that we close the FSDataInputStream only during a file delete
> > [HdfsDirectory.delete()].
> >
> > Is there a reason for doing so?
> >
> > --
> > Ravi
> >
> > On Thu, Sep 25, 2014 at 6:28 AM, Aaron McCurry <[email protected]>
> > wrote:
> >
> > > On Wed, Sep 24, 2014 at 2:38 AM, Ravikumar Govindarajan <
> > > [email protected]> wrote:
> > >
> > > > I think it is great to see such a detailed analysis of a simple
> > > > one-liner of code. Many thanks...
> > > >
> > > > Can we have a blur-config switch to decide whether background
> > > > segment merges in HDFS should go via CFS, with documentation of the
> > > > double-write issue? Applications facing open file handle issues
> > > > could at least temporarily alleviate the problem...
> > > >
> > >
> > > I don't see why not. However, I would like to understand why it is a
> > > problem if you are using the HDFS directories from Blur, because if
> > > you are, then you really shouldn't have a problem with open file
> > > handles.
> > >
> > > Aaron
> > >
> > >
> > > >
> > > > The default can always be non-CFS
> > > >
> > > > --
> > > > Ravi
> > > >
> > > > On Mon, Sep 22, 2014 at 11:54 PM, Aaron McCurry <[email protected]>
> > > > wrote:
> > > >
> > > > > The biggest reason for using compound files in Lucene is to
> > > > > lower the number of open file handles on Linux-based systems.
> > > > > However, this comes at the cost of writing the data twice: once
> > > > > for writing the files normally and once for writing them into
> > > > > the compound file. Also, once segment sizes get large, the
> > > > > compound file is turned off because of the double-write issue.
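> > > > >
> > > > > In Lucene terms, something like this (a sketch; indexWriterConfig
> > > > > is assumed to exist, and the exact setter names vary a bit across
> > > > > Lucene versions):
> > > > >
> > > > > TieredMergePolicy mergePolicy = new TieredMergePolicy();
> > > > > // Allow compound files only while merged segments stay small
> > > > > // relative to the whole index; large segments skip CFS so their
> > > > > // data is not written twice.
> > > > > mergePolicy.setNoCFSRatio(0.1); // merges bigger than ~10% of the
> > > > >                                 // index stay non-compound
> > > > > indexWriterConfig.setMergePolicy(mergePolicy);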
> > > > >
> > > > > While writing data in Blur the number of open files is still a
> > > > > concern, but less so; let me explain.
> > > > >
> > > > > During the MR Bulk Ingest:
> > > > >
> > > > > The index is built using an output format and then optimized
> > > > > during the copy from the local indexing reducer to HDFS. So you
> > > > > end up with a fully optimized index, or one additional
> > > > > (hopefully large) segment to be added to the main index.
> > > > > Compound files here would only slow down the copy, and if the
> > > > > segments are large enough Lucene wouldn't create them anyway.
> > > > >
> > > > > During NRT updates:
> > > > >
> > > > > The normal process for Blur is to use the JoinDirectory, which
> > > > > merges short term and long term storage into a single directory.
> > > > > The long term storage is typically the HdfsDirectory, and
> > > > > segments are only written there once they have been through the
> > > > > merge scheduler. This means blocking merges due to NRT updates
> > > > > or flushes (these are both small merges) are written to short
> > > > > term storage instead of the slower long term storage. The short
> > > > > term storage is a directory backed by the HdfsKeyValueStore.
> > > > > This store writes all the logical files into a single log-style
> > > > > file and syncs to HDFS when commit is called. So in a sense the
> > > > > HdfsKeyValueStore is a compound file writer; it just doesn't
> > > > > have to write the data twice.
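> > > > >
> > > > > Conceptually the short term side looks something like this
> > > > > (a hypothetical sketch, not the real HdfsKeyValueStore API):
> > > > >
> > > > > import java.io.IOException;
> > > > > import org.apache.hadoop.fs.FSDataOutputStream;
> > > > >
> > > > > // Every logical Lucene file becomes a record appended to one
> > > > > // log file in HDFS, so small NRT segments are written only once
> > > > > // and synced to HDFS only when commit is called.
> > > > > class LogStoreSketch {
> > > > >   private final FSDataOutputStream log; // single append-only file
> > > > >
> > > > >   LogStoreSketch(FSDataOutputStream log) {
> > > > >     this.log = log;
> > > > >   }
> > > > >
> > > > >   void put(String fileName, byte[] contents) throws IOException {
> > > > >     log.writeUTF(fileName);    // record key: the logical file name
> > > > >     log.writeInt(contents.length);
> > > > >     log.write(contents);       // record value: the file bytes
> > > > >   }
> > > > >
> > > > >   void commit() throws IOException {
> > > > >     log.hsync();               // flush and sync the log to HDFS
> > > > >   }
> > > > > }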
> > > > >
> > > > > So that's why the compound file feature of Lucene is disabled
> > > > > in Blur.
> > > > >
> > > > > Does that answer your question?
> > > > >
> > > > > Aaron
> > > > >
> > > > > On Monday, September 22, 2014, Ravikumar Govindarajan <
> > > > > [email protected]> wrote:
> > > > >
> > > > > > I came across the code in BlurIndexSimpleWriter that is using
> > > > > > mergePolicy.setUseCompoundFile(false);
> > > > > >
> > > > > > Are there particular reasons for avoiding compound files, given
> > > > > > that Lucene encourages using them by default?
> > > > > >
> > > > > > --
> > > > > > Ravi
> > > > > >
> > > > >
> > > >
> > >
> >
>