On Tue, Sep 30, 2014 at 2:33 AM, Ravikumar Govindarajan <
[email protected]> wrote:

> I do understand the attack of the clones in Lucene and the expense of
> setting up an FSDataInputStream.
>
> But most of Lucene's code has this in its close method:
>
> if (!isClone) {
>   // Close the underlying resource...
> }
>
> They don't actually seem to worry about any open clones at this point. Can
> we not do something like this and close the FSDataInputStream?
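>
> Applied to the HDFS input it might look something like this (just a sketch;
> the field names are illustrative, not the actual HdfsIndexInput code):
>
> @Override
> public void close() throws IOException {
>   if (!isClone) {
>     // Only the original (non-clone) instance owns the underlying stream.
>     underlyingStream.close();
>   }
> }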
>

We could; I was just making a global reference map for all open files
(FSDataInputStreams), so that if the same file is opened more than once the
same reference can be used.  It would be interesting to know which
implementation is better for overall performance.
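
Roughly something like this (only a sketch of the idea, not the actual
HdfsDirectory code; the class and field names are made up):

import java.io.IOException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OpenFileCache {

  private final ConcurrentMap<Path, FSDataInputStream> _open =
      new ConcurrentHashMap<Path, FSDataInputStream>();

  // Open each file at most once; later requests for the same path reuse the handle.
  public FSDataInputStream get(FileSystem fileSystem, Path path) throws IOException {
    FSDataInputStream input = _open.get(path);
    if (input == null) {
      FSDataInputStream newInput = fileSystem.open(path);
      input = _open.putIfAbsent(path, newInput);
      if (input == null) {
        input = newInput;
      } else {
        newInput.close(); // lost the race to another thread, drop the extra handle
      }
    }
    return input;
  }

  // Called when the file is deleted, so the shared handle is finally released.
  public void close(Path path) throws IOException {
    FSDataInputStream input = _open.remove(path);
    if (input != null) {
      input.close();
    }
  }
}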


>
> Also, do we occasionally close our IndexSearchers in Blur or they will
> always be refreshed from time-to-time?
>

Whenever the IndexSearchers (or the IndexSearcherClosable) are used they
need to be closed, or they will leak IndexReader references.  The internal
IndexReader reference counter is used to figure out when the reader can be
closed.
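
For example (just a sketch; getIndexSearcher() here stands in for however you
obtain the searcher in your code path):

IndexSearcherClosable searcher = getIndexSearcher(); // however you obtain it
try {
  // ... run the query against searcher ...
} finally {
  searcher.close(); // releases the internal IndexReader reference
}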

Aaron


>
> --
> Ravi
>
> On Thu, Sep 25, 2014 at 5:20 PM, Aaron McCurry <[email protected]> wrote:
>
> > On Thu, Sep 25, 2014 at 1:26 AM, Ravikumar Govindarajan <
> > [email protected]> wrote:
> >
> > > There was a bug on our side because of which open file handles shot up.
> > > It will eventually show up even if we use CFS, but it led me to some
> > > basic questions.
> > >
> > > Also, there is another question I have. Why don't we close the
> > > FSDataInputStream when a HdfsIndexInput is closed?
> > >
> >
> > The FSDataInputStream is very heavy to create; basically any call that
> > requires a trip to the namenode is very heavy.  Lucene actually clones the
> > IndexInput objects all over the place to allow for concurrent access to
> > the same file (the Lucene framework doesn't actually close these cloned
> > objects).  In the HdfsDirectory all accesses to the same file go through
> > the same file handle (aka FSDataInputStream instance) and use the
> > positional read calls, which are thread safe.  So when a clone occurs we
> > need to be as fast as possible, same with calling open(), and that's why
> > we cache them centrally in the HdfsDirectory.  The current assumption in
> > Blur is that you need access to all the files in the index, so a file is
> > opened when it is newly written (or when it is needed and not already
> > open), and it is closed when the file is removed.
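> >
> > The positional reads look roughly like this (readFully(position, ...) is
> > the standard FSDataInputStream API; everything around it is simplified,
> > not the actual HdfsDirectory code):
> >
> > // One shared stream per file; each clone only remembers its own position.
> > FSDataInputStream input = fileSystem.open(path);
> > byte[] buffer = new byte[length];
> > // A positional read does not move the shared file pointer, so many clones
> > // can read from the same handle concurrently.
> > input.readFully(position, buffer, 0, length);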
> >
> > We could improve this by providing some timeout feature: if the file has
> > not been accessed in some time, we could close the handle.
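> >
> > If we went that route, the idea would be roughly (purely illustrative;
> > none of these names exist in HdfsDirectory today):
> >
> > // Periodically close handles that have been idle longer than the timeout.
> > long now = System.currentTimeMillis();
> > for (Map.Entry<Path, CachedInput> e : openInputs.entrySet()) {
> >   CachedInput cached = e.getValue();
> >   if (now - cached.lastAccess > idleTimeoutMillis) {
> >     cached.stream.close();
> >     openInputs.remove(e.getKey(), cached); // only if it is still the same entry
> >   }
> > }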
> >
> > Does this explain what is happening and why?
> >
> > Thanks!
> >
> > Aaron
> >
> >
> > >
> > > I find that we close the FSDataInputStream only during a file delete
> > > [HdfsDirectory.delete()].
> > >
> > > Is there a reason for doing so...
> > >
> > > --
> > > Ravi
> > >
> > > On Thu, Sep 25, 2014 at 6:28 AM, Aaron McCurry <[email protected]>
> > > wrote:
> > >
> > > > On Wed, Sep 24, 2014 at 2:38 AM, Ravikumar Govindarajan <
> > > > [email protected]> wrote:
> > > >
> > > > > I think it is great to see such a detailed analysis of a simple
> > > > > one-liner of code. Many thanks...
> > > > >
> > > > > Can we have a blur-config switch to decide whether background
> > > > > segment-merges in HDFS should go via CFS, with documentation of the
> > > > > double-write issue? Applications facing open-file-handle issues could
> > > > > at least temporarily alleviate the problem...
> > > > >
> > > >
> > > > I don't see why not.  However, I would like to understand why it is a
> > > > problem if you are using the HDFS directories from Blur, because if you
> > > > are then you really shouldn't have a problem with open file handles.
> > > >
> > > > Aaron
> > > >
> > > >
> > > > >
> > > > > The default can always be non-CFS
> > > > >
> > > > > --
> > > > > Ravi
> > > > >
> > > > > On Mon, Sep 22, 2014 at 11:54 PM, Aaron McCurry <[email protected]>
> > > > > wrote:
> > > > >
> > > > > > The biggest reason for using compound files in Lucene is to lower
> > > > > > the number of open file handles on Linux-based systems.  However,
> > > > > > this comes at the cost of writing the data twice: once for writing
> > > > > > the files normally and once for writing them into the compound
> > > > > > file.  Also, once the segment sizes get large the compound file is
> > > > > > turned off because of the double-write issue.
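> > > > > >
> > > > > > For reference, the knobs involved look roughly like this (an
> > > > > > illustrative sketch; the values are made up, and mergePolicy is the
> > > > > > same merge policy object Blur already configures):
> > > > > >
> > > > > > mergePolicy.setUseCompoundFile(false); // what Blur does: skip the CFS write
> > > > > > // Or keep CFS only for small segments so large merges avoid the
> > > > > > // double write:
> > > > > > mergePolicy.setNoCFSRatio(0.10);
> > > > > > mergePolicy.setMaxCFSSegmentSizeMB(512.0);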
> > > > > >
> > > > > > While writing data in Blur the number of open files is still a
> > > > > > concern, but less so; let me explain.
> > > > > >
> > > > > > During the MR Bulk Ingest:
> > > > > >
> > > > > > The index is built using an output format and then optimized during
> > > > > > the copy from the local indexing reducer to HDFS.  So you end up
> > > > > > with a fully optimized index, or one additional (hopefully large)
> > > > > > segment to be added to the main index.  Compound files here would
> > > > > > only slow down the copy, and if the segments are large enough Lucene
> > > > > > wouldn't create them anyway.
> > > > > >
> > > > > > During NRT updates:
> > > > > >
> > > > > > The normal process for Blur is to use the JoinDirectory, which
> > > > > > merges short term and long term storage into a single directory.
> > > > > > The long term storage is typically the HdfsDirectory, and segments
> > > > > > are only written there once they have been through the merge
> > > > > > scheduler.  This means blocking merges due to NRT updates, and
> > > > > > flushes (both of these are small merges), are written to short term
> > > > > > storage instead of the slower long term storage.  The short term
> > > > > > storage is a directory backed by the HdfsKeyValueStore.  This store
> > > > > > writes all the logical files into a single log-style file and syncs
> > > > > > to HDFS when commit is called.  So in a sense the HdfsKeyValueStore
> > > > > > is a compound file writer, it just doesn't have to write the data
> > > > > > twice.
> > > > > >
> > > > > > So that's why the compound file feature of Lucene is disabled in
> > > > > > Blur.
> > > > > >
> > > > > > Does that answer your question?
> > > > > >
> > > > > > Aaron
> > > > > >
> > > > > > On Monday, September 22, 2014, Ravikumar Govindarajan <
> > > > > > [email protected]> wrote:
> > > > > >
> > > > > > > I came across the code in BlurIndexSimpleWriter that is using
> > > > > > > mergePolicy.setUseCompoundFile(false);
> > > > > > >
> > > > > > > Are there particular reasons for avoiding compound files, given
> > > > > > > that Lucene encourages using them by default?
> > > > > > >
> > > > > > > --
> > > > > > > Ravi
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
