There was a bug on our side that caused open file handles to shoot up. It would eventually show up even if we used CFS, but it led me to some basic questions.
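For reference, a minimal sketch of one way to watch the process's handle count from inside the JVM while reproducing this (it relies on the com.sun.management extension, so it assumes a HotSpot/OpenJDK JVM on a Unix-like system; it is not Blur code):

    import java.lang.management.ManagementFactory;
    import java.lang.management.OperatingSystemMXBean;

    import com.sun.management.UnixOperatingSystemMXBean;

    // Prints the JVM's current and maximum file-descriptor counts.
    // Only works where the platform MXBean is the Unix variant.
    public class FdCheck {
      public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof UnixOperatingSystemMXBean) {
          UnixOperatingSystemMXBean unix = (UnixOperatingSystemMXBean) os;
          System.out.println("open fds: " + unix.getOpenFileDescriptorCount());
          System.out.println("max fds:  " + unix.getMaxFileDescriptorCount());
        } else {
          System.out.println("file-descriptor counts not available on this JVM");
        }
      }
    }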
Also, there is another question I have. Why don't we close the FSDataInputStream when an HdfsIndexInput is closed? I find that we close the FSDataInputStream only during a file delete [HdfsDirectory.delete()]. Is there a reason for doing so... (a rough sketch of what I mean is at the end of this mail)

--
Ravi

On Thu, Sep 25, 2014 at 6:28 AM, Aaron McCurry <[email protected]> wrote:

> On Wed, Sep 24, 2014 at 2:38 AM, Ravikumar Govindarajan <
> [email protected]> wrote:
>
> > It is great to see such a detailed analysis of a simple one-liner of
> > code. Many thanks...
> >
> > Can we have a blur-config switch to decide whether background segment
> > merges in HDFS should go via CFS, with documentation of the
> > double-write issue? Applications facing open-file-handle issues could
> > at least temporarily alleviate the problem...
>
> I don't see why not. However, I would like to understand why it is a
> problem if you are using the HDFS directories from Blur, because if you
> are, you really shouldn't have a problem with open file handles.
>
> Aaron
>
> > The default can always be non-CFS.
> >
> > --
> > Ravi
> >
> > On Mon, Sep 22, 2014 at 11:54 PM, Aaron McCurry <[email protected]>
> > wrote:
> >
> > > The biggest reason for using compound files in Lucene is to lower
> > > the number of open file handles on Linux-based systems. However,
> > > this comes at the cost of double-writing the data: once for writing
> > > the files normally, and once for writing them into the compound
> > > file. Also, once the segment sizes get large, the compound file is
> > > turned off because of the double-write issue.
> > >
> > > When writing data in Blur the number of open files is still a
> > > concern, but less so; let me explain.
> > >
> > > During the MR bulk ingest:
> > >
> > > The index is built using an output format and then optimized during
> > > the copy from the local indexing reducer to HDFS. So you end up with
> > > a fully optimized index, or one additional (hopefully large) segment
> > > to be added to the main index. Compound files here would only slow
> > > down the copy, and if the segments are large enough Lucene wouldn't
> > > create them anyway.
> > >
> > > During NRT updates:
> > >
> > > The normal process for Blur is to use the JoinDirectory, which
> > > merges short-term and long-term storage into a single directory.
> > > The long-term storage is typically the HdfsDirectory, and segments
> > > are only written there once they have been through the merge
> > > scheduler. This means blocking merges due to NRT updates or flushes
> > > (these are both small merges) are written to short-term storage
> > > instead of the slower long-term storage. The short-term storage is a
> > > directory backed by the HdfsKeyValueStore. This store writes all the
> > > logical files into a single log-style file and syncs to HDFS when
> > > commit is called. So in a sense the HdfsKeyValueStore is a
> > > compound-file writer; it just doesn't have to write the data twice.
> > >
> > > So that's why the compound-file feature of Lucene is disabled in
> > > Blur.
> > >
> > > Does that answer your question?
> > >
> > > Aaron
> > >
> > > On Monday, September 22, 2014, Ravikumar Govindarajan <
> > > [email protected]> wrote:
> > >
> > > > I came across the code in BlurIndexSimpleWriter that uses
> > > > mergePolicy.setUseCompoundFile(false);
> > > >
> > > > Are there particular reasons for avoiding compound files, given
> > > > that Lucene encourages using them by default?
> > > >
> > > > --
> > > > Ravi
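Here is the sketch I mentioned above. It is not Blur's actual HdfsIndexInput, just a stripped-down, hypothetical wrapper around an FSDataInputStream (names are made up) to make the question concrete: close the underlying stream when the reader is closed, instead of only when HdfsDirectory.delete() runs.

    import java.io.Closeable;
    import java.io.IOException;

    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Hypothetical reader, only to illustrate the question; not the real
    // HdfsIndexInput.
    public class ClosingHdfsReader implements Closeable {

      private final FSDataInputStream input;

      public ClosingHdfsReader(FileSystem fileSystem, Path path) throws IOException {
        // One HDFS stream (and therefore one handle) per open logical file.
        this.input = fileSystem.open(path);
      }

      public void readFully(long position, byte[] buffer, int offset, int length)
          throws IOException {
        // Positioned read; does not depend on the stream's current offset.
        input.readFully(position, buffer, offset, length);
      }

      @Override
      public void close() throws IOException {
        // The point of the question: release the handle here, when the
        // reader is closed, rather than only when the file is deleted.
        input.close();
      }
    }

If clones or other readers share the same underlying stream, that would explain deferring the close to delete(); that is really what I am trying to understand.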
