It's great to see such a detailed analysis of a simple one-liner. Many thanks...

Could we have a blur-config switch that decides whether background segment merges in HDFS go via CFS, with documentation of the double-write issue? Applications facing open-file-handle problems could then at least temporarily alleviate them.
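Something along these lines is what I have in mind -- only a rough sketch, not a patch: the property name is made up, I'm assuming BlurConfiguration exposes a getBoolean(name, default) accessor, and the real wiring inside BlurIndexSimpleWriter may look different.

import org.apache.blur.BlurConfiguration;          // package path from memory
import org.apache.lucene.index.TieredMergePolicy;

public class CompoundFileSwitchSketch {

  // Hypothetical property name -- not an existing Blur configuration key.
  private static final String MERGE_USE_CFS = "blur.shard.merge.use.compound.file";

  static void configureMergePolicy(TieredMergePolicy mergePolicy,
                                   BlurConfiguration configuration) {
    // Assumes a getBoolean(name, default) accessor on BlurConfiguration.
    // Default stays false (non-CFS), matching the existing
    // mergePolicy.setUseCompoundFile(false) behaviour in BlurIndexSimpleWriter.
    boolean useCfs = configuration.getBoolean(MERGE_USE_CFS, false);
    mergePolicy.setUseCompoundFile(useCfs);
  }
}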
The default can always be non-CFS...

--
Ravi

On Mon, Sep 22, 2014 at 11:54 PM, Aaron McCurry <[email protected]> wrote:

> The biggest reason for using compound files in Lucene is to lower the
> number of open file handles on Linux-based systems. However, this comes at
> the cost of writing the data twice: once for writing the files normally and
> once for writing them into the compound file. Also, once segments grow
> large, the compound file is turned off because of the double-write issue.
>
> While writing data in Blur the number of open files is still a concern, but
> less so; let me explain.
>
> During the MR bulk ingest:
>
> The index is built using an output format and then optimized during the
> copy from the local indexing reducer to HDFS. So you end up with a fully
> optimized index, or one additional (hopefully large) segment to be added to
> the main index. Compound files here would only slow down the copy, and if
> the segments are large enough Lucene wouldn't create them anyway.
>
> During NRT updates:
>
> The normal process for Blur is to use the JoinDirectory, which merges short
> term and long term storage into a single directory. The long term storage
> is typically the HdfsDirectory, and segments are only written here once they
> have been through the merge scheduler. This means blocking merges due to NRT
> updates or flushes (these are both small merges) are written to short term
> storage instead of the slower long term storage. The short term storage
> is a directory backed by the HdfsKeyValueStore. This store writes all the
> logical files into a single log-style file and syncs to HDFS when commit is
> called. So in a sense the HdfsKeyValueStore is a compound file writer,
> it just doesn't have to write the data twice.
>
> So that's why the compound file feature of Lucene is disabled in Blur.
>
> Does that answer your question?
>
> Aaron
>
> On Monday, September 22, 2014, Ravikumar Govindarajan <
> [email protected]> wrote:
>
> > I came across the code in BlurIndexSimpleWriter that is using
> > mergePolicy.setUseCompoundFile(false);
> >
> > Are there particular reasons for avoiding the compound file, given that
> > Lucene enables compound files by default?
> >
> > --
> > Ravi
> >
