The biggest reason for using compound files in Lucene is to lower the number of open file handles on Linux-based systems. However, this comes at the cost of writing the data twice: once for writing the files normally, and once for writing them into the compound file. Also, once segment sizes get large, the compound file feature is turned off anyway because of that double-write issue.
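For reference, turning the feature off is just a small tweak on the merge policy when the IndexWriter is configured. A rough sketch (exact constructors and method names depend on the Lucene 4.x version in use):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.TieredMergePolicy;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    public class NoCompoundFileExample {
      public static IndexWriter openWriter(Directory dir) throws Exception {
        IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_43,
            new StandardAnalyzer(Version.LUCENE_43));
        TieredMergePolicy mergePolicy = new TieredMergePolicy();
        // Never write .cfs files; every segment keeps its individual files.
        mergePolicy.setUseCompoundFile(false);
        conf.setMergePolicy(mergePolicy);
        return new IndexWriter(dir, conf);
      }

      public static void main(String[] args) throws Exception {
        IndexWriter writer = openWriter(new RAMDirectory());
        writer.close();
      }
    }

With that flag off every segment keeps its individual files, which is what Blur wants given the two write paths described next.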
While writing data in Blur the number of open files is still a concern, but less so; let me explain.

During MR bulk ingest: the index is built using an output format and then optimized during the copy from the local indexing reducer to HDFS. So you end up with either a fully optimized index or one additional (hopefully large) segment to be added to the main index. Compound files here would only slow down the copy, and if the segments are large enough Lucene wouldn't create them anyway.

During NRT updates: the normal process for Blur is to use the JoinDirectory, which merges short-term and long-term storage into a single directory. The long-term storage is typically the HdfsDirectory, and segments are only written there once they have been through the merge scheduler. This means blocking merges due to NRT updates or flushes (these are both small merges) are written to short-term storage instead of the slower long-term storage. The short-term storage is a directory backed by the HdfsKeyValueStore. This store writes all the logical files into a single log-style file and syncs to HDFS when commit is called. So in a sense the HdfsKeyValueStore is a compound file writer, it just doesn't have to write the data twice.

So that's why the compound file feature of Lucene is disabled in Blur. Does that answer your question?

Aaron

On Monday, September 22, 2014, Ravikumar Govindarajan <[email protected]> wrote:

> I came across the code in BlurIndexSimpleWriter that is using
> mergePolicy.setUseCompoundFile(false);
>
> Are there particular reasons for avoiding the compound file, because Lucene
> encourages using compound files by default?
>
> --
> Ravi
>
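To make the log-style idea concrete, here is a toy sketch of the pattern described above. It is purely illustrative (it writes to a local file rather than HDFS and is not the actual HdfsKeyValueStore code): each logical file is appended as a record to one physical log, and nothing is forced to storage until commit.

    import java.io.DataOutputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;

    // Toy log-style store: every logical file is appended as a
    // (name, length, data) record to one physical file. Nothing is
    // durable until commit() forces the bytes to storage, mirroring
    // the "sync on commit" behavior of the short-term store.
    public class ToyLogStore implements AutoCloseable {
      private final FileOutputStream out;
      private final DataOutputStream data;

      public ToyLogStore(String path) throws IOException {
        this.out = new FileOutputStream(path, true); // append-only
        this.data = new DataOutputStream(out);
      }

      public void writeLogicalFile(String name, byte[] bytes) throws IOException {
        data.writeUTF(name);
        data.writeInt(bytes.length);
        data.write(bytes);
      }

      public void commit() throws IOException {
        data.flush();
        out.getFD().sync(); // against HDFS this would be a flush/sync on the output stream
      }

      @Override
      public void close() throws IOException {
        data.close();
      }
    }

Each logical Lucene file becomes one record in the log, so there is a single open handle and a single write path, with no second copy into a .cfs file.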
