[jira] [Commented] (LUCENE-3425) NRT Caching Dir to allow for exact memory usage, better buffer allocation and "global" cross indices control
[ https://issues.apache.org/jira/browse/LUCENE-3425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13766430#comment-13766430 ]

caviler commented on LUCENE-3425:
---------------------------------

Because inside IndexSearcher.search a thread pool executes a per-segment search on every segment, if one segment is too big, that segment's search is slow, and eventually the entire search call is slow. For example, with 2G of index files:

Using AverageMergePolicy, IndexSearcher.search takes 1s:
  segment  size    segment.search time
  1        500M    1s
  2        500M    1s
  3        500M    1s
  4        500M    1s

Using another MergePolicy, IndexSearcher.search takes 5s:
  segment  size    segment.search time
  1        2000M   5s

Why use AverageMergePolicy instead of LogByteSizeMergePolicy? Because:
1. I want every segment to be as small as possible.
2. I want as many segments as possible.

We don't know how big a single segment will get while the index is growing, so we can't use LogByteSizeMergePolicy. If we used LogByteSizeMergePolicy with setMaxMergeMB(200):
  index = 200M:  LogByteSizeMergePolicy = 1 segment (200M each),   AverageMergePolicy = 4 segments (50M each)
  index = 2000M: LogByteSizeMergePolicy = 10 segments (200M each), AverageMergePolicy = 4 segments (500M each)

> NRT Caching Dir to allow for exact memory usage, better buffer allocation and
> "global" cross indices control
>
> Key: LUCENE-3425
> URL: https://issues.apache.org/jira/browse/LUCENE-3425
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/index
> Affects Versions: 3.4, 4.0-ALPHA
> Reporter: Shay Banon
> Fix For: 5.0, 4.5
>
> A discussion on IRC raised several improvements that can be made to NRT
> caching dir. Some of the problems it currently has are:
> 1. Not explicitly controlling the memory usage, which can result in overusing
> memory (for example, large new segments being committed because refreshing is
> too far behind).
> 2. Heap fragmentation because of constant allocation of (probably promoted to
> old gen) byte buffers.
> 3. Not being able to control the memory usage across indices for multi index
> usage within a single JVM.
> A suggested solution (which still needs to be ironed out) is to have a
> BufferAllocator that controls allocation of byte[], and allow to return
> unused byte[] to it. It will have a cap on the size of memory it allows to be
> allocated.
> The NRT caching dir will use the allocator, which can either be provided (for
> usage across several indices) or created internally. The caching dir will
> also create a wrapped IndexOutput, that will flush to the main dir if the
> allocator can no longer provide byte[] (exhausted).
> When a file is "flushed" from the cache to the main directory, it will return
> all the currently allocated byte[] to the BufferAllocator to be reused by
> other "files".

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
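The argument in the comment above can be sketched as a toy model (plain Java, not Lucene code), assuming per-segment searches run in parallel and search time is roughly proportional to segment size: total latency is then bounded by the slowest segment, not the sum, so four 500M segments finish in about the time of one, while a single 2000M segment takes four times as long.

```java
import java.util.Arrays;

// Toy latency model: with parallel per-segment search, total latency
// is the maximum per-segment time; class and numbers are illustrative.
public class SegmentLatency {
    // Assumption: search time proportional to size, 1s per 500MB.
    static double searchSeconds(long segmentMB) {
        return segmentMB / 500.0;
    }

    // Parallel execution: the slowest segment dominates.
    static double parallelLatency(long[] segmentsMB) {
        return Arrays.stream(segmentsMB)
                .mapToDouble(SegmentLatency::searchSeconds)
                .max().orElse(0);
    }

    public static void main(String[] args) {
        long[] averaged = {500, 500, 500, 500}; // AverageMergePolicy layout
        long[] single   = {2000};               // one big segment
        System.out.println(parallelLatency(averaged)); // 1.0
        System.out.println(parallelLatency(single));   // 4.0
    }
}
```

This is of course an idealization; the comment's measured 5s for the 2000M segment suggests per-segment cost can grow faster than linearly in practice.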
[jira] [Commented] (LUCENE-3425) NRT Caching Dir to allow for exact memory usage, better buffer allocation and "global" cross indices control
[ https://issues.apache.org/jira/browse/LUCENE-3425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13766169#comment-13766169 ]

caviler commented on LUCENE-3425:
---------------------------------

Yes, I changed the default MergePolicy to AverageMergePolicy, and I close the old reader after openIfChanged. Before upgrading to Lucene 4.4.0 this AverageMergePolicy worked well. Recently we upgraded to Lucene 4.4.0 and adapted AverageMergePolicy to version 4.4.0; after that, in testing (adding a lot of documents in a very short time), we noticed there were very many small segments (more than 40,000), and those small segments never seem to get a chance to be merged. What is the difference between the 3.6.2 and 4.4.0 merge behavior? Our index is about 72G, split into 148 segments by AverageMergePolicy, each segment about 500M.

{code}
import java.io.IOException;
import java.util.Collection;
import java.util.Collections;
import java.util.Map;

import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.MergePolicy;
import org.apache.lucene.index.MergeTrigger;
import org.apache.lucene.index.SegmentInfoPerCommit;
import org.apache.lucene.index.SegmentInfos;

public class AverageMergePolicy extends MergePolicy {

    /**
     * Default noCFSRatio. If a merge's size is >= 10% of the index, then we
     * disable compound file for it.
     *
     * @see #setNoCFSRatio
     */
    public static final double DEFAULT_NO_CFS_RATIO = 0.1;

    private long maxSegmentSizeMB = 100L;
    protected double noCFSRatio = DEFAULT_NO_CFS_RATIO;
    private boolean partialExpunge = false;
    // protected boolean useCompoundFile = true;

    public AverageMergePolicy() {
    }

    @Override
    public void close() {
    }

    @Override
    public MergeSpecification findForcedDeletesMerges(final SegmentInfos infos)
            throws CorruptIndexException, IOException {
        try {
            // Pick the segment with the most deletions.
            SegmentInfoPerCommit best = null;
            final int numSegs = infos.size();
            for (int i = 0; i < numSegs; i++) {
                final SegmentInfoPerCommit info = infos.info(i);
                if (info.hasDeletions()) {
                    if (null == best || info.getDelCount() > best.getDelCount()) {
                        best = info;
                    }
                }
            }
            if (null == best) {
                return null;
            }
            final Collection<SegmentInfoPerCommit> mergingSegments =
                    writer.get().getMergingSegments();
            if (mergingSegments.contains(best)) {
                return null; // already being merged, skip it
            }
            final MergeSpecification spec = new MergeSpecification();
            spec.add(new OneMerge(Collections.singletonList(best)));
            return spec;
        } catch (final Throwable e) {
            e.printStackTrace();
            return null;
        }
    }

    @Override
    public MergeSpecification findForcedMerges(final SegmentInfos infos,
            final int maxNumSegments, final Map segmentsToMerge) throws IOException {
        return findMerges(MergeTrigger.EXPLICIT, infos);
    }

    @Override
    public MergeSpecification findMerges(final MergeTrigger mergeTrigger,
            final SegmentInfos infos) throws IOException {
        final long maxSegSize = maxSegmentSizeMB * 1024L * 1024L;
        long bestSegSize = maxSegSize;
        try {
            final int numSegs = infos.size();
            int numBestSegs = numSegs;
            {
                SegmentInfoPerCommit info;
                long totalSegSize = 0;
                // compute the total size of segments
                for (int i = 0; i < numSegs; i++) {
                    info = infos.info(i);
                    final long size = size(info);
                    totalSegSize += size;
                }
                numBestSegs = (int) ((totalSegSize + bestSegSize - 1) / bestSegSize);
                bestSegSize = (numBestSegs == 0) ? totalSegSize
                        : (totalSegSize + maxSegSize - 1) / numBestSegs;
                if (bestSegSize > maxSegSize) {
                    bestSegSize = maxSegSize;
                    numBestSegs = (int) ((totalSegSize + maxSegSize - 1) / bestSegSize);
                }
            }
            MergeSpecification spec = findOneMerge(infos, bestSegSize);
            if (null == spec && partialExpunge) {
                final OneMerge expunge = findOneSegmentToExpunge(infos, 0);
                if (expunge != null) {
                    spec = new MergeSpecification();
                    spec.add(expunge);
                }
            }
            // MergeLogger.collect(spec);
            Application.sleep(100);
            return spec;
        } catch (final Throwable e) {
            e.printStackTrace();
            return null;
        }
    }

    // findOneMerge, findOneSegmentToExpunge, and Application.sleep are
    // referenced above but truncated in the original message.
}
{code}
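To make the sizing arithmetic in findMerges concrete, here it is extracted as a standalone sketch (plain Java, values in MB; the class and method names are illustrative, not part of the policy above). For a 2000M index with a 500M cap it yields 4 target segments of 500M each, matching the example in the earlier comment.

```java
// Standalone sketch of the findMerges target-size arithmetic (MB units).
public class AverageSizing {
    /** Returns {numBestSegs, bestSegSizeMB} for a given total size and cap. */
    static long[] target(long totalMB, long maxSegMB) {
        long bestSegSize = maxSegMB;
        // How many segments of at most maxSegMB are needed (ceiling division).
        long numBestSegs = (totalMB + bestSegSize - 1) / bestSegSize;
        // Spread the total evenly over that many segments.
        bestSegSize = (numBestSegs == 0) ? totalMB
                : (totalMB + maxSegMB - 1) / numBestSegs;
        if (bestSegSize > maxSegMB) { // never exceed the cap
            bestSegSize = maxSegMB;
            numBestSegs = (totalMB + maxSegMB - 1) / bestSegSize;
        }
        return new long[] { numBestSegs, bestSegSize };
    }

    public static void main(String[] args) {
        long[] t = target(2000, 500);
        System.out.println(t[0] + " segments of ~" + t[1] + "MB"); // 4 segments of ~500MB
    }
}
```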
[jira] [Commented] (LUCENE-3425) NRT Caching Dir to allow for exact memory usage, better buffer allocation and "global" cross indices control
[ https://issues.apache.org/jira/browse/LUCENE-3425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765217#comment-13765217 ]

caviler commented on LUCENE-3425:
---------------------------------

Actually, in my case, using NRTCachingDir always causes an OutOfMemoryError! After it happened, we analysed a gc.hprof file with Eclipse MAT, and we believe the following: a core segment reader holds large RAMFile instances created by NRTCachingDir.createOutput. If the core segment reader is never merged or closed, those RAMFile instances use more and more buffer but never get a chance to be uncached from the NRTCachingDir instance. So once the buffer size allocated across all segment readers is bigger than what we have, it causes an OutOfMemoryError!
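The retention pattern described in the last comment can be sketched with a toy cache (plain Java, not the actual NRTCachingDir internals; all names here are illustrative): cached files only release their buffers on an explicit delete, so files pinned by long-lived readers never leave, and retained bytes grow with every cached flush regardless of any intended cap.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the reported leak: an in-RAM file cache whose entries are
// released only on delete. Files held by open readers stay cached forever,
// so total retained memory grows without bound.
public class PinnedCacheModel {
    private final Map<String, byte[]> cache = new HashMap<>();

    void createOutput(String name, int sizeBytes) {
        cache.put(name, new byte[sizeBytes]); // buffer held until delete
    }

    void delete(String name) {
        cache.remove(name); // the only path that frees a buffer
    }

    long retainedBytes() {
        return cache.values().stream().mapToLong(b -> b.length).sum();
    }

    public static void main(String[] args) {
        PinnedCacheModel dir = new PinnedCacheModel();
        // Simulate many small NRT segments whose readers stay open:
        for (int i = 0; i < 100; i++) {
            dir.createOutput("_" + i + ".cfs", 1 << 16); // 64KB each, never deleted
        }
        System.out.println(dir.retainedBytes()); // 6553600
    }
}
```

This matches the MAT finding: the fix direction discussed in this issue is to bound and recycle the byte[] buffers via an allocator, rather than letting per-file buffers accumulate.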