[jira] [Commented] (LUCENE-3425) NRT Caching Dir to allow for exact memory usage, better buffer allocation and global cross indices control

2013-09-15 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13767770#comment-13767770
 ] 

Michael McCandless commented on LUCENE-3425:


OK, thanks for the explanation; now I understand AverageMergePolicy's purpose, 
and it makes sense.  It's ironic that a fully optimized index is the worst 
thing you could do when searching segments concurrently ...

But, I still don't understand why AverageMergePolicy is not merging the little 
segments from NRTCachingDir.  Do you tell it to target a maximum number of 
segments in the index?  If so, once the index is large enough, it seems like 
that'd force the small segments to be merged.  Maybe you could also tell it a 
minimum segment size, so that it would merge away any segments still 
held in NRTCachingDir?
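That kind of minimum-size floor can be sketched in isolation (a standalone illustration with assumed names and an assumed 10 MB floor, not Lucene's actual MergePolicy API):

```java
import java.util.ArrayList;
import java.util.List;

public class FloorMergeSketch {
    /** Collects the indexes of all segments smaller than floorBytes into one candidate merge. */
    static List<Integer> selectSmallSegments(long[] segmentSizes, long floorBytes) {
        List<Integer> merge = new ArrayList<>();
        for (int i = 0; i < segmentSizes.length; i++) {
            if (segmentSizes[i] < floorBytes) {
                merge.add(i); // tiny segment, likely still cached by NRTCachingDir
            }
        }
        // Only propose a merge when there is more than one small segment to fold together.
        return merge.size() > 1 ? merge : List.of();
    }

    public static void main(String[] args) {
        long MB = 1024L * 1024L;
        long[] sizes = {500 * MB, 500 * MB, 2 * MB, 1 * MB, 3 * MB};
        System.out.println(selectSmallSegments(sizes, 10 * MB)); // indexes of the small segments
    }
}
```

A policy with such a floor would sweep the NRTCachingDir-resident segments into a merge regardless of how many total segments it is otherwise targeting.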

 NRT Caching Dir to allow for exact memory usage, better buffer allocation and 
 global cross indices control
 

 Key: LUCENE-3425
 URL: https://issues.apache.org/jira/browse/LUCENE-3425
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.4, 4.0-ALPHA
Reporter: Shay Banon
 Fix For: 5.0, 4.5


 A discussion on IRC raised several improvements that can be made to NRT 
 caching dir. Some of the problems it currently has are:
 1. Not explicitly controlling the memory usage, which can result in overusing 
 memory (for example, large new segments being committed because refreshing is 
 too far behind).
 2. Heap fragmentation because of constant allocation of (probably promoted to 
 old gen) byte buffers.
 3. Not being able to control the memory usage across indices for multi index 
 usage within a single JVM.
 A suggested solution (which still needs to be ironed out) is to have a 
 BufferAllocator that controls allocation of byte[], and allow to return 
 unused byte[] to it. It will have a cap on the size of memory it allows to be 
 allocated.
 The NRT caching dir will use the allocator, which can either be provided (for 
 usage across several indices) or created internally. The caching dir will 
 also create a wrapped IndexOutput, that will flush to the main dir if the 
 allocator can no longer provide byte[] (exhausted).
 When a file is flushed from the cache to the main directory, it will return 
 all the currently allocated byte[] to the BufferAllocator to be reused by 
 other files.
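A minimal BufferAllocator along the lines proposed above might look like this (an illustrative sketch of the proposal; none of these names exist in Lucene, and the sizes are placeholders):

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class BufferAllocator {
    private final int bufferSize;  // fixed size of each byte[]
    private final long maxBytes;   // cap on total allocated memory
    private long allocatedBytes;
    private final Deque<byte[]> free = new ArrayDeque<>(); // returned, reusable buffers

    public BufferAllocator(int bufferSize, long maxBytes) {
        this.bufferSize = bufferSize;
        this.maxBytes = maxBytes;
    }

    /** Returns a buffer, reusing a released one if possible; null when the cap is exhausted. */
    public synchronized byte[] allocate() {
        byte[] b = free.pollFirst();
        if (b != null) return b;
        if (allocatedBytes + bufferSize > maxBytes) {
            return null; // exhausted: the caching dir should flush to the main directory
        }
        allocatedBytes += bufferSize;
        return new byte[bufferSize];
    }

    /** Returns an unused buffer to the pool, for reuse by other files. */
    public synchronized void release(byte[] b) {
        free.addLast(b);
    }
}
```

Shared across several NRTCachingDirs, a single instance like this would give the global cross-index memory control the issue asks for.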

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3425) NRT Caching Dir to allow for exact memory usage, better buffer allocation and global cross indices control

2013-09-13 Thread caviler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766169#comment-13766169
 ] 

caviler commented on LUCENE-3425:
-

Yes, I changed the default MergePolicy to AverageMergePolicy, and I close the old 
reader after reopenIfChanged.

Before upgrading to Lucene 4.4.0, this AverageMergePolicy worked well. Recently 
we upgraded to Lucene 4.4.0 and modified AverageMergePolicy to adapt it to 
version 4.4.0. After this, in testing (we add a lot of documents in a very short 
time), we noticed there are very many small segments (more than 40,000), and 
those small segments never seem to get a chance to be merged.

What is the difference between the 3.6.2 and 4.4.0 behavior in the merge process?


Our index files total about 72G, split into 148 segments by AverageMergePolicy, every 
segment about 500M in size.


{code}
public class AverageMergePolicy extends MergePolicy
{
/**
 * Default noCFSRatio. If a merge's size is >= 10% of the index, then we
 * disable compound file for it.
 * 
 * @see #setNoCFSRatio
 */
public static final double DEFAULT_NO_CFS_RATIO = 0.1;

private long maxSegmentSizeMB = 100L;

protected double noCFSRatio = DEFAULT_NO_CFS_RATIO;

private boolean partialExpunge = false;

protected boolean useCompoundFile = true;

public AverageMergePolicy()
{
}

@Override
public void close()
{
}

@Override
public MergeSpecification findForcedDeletesMerges(final SegmentInfos infos) 
throws CorruptIndexException,
IOException
{
try
{

// 
SegmentInfoPerCommit best = null;

final int numSegs = infos.size();
for (int i = 0; i < numSegs; i++)
{
final SegmentInfoPerCommit info = infos.info(i);
if (info.hasDeletions())
{
if (null == best || info.getDelCount() > best.getDelCount())
{
best = info;
}
}
}

final Collection<SegmentInfoPerCommit> mergingSegments = writer.get().getMergingSegments();

if (mergingSegments.contains(best))
{
return null; // skip merging segment
}

final MergeSpecification spec = new MergeSpecification();
if (null != best)
{
spec.add(new OneMerge(Collections.singletonList(best)));
}

return spec;
}
catch (final Throwable e)
{
e.printStackTrace();
return null;
}
}

@Override
public MergeSpecification findForcedMerges(final SegmentInfos infos,
   final int maxNumSegments,
   final Map<SegmentInfoPerCommit, Boolean> segmentsToMerge)
throws IOException
{
return findMerges(MergeTrigger.EXPLICIT, infos);
}

@Override
public MergeSpecification findMerges(final MergeTrigger mergeTrigger, final 
SegmentInfos infos) throws IOException
{
// partialExpunge = false; //
// partialExpunge = true;

final long maxSegSize = maxSegmentSizeMB * 1024L * 1024L;
long bestSegSize = maxSegSize;

try
{
final int numSegs = infos.size();

int numBestSegs = numSegs;
{
// 
SegmentInfoPerCommit info;
long totalSegSize = 0;

// compute the total size of segments
for (int i = 0; i < numSegs; i++)
{
info = infos.info(i);
final long size = size(info);
totalSegSize += size;
}

numBestSegs = (int) ((totalSegSize + bestSegSize - 1) / bestSegSize);

bestSegSize = (numBestSegs == 0) ? totalSegSize : (totalSegSize + maxSegSize - 1) / numBestSegs;
if (bestSegSize > maxSegSize)
{
bestSegSize = maxSegSize;
numBestSegs = (int) ((totalSegSize + maxSegSize - 1) / bestSegSize);
}
}

MergeSpecification spec = findOneMerge(infos, bestSegSize);

//int branch = 0;
if (null == spec && partialExpunge)
{
//branch = 1;
// 
final OneMerge expunge = findOneSegmentToExpunge(infos, 0);
if (expunge != null)
{
spec = new MergeSpecification();
spec.add(expunge);
}
}

// MergeLogger.collect(branch, spec);
{code}

[jira] [Commented] (LUCENE-3425) NRT Caching Dir to allow for exact memory usage, better buffer allocation and global cross indices control

2013-09-13 Thread caviler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766430#comment-13766430
 ] 

caviler commented on LUCENE-3425:
-

Because inside the IndexSearcher.search method, a thread pool executes the 
per-segment search on every segment, if one segment is too big, searching that 
big segment is too slow, and eventually the entire search call is too slow.

For example, we have 2G of index files.

=
Using AverageMergePolicy, IndexSearcher.search spent time = 1s

segment   size    segment.search spent time
1         500M    1s
2         500M    1s
3         500M    1s
4         500M    1s

=
Using another MergePolicy, IndexSearcher.search spent time = 5s

segment   size    segment.search spent time
1         2000M   5s

=
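The effect shown above — overall latency tracking the slowest segment — can be sketched with a toy model (assuming search time grows linearly with segment size, roughly 1 s per 500 MB; the constants are illustrative, not measured Lucene numbers):

```java
public class ConcurrentSearchModel {
    /** With one thread per segment, total latency is the slowest segment's time. */
    static double searchSeconds(long[] segmentSizesMB, double secondsPer500MB) {
        double worst = 0;
        for (long sizeMB : segmentSizesMB) {
            worst = Math.max(worst, sizeMB / 500.0 * secondsPer500MB);
        }
        return worst;
    }

    public static void main(String[] args) {
        // 2 GB split into four 500 MB segments vs one 2000 MB segment:
        System.out.println(searchSeconds(new long[]{500, 500, 500, 500}, 1.0)); // 1.0
        System.out.println(searchSeconds(new long[]{2000}, 1.0));               // 4.0
    }
}
```

Under this linear model the four-way split is four times faster, since the concurrent searches finish together instead of waiting on one large segment.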

Why use AverageMergePolicy rather than LogByteSizeMergePolicy?

Because:
1. I want every segment to be as small as possible!
2. I want as many segments as possible!

We don't know how big one segment will be while the entire index is growing, so 
we can't use LogByteSizeMergePolicy.

If we used LogByteSizeMergePolicy with setMaxMergeMB(200):

index =  200M: LogByteSizeMergePolicy =  1 segment  (200M each),  AverageMergePolicy = 4 segments (50M each)
index = 2000M: LogByteSizeMergePolicy = 10 segments (200M each),  AverageMergePolicy = 4 segments (500M each)
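That comparison can be reproduced with a little arithmetic (a sketch of the two policies' target shapes, not their real selection logic; names are assumptions):

```java
public class SegmentCountSketch {
    /** LogByteSizeMergePolicy-style: segments capped at maxMergeMB, so the count grows with the index. */
    static int logStyleSegments(int indexMB, int maxMergeMB) {
        return (indexMB + maxMergeMB - 1) / maxMergeMB; // ceiling division
    }

    /** AverageMergePolicy-style: a fixed target count, so segment size grows with the index. */
    static int averageStyleSegmentSizeMB(int indexMB, int targetSegments) {
        return indexMB / targetSegments;
    }

    public static void main(String[] args) {
        System.out.println(logStyleSegments(200, 200));         // 1 segment of ~200M
        System.out.println(logStyleSegments(2000, 200));        // 10 segments of ~200M
        System.out.println(averageStyleSegmentSizeMB(200, 4));  // 4 segments of ~50M
        System.out.println(averageStyleSegmentSizeMB(2000, 4)); // 4 segments of ~500M
    }
}
```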










[jira] [Commented] (LUCENE-3425) NRT Caching Dir to allow for exact memory usage, better buffer allocation and global cross indices control

2013-09-13 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766406#comment-13766406
 ] 

Michael McCandless commented on LUCENE-3425:


Hmm, I don't understand what AverageMergePolicy is doing.  Can you describe its 
purpose at a high level?  Somehow it's failing to merge those 40,000 small 
segments?

And offhand I don't know what changed between 4.3 and 4.4 that would cause 
AverageMergePolicy to stop merging small segments.

Maybe turn on IndexWriter's infoStream and watch which merges are being 
selected.



[jira] [Commented] (LUCENE-3425) NRT Caching Dir to allow for exact memory usage, better buffer allocation and global cross indices control

2013-09-12 Thread caviler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13765217#comment-13765217
 ] 

caviler commented on LUCENE-3425:
-

Actually, in my case, using NRTCachingDir always causes an OutOfMemoryError!
After it happened, we analyzed a gc.hprof file with Eclipse MAT, and we believe that:
the core segment reader holds big RAMFile instances created by 
NRTCachingDir.createOutput;
if the core segment reader is not merged or closed, those RAMFile instances use 
more and more buffers,
but never get a chance to be uncached from the NRTCachingDir instance,
so when the buffers allocated by all the segment readers grow bigger than the 
memory we have, it causes an OutOfMemoryError!



[jira] [Commented] (LUCENE-3425) NRT Caching Dir to allow for exact memory usage, better buffer allocation and global cross indices control

2013-09-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13765336#comment-13765336
 ] 

Michael McCandless commented on LUCENE-3425:


It's true that open SegmentReaders will hold onto those RAMFiles.

However, those segments should be the smallish ones and they should be merged 
away.

It's odd that in your case you see them lasting so long.

Can you give more details about how you're using it?  Have you changed the 
default MergePolicy/settings?  Are you closing old readers after reopening new 
ones?



[jira] [Commented] (LUCENE-3425) NRT Caching Dir to allow for exact memory usage, better buffer allocation and global cross indices control

2011-09-13 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13103723#comment-13103723
 ] 

Michael McCandless commented on LUCENE-3425:


Actually, NRTCachingDir does explicitly control the RAM usage in that
if its cache is using too much RAM then the next createOutput will go
straight to disk.

The one thing it does not do is evict the created files after they
close.  So, if you flush a big segment in IW, then NRTCachingDir will
keep those files in RAM even though it's now over budget.  (But the
next segment to flush will go straight to disk.)

I think this isn't that big a problem in practice; ie, as long as you
set your IW RAM buffer to something not too large, or you ensure you
are opening a new NRT reader often enough that the accumulated docs
won't create a very large segment, then the excess RAM used by
NRTCachingDir will be bounded.

Still, it would be nice to fix it so it evicts the files that put it
over, such that it's always below the budget once the output is
closed.  And I agree we should make it possible to have a single pool
for accounting purposes, so you can share this pool across multiple
NRTCachingDirs (and other things that use RAM).
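The budget behavior described in this comment, including the proposed eviction fix, can be sketched in isolation (assumed names and a byte-count stand-in for real files; NRTCachingDirectory's actual implementation differs):

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class RamBudgetSketch {
    private final long maxCachedBytes;
    private long cachedBytes;
    private final Deque<Long> cachedFiles = new ArrayDeque<>(); // oldest first

    RamBudgetSketch(long maxCachedBytes) { this.maxCachedBytes = maxCachedBytes; }

    /** Mirrors createOutput: cache only if we are currently under budget. */
    boolean cacheNewFile(long sizeBytes) {
        if (cachedBytes >= maxCachedBytes) {
            return false; // over budget: write straight to disk
        }
        cachedFiles.addLast(sizeBytes);
        cachedBytes += sizeBytes; // may overshoot; the current code keeps the file anyway
        return true;
    }

    /** The proposed fix: after an output closes, evict oldest files until under budget. */
    void evictToBudget() {
        while (cachedBytes > maxCachedBytes && !cachedFiles.isEmpty()) {
            cachedBytes -= cachedFiles.removeFirst(); // flush this file to the main dir
        }
    }

    long cachedBytes() { return cachedBytes; }
}
```

With a 60-byte budget, caching a 100-byte flush overshoots to 100; evictToBudget() then brings the cache back under the cap, matching the "always below the budget once the output is closed" goal.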




[jira] [Commented] (LUCENE-3425) NRT Caching Dir to allow for exact memory usage, better buffer allocation and global cross indices control

2011-09-09 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13101567#comment-13101567
 ] 

Michael McCandless commented on LUCENE-3425:


Also, a quick win on trunk is to use IOContext's FlushInfo.estimatedSegmentSize to 
decide up front whether to try caching or not.

I.e., if the to-be-flushed segment is too large we should not cache it.
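That up-front check is simple to sketch (the method name and threshold are assumptions; in Lucene the estimate would come from the FlushInfo carried by the IOContext passed to createOutput):

```java
public class FlushCacheCheck {
    /** Cache a to-be-flushed file only if its estimated segment size fits the cache budget. */
    static boolean shouldCache(long estimatedSegmentBytes, long maxCachedBytes) {
        return estimatedSegmentBytes <= maxCachedBytes;
    }

    public static void main(String[] args) {
        long MB = 1024L * 1024L;
        System.out.println(shouldCache(5 * MB, 48 * MB));   // true: small flush, cache it
        System.out.println(shouldCache(200 * MB, 48 * MB)); // false: go straight to disk
    }
}
```

Deciding before writing avoids ever pulling an oversized segment into RAM, rather than discovering the overshoot after the fact.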

