[ https://issues.apache.org/jira/browse/LUCENE-7258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15267176#comment-15267176 ]
Jeff Wartes commented on LUCENE-7258:
-------------------------------------

Ok, yeah, that's a reasonable thing to assume. We usually think of it in terms of CPU work, but filter caches would be an equally good way to mitigate allocations. But a cache is really only useful when you've got non-uniform query distributions, or enough time-locality at your query rate that your rare queries haven't faced a cache eviction yet.

I'm indexing address-type data. Not uncommon. I think that if my typical geospatial search were based on some hyper-local phone location, we'd be done talking, since a filter cache would be useless. So maybe we should assume I'm not doing that. Let's assume I can get away with something coarse. Let's assume I can convert all location-based queries to the center point of a city. Let's further assume that I only care about one radius per city. Finally, let's assume I'm only searching in the US. There are some 40,000 cities in the US, so those assumptions yield 40,000 possible queries. That's not too bad.

With a 100M-doc core, I think that's about 12.5MB per filter cache entry. It could be less, I think, particularly with the changes in SOLR-8922, but since we're only going with coarse queries, it's reasonable to assume there are going to be a lot of hits. I don't need every city in the cache, of course, so maybe... 5%? That's only some 25GB of heap. Doable, especially since it saves allocation size and you could probably trade in more of the eden space. (Although this would make warmup more of a pain.) I'd probably have to cross the CompressedOops boundary at 32GB of heap to do that too, though, so add another 16GB to get back to baseline.

Fortunately, the top 5% of cities probably maps to more than 5% of queries. More populated cities are also more likely targets for searching in most query corpora. So assuming it's the biggest 5% that are in the cache, maybe we can assume a 15% hit rate? 20%?
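The heap math above can be sketched as a quick sanity check. This assumes each filter cache entry is a FixedBitSet-style structure at one bit per document; the class and variable names here are just for illustration:

```java
// Back-of-envelope filter cache sizing for the scenario described above.
public class FilterCacheSizing {
    public static void main(String[] args) {
        long numDocs = 100_000_000L;                    // 100M-doc core
        long bytesPerEntry = numDocs / 8;               // one bit per doc, FixedBitSet-style
        long possibleQueries = 40_000L;                 // roughly one query per US city
        long cachedEntries = possibleQueries * 5 / 100; // cache only the top 5% of cities

        double mbPerEntry = bytesPerEntry / 1_000_000.0;
        double gbOfHeap = (double) cachedEntries * bytesPerEntry / 1_000_000_000.0;

        System.out.printf("%.1f MB per entry%n", mbPerEntry); // ~12.5 MB
        System.out.printf("%.1f GB of heap%n", gbOfHeap);     // ~25 GB
    }
}
```

The per-entry size could shrink with sparse representations (as SOLR-8922 discusses), but coarse city-radius queries will tend to match many documents, so the dense bitset estimate is the conservative one.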
Ok, so now I've spent something like 41GB of heap, and I've reduced allocations by 20%. Is this pretty good?

I suppose it's worth noting that this also assumes a perfect cache eviction policy (I'm pretty interested in SOLR-8241), and that there's no other filter cache pressure. (At the least, I'm using facets - SOLR-8171.)

> Tune DocIdSetBuilder allocation rate
> ------------------------------------
>
>                 Key: LUCENE-7258
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7258
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/spatial
>            Reporter: Jeff Wartes
>         Attachments: LUCENE-7258-Tune-memory-allocation-rate-for-Intersec.patch, LUCENE-7258-Tune-memory-allocation-rate-for-Intersec.patch, allocation_plot.jpg
>
> LUCENE-7211 converted IntersectsPrefixTreeQuery to use DocIdSetBuilder, but didn't actually reduce garbage generation for my Solr index.
> Since something like 40% of my garbage (by space) is now attributed to DocIdSetBuilder.growBuffer, I charted a few different allocation strategies to see if I could tune things more.
> See here: http://i.imgur.com/7sXLAYv.jpg
> The jump-then-flatline at the right would be where DocIdSetBuilder gives up and allocates a FixedBitSet for a 100M-doc index. (The 1M-doc index curve/cutoff looked similar.)
> Perhaps unsurprisingly, the 1/8th growth factor in ArrayUtil.oversize is terrible from an allocation standpoint if you're doing a lot of expansions, and is especially terrible when used to build a short-lived data structure like this one.
> By the time it goes with the FBS, it's allocated around twice as much memory for the buffer as it would have needed for just the FBS.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
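The cost of repeated 1/8th oversizing described in the quoted issue can be sketched with a toy simulation. This is not the actual DocIdSetBuilder code; the starting size and cutover threshold below are made-up placeholders, and the point is only that cumulative allocation across resizes dwarfs the final buffer:

```java
// Toy model: cumulative allocation when a buffer repeatedly grows by 1/8th
// (as ArrayUtil.oversize does for large arrays) before cutting over to a bitset.
public class GrowthCost {
    public static void main(String[] args) {
        long size = 1 << 10;              // hypothetical starting buffer length
        long cumulative = size;           // counts every array ever allocated
        long cutover = 100_000_000L / 32; // hypothetical cutover point (assumption)

        int resizes = 0;
        while (size < cutover) {
            size += size >>> 3;   // grow by 1/8th
            cumulative += size;   // each resize allocates a whole new array
            resizes++;
        }
        // For a 9/8 growth factor, total allocation across resizes approaches
        // finalSize * r/(r-1) = ~9x the final buffer, even though each individual
        // buffer is only ~12.5% larger than the previous one.
        System.out.println("resizes: " + resizes);
        System.out.println("cumulative / final: " + (double) cumulative / size);
    }
}
```

How that ~9x of short-lived garbage compares to the size of the eventual FixedBitSet depends on where the cutover happens and on the buffer's bytes-per-doc versus the bitset's one bit per doc, which is where the "around twice as much memory as the FBS" observation comes from.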