[ https://issues.apache.org/jira/browse/PHOENIX-1304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14370798#comment-14370798 ]
James Taylor edited comment on PHOENIX-1304 at 3/20/15 5:57 AM:
----------------------------------------------------------------

[~samarthjain] - I think the idea is good, but the implementation can be quite a bit simpler. I don't think you need to track region servers at all, and the logic can be completely isolated to BaseQueryPlan.iterator(final List<? extends SQLCloseable> dependencies):
- Add a member variable to BaseQueryPlan for iterators (either ParallelIterators or SerialIterators).
- If we're running serially or we're doing a skip scan, don't bother checking stats or setting the no_cache hint.
- Otherwise, estimate the bytes traversed given iterators.getSplits(). You can get an estimate of the guidepost width by getting the GuidePostsInfo for the empty column family and computing guidePostsInfo.getByteCount() / guidePostsInfo.getGuidePosts().size(). If you multiply this by iterators.getSplits().size(), that's the approximate number of bytes traversed.
- Finally, if the total bytes traversed exceeds your configured threshold (which we may want to default to the block cache size if not set), then set the no_cache value right on the scan here in the iterator method.


was (Author: jamestaylor):
[~samarthjain] - I think the idea is good, but the implementation can be quite a bit simpler. I don't think you need to track region servers at all, and the logic can be completely isolated to BaseQueryPlan.iterator(final List<? extends SQLCloseable> dependencies):
- Add a member variable to BaseQueryPlan for iterators (either ParallelIterators or SerialIterators).
- If we're running serially or we're doing a skip scan, don't bother checking stats or setting the no_cache hint.
- Otherwise, estimate the bytes traversed given iterators.getSplits(). You can get an estimate of the guidepost width by getting the GuidePostsInfo for the empty column family and computing guidePostsInfo.getByteCount() / guidePostsInfo.getGuidePosts().size(). If you multiply this by iterators.getSplits().size(), that's the approximate number of bytes traversed.

> Auto-detect if we should pass the NO_CACHE hint
> -----------------------------------------------
>
>                 Key: PHOENIX-1304
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-1304
>             Project: Phoenix
>          Issue Type: Improvement
>            Reporter: Lars Hofhansl
>            Assignee: Samarth Jain
>            Priority: Minor
>         Attachments: wip.patch
>
>
> Most databases by default avoid filling the block cache during full scans.
> Typically either stats are consulted to decide whether a full scan should
> fill the blockcache, or a subset of the block cache is dedicated to full scans,
> using the cache like a ring buffer.
> We already have the "NO_CACHE" hint, but we can do better.
> In Phoenix we could detect scans that neither use any parts of the key nor
> any indexes and then optionally:
> # avoid using the blockcache
> # throw a "slow query" exception (this is especially useful for large data
> sets, where we'd rather fail than go into a nirvana for an hour)
> (both configurable - either globally or per table or connection or query)
> Skip scans represent an interesting middle ground. If we skip many blocks
> between rows we'd definitely benefit from the blockcache; if not, we have a
> case similar to a full scan.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
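
The estimate described in the edited comment above could be sketched roughly as follows. This is a minimal illustration only, not the Phoenix implementation: the class NoCacheEstimate and its method names are hypothetical, and the inputs are passed as plain numbers standing in for guidePostsInfo.getByteCount(), guidePostsInfo.getGuidePosts().size(), and iterators.getSplits().size(). The threshold value used in main is likewise an assumption.

```java
/** Hypothetical helper for illustration; not part of Phoenix. */
final class NoCacheEstimate {
    private NoCacheEstimate() {}

    /**
     * Approximate bytes traversed: (total bytes / number of guideposts) * number of splits.
     * Returns -1 when no estimate is possible (for example, no guideposts collected yet).
     */
    static long estimateBytesTraversed(long totalByteCount, int guidePostCount, int splitCount) {
        if (totalByteCount < 0 || guidePostCount <= 0 || splitCount < 0) {
            return -1L;
        }
        long guidePostWidth = totalByteCount / guidePostCount; // average bytes per guidepost
        return guidePostWidth * splitCount;
    }

    /**
     * Decides whether the scan should bypass the block cache.
     * Serial scans and skip scans are left alone, as the comment suggests.
     */
    static boolean shouldSkipBlockCache(boolean serial, boolean skipScan,
                                        long totalByteCount, int guidePostCount,
                                        int splitCount, long thresholdBytes) {
        if (serial || skipScan) {
            return false; // do not consult stats or set the hint in these cases
        }
        long estimate = estimateBytesTraversed(totalByteCount, guidePostCount, splitCount);
        return estimate >= 0 && estimate > thresholdBytes;
    }

    public static void main(String[] args) {
        long thresholdBytes = 256L * 1024 * 1024; // assumed threshold, e.g. a block cache size
        boolean noCache = shouldSkipBlockCache(false, false, 1_000_000_000L, 100, 50, thresholdBytes);
        System.out.println("skip block cache: " + noCache);
    }
}
```

In an actual patch, a true result would presumably be applied to the HBase scan inside BaseQueryPlan.iterator(), for instance via Scan.setCacheBlocks(false); how the threshold is configured and defaulted is left open here, as in the comment.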