[ https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16436978#comment-16436978 ]
Maryann Xue commented on PHOENIX-4666:
--------------------------------------

Thank you very much for your work, [~ortutay]! Here are a few high-level points:

1. First of all, I think it's important that we have an option to enable and disable the persistent cache, so that users can still run join queries in the default temp-cache way.

2. Regarding your change [2], can you explain what exactly the problem with key-range generation is? It looks like checkCache() and addCache() are doing redundant work, and CachedSubqueryResultIterator should be unnecessary. We do not wish to read the cache on the client side and then re-add it.

3. We need to be aware that the string representation of the sub-query statement is not reliable: the same join-tables or sub-queries do not necessarily map to the same string representation, and will then get different generated cache-ids. It would be best to have some normalization here. We can consider leaving this as a future improvement, but at this point we should have some test cases (counter cases as well) to cover it.

4. Is there a way for us to update the cache content if the tables have been updated? This might be related to the approach we take to add and re-validate the cache in (2).

5. A rather minor point, as it just occurred to me: can we have CacheEntry implement Closeable?

Lastly, I understand that this is work in progress, but as we move on, could you please do a little clean-up so that discussions and code reviews are easier? For example: correct the indentation (make sure there are no tabs); remove lines of code instead of commenting them out; and get rid of all "System.out.println" calls, or replace them with logging if necessary.
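To make point 3 concrete, here is a minimal sketch of the kind of test case and counter case it asks for. This is not Phoenix code: the class SubqueryCacheIdSketch, the methods cacheId() and normalize(), and the choice of SHA-256 are all assumptions made for illustration, and the patch may well derive its cache-id differently. It only shows that hashing the raw statement text gives different ids for equivalent sub-queries, and that even a naive whitespace normalization removes one class of mismatch.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class SubqueryCacheIdSketch {

    // Hypothetical helper: derive a hex cache-id by hashing the statement text.
    public static String cacheId(String statementText) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(statementText.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Naive normalization (assumption): trim and collapse runs of whitespace.
    // It does not handle keyword case, alias names, predicate order, or
    // whitespace inside string literals, so it is a starting point only.
    public static String normalize(String statementText) {
        return statementText.trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String a = "SELECT id_1 FROM large_table WHERE x = 10";
        String b = "SELECT id_1  FROM large_table\n  WHERE x = 10"; // same query, different spacing
        String c = "SELECT id_1 FROM large_table WHERE x = 11";    // genuinely different query

        // Raw text: equivalent queries get different ids.
        System.out.println("raw a == raw b: " + cacheId(a).equals(cacheId(b)));
        // Normalized text: the spacing difference no longer matters.
        System.out.println("norm a == norm b: "
                + cacheId(normalize(a)).equals(cacheId(normalize(b))));
        // Counter case: different queries must still get different ids.
        System.out.println("norm a == norm c: "
                + cacheId(normalize(a)).equals(cacheId(normalize(c))));
    }
}
```

A more robust approach would presumably normalize the parsed statement rather than its text, but that depends on Phoenix internals not discussed in this thread.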

> Add a subquery cache that persists beyond the life of a query
> -------------------------------------------------------------
>
>          Key: PHOENIX-4666
>          URL: https://issues.apache.org/jira/browse/PHOENIX-4666
>      Project: Phoenix
>   Issue Type: Improvement
>     Reporter: Marcell Ortutay
>     Assignee: Marcell Ortutay
>     Priority: Major
>
> The user list thread for additional context is here:
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> ----
> A Phoenix query may contain expensive subqueries, and moreover those
> expensive subqueries may be used across multiple different queries. While
> whole result caching is possible at the application level, it is not possible
> to cache subresults in the application. This can cause bad performance for
> queries in which the subquery is the most expensive part of the query, and
> the application is powerless to do anything at the query level. It would be
> good if Phoenix provided a way to cache subquery results, as it would provide
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10)
>     expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 =
>     {id}
> In this case, the subquery "expensive_result" is expensive to compute, but it
> doesn't change between queries. The rest of the query does because of the
> {id} parameter. This means the application can't cache it, but it would be
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data
> in this "cache" is not persisted across queries. It is deleted after a TTL
> expires (30sec by default), or when the query completes.
> This issue is fairly high priority for us at 23andMe and we'd be happy to
> provide a patch with some guidance from Phoenix maintainers. We are currently
> putting together a design document for a solution, and we'll post it to this
> Jira ticket for review in a few days.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)