Re: [DISCUSS] Explore Ways to Improve Query Execution Visibility

Kadir Ozdemir Wed, 11 Oct 2023 23:54:34 -0700

Istvan,

Server Paging <https://issues.apache.org/jira/browse/PHOENIX-6211> is for
preventing scans from overloading a cluster. It is designed to allow long
running queries to run without causing RPC timeout issues as well as
resource starvation. It implements end-to-end query pacing where server
side operations are broken into time-bound slices (i.e., pages) to
eliminate timeouts and to improve time sharing among queries, and overall
system availability. It forces the server to generate a result for each
page. For example, if the page size is 3 seconds, then the server paging
feature generates a result within 3 seconds. If the server is not ready to
generate a valid result, then a dummy result is generated. The server
buffers the generated results and then returns them to the client.
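
To make the page mechanism concrete, here is a minimal Java sketch of a
time-bound page. It is not the Phoenix implementation: the class, the
PageResult type, and the injected clock are hypothetical names chosen for
illustration, and rows are plain strings.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.function.LongSupplier;
import java.util.function.Predicate;

// Illustrative only: these names are hypothetical, not Phoenix classes.
class PagedScanSketch {
    /** Either a real row or a dummy placeholder produced when a page expires. */
    static final class PageResult {
        final boolean dummy;
        final String row;
        PageResult(boolean dummy, String row) { this.dummy = dummy; this.row = row; }
    }

    /**
     * Scans rows until one passes the filter or the page time budget is used up.
     * Returns a real result, a dummy result, or null when the scan is exhausted.
     */
    static PageResult nextPage(Iterator<String> rows, Predicate<String> filter,
                               long pageSizeMs, LongSupplier clockMs) {
        long deadline = clockMs.getAsLong() + pageSizeMs;
        while (rows.hasNext()) {
            String row = rows.next();
            if (filter.test(row)) {
                return new PageResult(false, row);   // valid result within the page
            }
            if (clockMs.getAsLong() >= deadline) {
                return new PageResult(true, null);   // page expired: emit a dummy
            }
        }
        return null;                                 // nothing left to scan
    }

    public static void main(String[] args) {
        Iterator<String> rows = Arrays.asList("a", "b", "match").iterator();
        PageResult r = nextPage(rows, "match"::equals, 3000L, System::currentTimeMillis);
        System.out.println(r.dummy ? "dummy" : r.row);
    }
}
```

The clock is passed in so that the page deadline can be exercised without
waiting for real time to pass.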


The HBase server decides to return buffered results to the client by
checking the number of buffered results, the total byte size of the
buffered results, and the RPC call time. All of these are configurable. By
setting the page size to a small value, say 3 seconds, and the other
parameters to reasonable values accordingly, say the number of results to
100 and the RPC timeout to 20 seconds, it is guaranteed that the server
generates an RPC response every 10 seconds (i.e., half of the RPC timeout)
or every 100 results.
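
The three checks described above can be sketched as a single predicate.
This is only an illustration under the numbers used in this message; the
method and parameter names are hypothetical and are not HBase configuration
keys or HBase code.

```java
// Illustrative only: a simplified stand-in for the server's decision logic.
class ResponseFlushSketch {
    /**
     * Returns true when the buffered results should be sent back to the client:
     * enough results, enough bytes, or half of the RPC timeout has elapsed.
     */
    static boolean shouldRespond(int bufferedResults, long bufferedBytes, long elapsedMs,
                                 int maxResults, long maxBytes, long rpcTimeoutMs) {
        return bufferedResults >= maxResults
            || bufferedBytes >= maxBytes
            || elapsedMs >= rpcTimeoutMs / 2;
    }

    public static void main(String[] args) {
        // 100 results, 2 MB, and a 20-second RPC timeout, as in the example above.
        int maxResults = 100;
        long maxBytes = 2L * 1024 * 1024;
        long rpcTimeoutMs = 20_000L;
        // Three 3-second pages have produced three (possibly dummy) results so far.
        System.out.println(shouldRespond(3, 1024L, 9_000L, maxResults, maxBytes, rpcTimeoutMs));
        // One more page pushes the call past half of the RPC timeout.
        System.out.println(shouldRespond(4, 1024L, 12_000L, maxResults, maxBytes, rpcTimeoutMs));
    }
}
```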

The Phoenix client (more specifically ScanningResultIterator) will drop the
dummy results and issue a new next() call. This means a new RPC request
will be queued for this scan. This will allow the other scans to make
progress as well. If a Phoenix query is timed out by the Phoenix client or
terminated by the application, ScanningResultIterator detects this when it
receives a dummy result from the server and closes the client-side result
scanner (PHOENIX-6918 <https://issues.apache.org/jira/browse/PHOENIX-6918>).
This allows us to terminate these types of queries on the server side in a
timely manner.
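
The client-side behavior can be sketched roughly as follows. This is not
the ScanningResultIterator code: the DUMMY marker, the method signature, and
the string rows are assumptions made only to show the control flow.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.function.BooleanSupplier;

// Illustrative only: a simplified stand-in for the client-side iterator.
class DummyAwareIteratorSketch {
    static final String DUMMY = "<dummy>";   // hypothetical dummy-result marker

    /**
     * Returns the next real row, or null when the scan is exhausted.
     * Dummy results are dropped; on each dummy the query state is checked,
     * and the scanner is closed if the query timed out or was cancelled.
     */
    static String next(Iterator<String> scanner, BooleanSupplier timedOutOrCancelled,
                       Runnable closeScanner) {
        while (scanner.hasNext()) {
            String result = scanner.next();   // stands in for an RPC-backed next() call
            if (!DUMMY.equals(result)) {
                return result;
            }
            if (timedOutOrCancelled.getAsBoolean()) {
                closeScanner.run();           // lets the server side stop the scan
                throw new IllegalStateException("query timed out or was cancelled");
            }
            // Dummy dropped: loop around and issue another next() call.
        }
        return null;
    }

    public static void main(String[] args) {
        Iterator<String> scanner = Arrays.asList(DUMMY, DUMMY, "row1").iterator();
        System.out.println(next(scanner, () -> false, () -> { }));
    }
}
```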

PHOENIX-7024 <https://issues.apache.org/jira/browse/PHOENIX-7024> hardened
Phoenix server paging recently. The remaining question is whether we would
still see the issue of "queries overloading the cluster" when the server
paging feature is configured properly. I am writing this to make sure that
any query execution management and monitoring work takes advantage of the
server paging feature.


Thanks,

Kadir

On Wed, Oct 11, 2023 at 10:05 PM Istvan Toth <st...@apache.org> wrote:

> Hi!
>
> We are currently looking into how we could improve query execution
> visibility.
> The primary problem we are trying to solve is identifying and
> possibly killing queries that overload the cluster.
>
> Migrating Phoenix to OpenTelemetry is a no-brainer, and I am working on
> that now.
>
> We are also exploring ways we could id scans and tie them back to
> individual queries that do not require tracing, along the lines of
> PHOENIX-5974 <https://issues.apache.org/jira/browse/PHOENIX-5974> and
> PHOENIX-7038 <https://issues.apache.org/jira/browse/PHOENIX-7038>.
>
> We are mostly in the brainstorming phase, but as this is a recurring
> concern, I think it's better to include the community early.
> While most of this has been discussed before, it may be useful to have a
> big picture of the issues and identify the best way to improve the
> situation, or find existing solutions that we overlooked.
>
> This is what we have at the moment:
> (Would you prefer that I share this as a google doc, or is it better to
> keep everything directly on the list ?)
> Identifying runaway Phoenix queries
>
> General remarks
>
>    - Queries only exist in the Phoenix thick client.
>    - The RegionServers only see scans.
>    - There is a JIRA <https://issues.apache.org/jira/browse/PHOENIX-5974>
>      for adding and propagating a queryId which can be used to correlate
>      scans to queries.
>       - There is a partial uncommitted PR for this
>       - This needs to be written/tested/fleshed out
>       - Would a separate connection ID be useful ?
>       - How do we set and/or expose the IDs via JDBC ?
>       - How do we propagate this via Avatica/PQS ?
>    - Phoenix cannot cancel running queries, the only solution is killing the
>      JVM (implement ?)
>    - Killing the JVM does not immediately cancel scans.
>       - Need to investigate if the current scan keepalive feature is
>         sufficient to kill scans belonging to killed queries / JVMs
>    - The existing Hbase request throttling feature, and setting a low pool
>      size can mitigate, but those also affect legitimate queries.
>
> Use case
>
>    - What kinds of queries are we addressing ?
>       - Ad-hoc queries issued via Hue, or a limited amount of workstations
>       - Queries issued by distributed applications or MR/Spark/Hive jobs
>
>
> Generally, ad-hoc queries are easier to monitor, as they are coming from a
> few clients.
>
> Identifying queries generating overload
>
>    - Hbase metrics
>       - only includes aggregate data on scans
>       - does not include per scan data
>    - Tracing
>       - Currently not working in Phoenix PHOENIX-5215
>         <https://issues.apache.org/jira/browse/PHOENIX-5215> (on roadmap)
>       - Needs trace collection infrastructure
>       - How does it scale ?
>    - Phoenix Query Logging into SYSTEM.LOG table
>       - Works
>       - Can identify slow queries
>       - Can sample
>       - Logs execution plan
>       - Does not log individual scans
>       - Has performance/cluster load cost
>    - Custom
>       - Expose data on running scans from RSs
>          - runtime
>          - memory ? Do we even have that info ?
>          - cpu ? Do we have that info ?
>          - client info
>             - hostname
>             - port
>             - ???
>          - Without PHOENIX-5974
>            <https://issues.apache.org/jira/browse/PHOENIX-5974> this cannot
>            easily be correlated to individual queries
>          - New RPC call ?
>       - Expose from hbase shell command ?
>       - Expose on RS Web UI ?
>       - Expose in hbase-top ?
>    - RS Log analysis ?
>
