Hi!

We are currently looking into how we could improve query execution
visibility.
The primary problem we are trying to solve is identifying and
possibly killing queries that overload the cluster.

Migrating Phoenix to OpenTelemetry is a no-brainer, and I am working on
that now.
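
For context, and purely as an illustration of the target API (none of the
span or attribute names below are decided, and this is not the actual
patch), per-query instrumentation with the OpenTelemetry Java API would
look roughly like this:

  import io.opentelemetry.api.GlobalOpenTelemetry;
  import io.opentelemetry.api.trace.Span;
  import io.opentelemetry.api.trace.Tracer;
  import io.opentelemetry.context.Scope;

  public class QueryTracingSketch {
      private static final Tracer TRACER =
          GlobalOpenTelemetry.getTracer("org.apache.phoenix");

      // Hypothetical wrapper; the span name and attribute are placeholders.
      static void executeTraced(String sql, Runnable execute) {
          Span span = TRACER.spanBuilder("phoenix.query").startSpan();
          span.setAttribute("db.statement", sql);
          try (Scope ignored = span.makeCurrent()) {
              execute.run();
          } finally {
              span.end();
          }
      }
  }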

We are also exploring ways to identify scans and tie them back to
individual queries without requiring tracing, along the lines of
PHOENIX-5974 <https://issues.apache.org/jira/browse/PHOENIX-5974> and
PHOENIX-7038 <https://issues.apache.org/jira/browse/PHOENIX-7038>.
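
To make that concrete, here is a rough sketch of the tagging mechanism
HBase already offers; the attribute name is made up for illustration, and
this is not what the PHOENIX-5974 PR actually does:

  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.util.Bytes;

  public class ScanTagging {
      // Made-up attribute name, purely for illustration.
      static final String QUERY_ID_ATTR = "phoenix.query.id";

      // Client side: tag every scan issued for a query with that query's id.
      static Scan tag(Scan scan, String queryId) {
          scan.setAttribute(QUERY_ID_ATTR, Bytes.toBytes(queryId));
          return scan;
      }

      // Server side: the attribute travels with the scan and can be read
      // back, e.g. from a coprocessor hook or when logging slow scanners.
      static String queryIdOf(Scan scan) {
          byte[] value = scan.getAttribute(QUERY_ID_ATTR);
          return value == null ? null : Bytes.toString(value);
      }
  }

The open questions are less about the mechanism and more about where the
ID is generated and how it is propagated (JDBC, Avatica/PQS, see below).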

We are mostly in the brainstorming phase, but as this is a recurring
concern, I think it's better to involve the community early.
While most of this has been discussed before, it may be useful to have a
big-picture view of the issues and identify the best way to improve the
situation, or to find existing solutions that we have overlooked.

This is what we have at the moment:
(Would you prefer that I share this as a Google Doc, or is it better to
keep everything directly on the list ?)
Identifying runaway Phoenix queries

General remarks

   - Queries only exist in the Phoenix thick client.
   - The RegionServers only see scans.
   - There is a JIRA <https://issues.apache.org/jira/browse/PHOENIX-5974>
     for adding and propagating a queryId which can be used to correlate
     scans to queries.
     - There is a partial uncommitted PR for this.
     - This needs to be written/tested/fleshed out.
     - Would a separate connection ID be useful ?
     - How do we set and/or expose the IDs via JDBC ?
     - How do we propagate this via Avatica/PQS ?
   - Phoenix cannot cancel running queries; the only option today is
     killing the client JVM. (Should we implement cancellation ?)
   - Killing the JVM does not immediately cancel scans.
     - We need to investigate whether the current scan keepalive feature
       is sufficient to kill scans belonging to killed queries / JVMs.
   - The existing HBase request throttling feature and setting a low pool
     size can mitigate the overload, but both also affect legitimate
     queries. (A quota sketch follows right after this list.)
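
For the throttling mitigation above, a minimal sketch of what a per-user
read quota looks like with the HBase Admin API (the user name and limit
are made-up examples, and quotas must be enabled on the cluster):

  import java.util.concurrent.TimeUnit;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Admin;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;
  import org.apache.hadoop.hbase.quotas.QuotaSettingsFactory;
  import org.apache.hadoop.hbase.quotas.ThrottleType;

  public class ThrottleSketch {
      public static void main(String[] args) throws Exception {
          try (Connection conn =
                   ConnectionFactory.createConnection(HBaseConfiguration.create());
               Admin admin = conn.getAdmin()) {
              // Cap read requests for the (hypothetical) user issuing the
              // runaway queries; requires hbase.quota.enabled=true.
              admin.setQuota(QuotaSettingsFactory.throttleUser(
                  "adhoc_user", ThrottleType.READ_NUMBER, 100, TimeUnit.SECONDS));
          }
      }
  }

The hbase shell set_quota command does the same thing; the limitation is
exactly the one above: the throttle applies per user/table/namespace, not
per query.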

Use case

   - What kinds of queries are we addressing ?
     - Ad-hoc queries issued via Hue or from a limited number of
       workstations
     - Queries issued by distributed applications or MR/Spark/Hive jobs


Generally, ad-hoc queries are easier to monitor, as they come from only a
few clients.

Identifying queries generating overload

   - HBase metrics
     - Only include aggregate data on scans.
     - Do not include per-scan data.
   - Tracing
     - Currently not working in Phoenix, see PHOENIX-5215
       <https://issues.apache.org/jira/browse/PHOENIX-5215> (on the
       roadmap).
     - Needs trace collection infrastructure.
     - How does it scale ?
   - Phoenix query logging into the SYSTEM.LOG table (a client-side
     sketch of enabling it follows this list)
     - Works.
     - Can identify slow queries.
     - Can sample.
     - Logs the execution plan.
     - Does not log individual scans.
     - Has a performance/cluster load cost.
   - Custom
     - Expose data on running scans from the RSs (a very rough sketch of
       such a hook is at the end of this mail)
       - runtime
       - memory ? Do we even have that info ?
       - cpu ? Do we have that info ?
       - client info
         - hostname
         - port
         - ???
       - Without PHOENIX-5974
         <https://issues.apache.org/jira/browse/PHOENIX-5974> this cannot
         easily be correlated to individual queries.
       - New RPC call ?
     - Expose via an hbase shell command ?
     - Expose on the RS Web UI ?
     - Expose in hbase-top ?
   - RS log analysis ?
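
On the query logging option, a sketch of how it can be enabled from the
client side; the property names are from memory and should be
double-checked against the docs:

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.util.Properties;

  public class QueryLogSketch {
      public static void main(String[] args) throws Exception {
          Properties props = new Properties();
          // Assumed property names; please verify.
          props.setProperty("phoenix.log.level", "DEBUG");     // OFF / INFO / DEBUG / TRACE
          props.setProperty("phoenix.log.samplerate", "0.05"); // sample ~5% of queries
          // "zk1" is a placeholder ZooKeeper quorum.
          try (Connection conn =
                   DriverManager.getConnection("jdbc:phoenix:zk1", props)) {
              // Queries run on this connection are sampled into SYSTEM.LOG,
              // including the statement text and the explain plan.
          }
      }
  }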

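For the "Custom" direction, a very rough sketch of the kind of
RegionServer-side hook I have in mind, against the HBase 2.x coprocessor
API; everything here (class name, the query-id attribute, logging instead
of a real registry/RPC/UI) is made up for illustration:

  import java.io.IOException;
  import java.util.Optional;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.coprocessor.ObserverContext;
  import org.apache.hadoop.hbase.coprocessor.RegionCoprocessor;
  import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
  import org.apache.hadoop.hbase.coprocessor.RegionObserver;
  import org.apache.hadoop.hbase.ipc.RpcServer;
  import org.apache.hadoop.hbase.util.Bytes;
  import org.slf4j.Logger;
  import org.slf4j.LoggerFactory;

  // Hypothetical observer that records who opened which scan. A real
  // implementation would keep this in a registry and expose it (new RPC,
  // RS web UI, metrics, ...) instead of just logging it.
  public class ScanInfoObserver implements RegionCoprocessor, RegionObserver {
      private static final Logger LOG =
          LoggerFactory.getLogger(ScanInfoObserver.class);

      @Override
      public Optional<RegionObserver> getRegionObserver() {
          return Optional.of(this);
      }

      @Override
      public void preScannerOpen(ObserverContext<RegionCoprocessorEnvironment> ctx,
                                 Scan scan) throws IOException {
          // Client address of the RPC that opened the scanner.
          String client = RpcServer.getRemoteAddress()
              .map(Object::toString).orElse("unknown");
          // Made-up attribute name; this is where a queryId propagated as
          // in PHOENIX-5974 would be picked up.
          byte[] queryId = scan.getAttribute("phoenix.query.id");
          LOG.info("scan opened by {} queryId={} region={}",
              client,
              queryId == null ? "n/a" : Bytes.toString(queryId),
              ctx.getEnvironment().getRegion().getRegionInfo()
                 .getRegionNameAsString());
      }
  }

This only covers the "who/what" part; runtime would need a matching
postScannerClose hook, and per-scan memory/cpu is an open question as
noted above.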