A couple more points which I think you alluded to, Josh, but I would still like to call out:

1) Writing these query logs to a Phoenix table should be best-effort, i.e., a query definitely shouldn't fail because we encountered an issue while writing its log.

2) Writing query logs should happen asynchronously to the flow of the query, i.e., a query shouldn't incur the cost of the write to the query log table.

Doing 2) will help out with 1); see the sketch below.
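A rough, purely illustrative sketch of what 1) and 2) could look like together; the class and method names here are invented for the example, not actual Phoenix APIs. A bounded executor with a discard policy means the query path never blocks on, and never fails because of, the log write:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    // Hypothetical sketch -- not an actual Phoenix class.
    public class QueryLogWriter {

      // Bounded queue plus DiscardPolicy: if logging falls behind, new
      // entries are silently dropped rather than blocking the query path.
      private final ExecutorService executor = new ThreadPoolExecutor(
          1, 1, 0L, TimeUnit.MILLISECONDS,
          new LinkedBlockingQueue<Runnable>(1000),
          new ThreadPoolExecutor.DiscardPolicy());

      /** Asynchronous and best-effort: never throws back to the caller. */
      public void logQuery(String sql, String user, long startTimeMs) {
        try {
          executor.submit(() -> {
            try {
              writeToQueryLogTable(sql, user, startTimeMs);
            } catch (Exception e) {
              // Point 1: a failure while writing the log must never
              // fail the query itself, so swallow it here.
            }
          });
        } catch (Exception e) {
          // Even a rejected submission must not propagate to the query.
        }
      }

      private void writeToQueryLogTable(String sql, String user, long startTimeMs) {
        // The UPSERT into the (hypothetical) query log table would go here.
      }
    }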
On Fri, Mar 2, 2018 at 2:28 PM, Josh Elser <[email protected]> wrote:

> Thanks Nick and Andrew! These are great points.
>
> * A TTL out of the box is a must. That's such a good suggestion.
> * Sensitivity of the data being stored is also a tricky, serious issue to
> consider. We'll want to lock the table down and be able to state very
> clearly what data may show up in it.
> * I like the "levels" of detail that will be persisted. It will help break
> up the development work (e.g. the first impl can just be the INFO details),
> and prevents concerns about runtime impact.
> * Sampling is a no-brainer for "always-on" situations. I like that too.
>
> I'll work on taking these (and others) and updating the gdoc tonight.
> Thanks again for your feedback!
>
> On 3/2/18 1:50 PM, Andrew Purtell wrote:
>
>> Agree with Nick's points, but let me augment with an additional
>> suggestion: a tunable/configurable threshold for sampling. In many cases
>> it's sufficient to sample e.g. 1% of queries to get sufficient coverage,
>> and this would prune 99% of the actual load from the query log.
>>
>> Also, let me underline that compliance requirements will demand either
>> very strong controls on the query log if everything is always logged (in
>> which case it is important that it works well with access control
>> features to lock it down), or, better, what Nick suggests: the ability to
>> turn off things like logging the values supplied for bound parameters.
>>
>> On Fri, Mar 2, 2018 at 8:41 AM, Nick Dimiduk <[email protected]> wrote:
>>
>>> I'm a big fan of this idea. There was a brief discussion on the topic
>>> over on PHOENIX-2715.
>>>
>>> My first concern is that the collected information is huge -- easily far
>>> larger than the user data for a busy cluster. For instance: a couple
>>> tens of GB of stored user data, guideposts set to the default 100 MB,
>>> salting enabled on a table with an "innocent" value of 10 or 20, and the
>>> collection of RPCs can easily grow into the hundreds for simple queries.
>>> Even if you catalog just the "logical" RPCs -- the HBase Client API
>>> calls that Phoenix plans rather than the underlying HBase Client RPCs --
>>> this will be quite large. The guideposts themselves for such a table
>>> would be on the order of 30 MB.
>>>
>>> My next concern is about the sensitive query parameters being stored.
>>> It's entirely reasonable to expect a table to store sensitive
>>> information that should not be exposed to operations.
>>>
>>> Thus, my suggestions:
>>> * Minimize the unbounded nature of this table by truncating all columns
>>> to some max length -- perhaps 5k or 10k.
>>> * Enable a default TTL on the schema. 7 days seems like a good starting
>>> point.
>>> * Consider controlling which columns are populated via some operational
>>> mechanism. Use Logger level as an example, with INFO the default
>>> setting. Which data is stored at this level? Then at DEBUG, then TRACE.
>>> Maybe timestamp, SQL, and explain plan are at INFO. DEBUG adds bound
>>> parameters and scan metrics. TRACE adds RPCs and timing, snapshot
>>> metadata.
>>>
>>> Thanks,
>>> Nick
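For illustration, a rough sketch of how a sampling threshold and logger-level gating might compose on the client side; QueryLogLevel, QueryLogPolicy, and the property semantics are invented for the example, not actual Phoenix APIs. Note that keeping bound parameters out of the log below DEBUG also speaks to the compliance concern above:

    import java.util.concurrent.ThreadLocalRandom;

    // Hypothetical levels, mirroring the INFO/DEBUG/TRACE suggestion.
    enum QueryLogLevel { OFF, INFO, DEBUG, TRACE }

    class QueryLogPolicy {
      // Hypothetical client-side config knobs, e.g.
      // "phoenix.querylog.level" and "phoenix.querylog.sample.rate".
      private final QueryLogLevel level;
      private final double sampleRate; // e.g. 0.01 => log ~1% of queries

      QueryLogPolicy(QueryLogLevel level, double sampleRate) {
        this.level = level;
        this.sampleRate = sampleRate;
      }

      /** Sampling: decide per query whether to log it at all. */
      boolean shouldLog() {
        return level != QueryLogLevel.OFF
            && ThreadLocalRandom.current().nextDouble() < sampleRate;
      }

      /** Level gating: INFO stores timestamp, SQL, and explain plan;
       *  DEBUG adds bound parameters and scan metrics; TRACE adds
       *  RPCs, timing, and snapshot metadata. */
      boolean includeBoundParameters() {
        return level.compareTo(QueryLogLevel.DEBUG) >= 0;
      }

      boolean includeRpcDetails() {
        return level == QueryLogLevel.TRACE;
      }

      /** Truncation: cap every stored column at some max length. */
      static String truncate(String value, int maxChars) {
        return value == null || value.length() <= maxChars
            ? value
            : value.substring(0, maxChars);
      }
    }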
>>> On Mon, Feb 26, 2018 at 1:57 PM, Josh Elser <[email protected]> wrote:
>>>
>>>> Hiya,
>>>>
>>>> I wanted to share this little design doc with you about some feature
>>>> work we've been thinking about. The following is a Google doc in which
>>>> anyone should be allowed to comment. Feel free to comment there, or
>>>> here on the thread.
>>>>
>>>> https://s.apache.org/phoenix-query-log
>>>>
>>>> The high-level goal is to create a construct in which Phoenix clients
>>>> will automatically serialize information about the queries they run to
>>>> a table for retrospective analysis. Ideally, this information would be
>>>> stored in a Phoenix table. We want this data to help answer questions
>>>> like:
>>>>
>>>> * What queries are running against my system?
>>>> * What specific queries started between 5:35 AM and 6:20 AM two days ago?
>>>> * What queries is user "bob" running?
>>>> * Are my users' queries effectively using the indexes in the system?
>>>>
>>>> Anti-goals include:
>>>>
>>>> * Cluster impact (computation/memory usage) of a query
>>>> * Slowing query performance to ensure all data is serialized
>>>> * A third-party service dedicated to ensuring query info is serialized
>>>> (in the event of client failure)
>>>>
>>>> Please take a look at the document and let us know what you think. I'm
>>>> happy to try to explain this in greater detail.
>>>>
>>>> - Josh (on behalf of myself, Ankit, Rajeshbabu, and Sergey)
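One attraction of landing the log in a Phoenix table is that the questions above become plain SQL over JDBC. A hypothetical sketch, assuming a SYSTEM.QUERY_LOG table whose name and columns are invented here (only the JDBC usage and the Phoenix URL format are standard):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Timestamp;

    public class QueryLogExample {
      public static void main(String[] args) throws Exception {
        // Standard Phoenix JDBC URL; adjust for your ZooKeeper quorum.
        try (Connection conn =
            DriverManager.getConnection("jdbc:phoenix:localhost:2181")) {

          // "What queries did user 'bob' start between 5:35 AM and
          // 6:20 AM two days ago?" (table and columns are hypothetical)
          String sql =
              "SELECT START_TIME, SQL_TEXT "
            + "FROM SYSTEM.QUERY_LOG "
            + "WHERE USER_NAME = ? AND START_TIME BETWEEN ? AND ? "
            + "ORDER BY START_TIME";
          try (PreparedStatement stmt = conn.prepareStatement(sql)) {
            stmt.setString(1, "bob");
            stmt.setTimestamp(2, Timestamp.valueOf("2018-02-28 05:35:00"));
            stmt.setTimestamp(3, Timestamp.valueOf("2018-02-28 06:20:00"));
            try (ResultSet rs = stmt.executeQuery()) {
              while (rs.next()) {
                System.out.println(rs.getTimestamp(1) + "  " + rs.getString(2));
              }
            }
          }
        }
      }
    }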
