Regarding the SCOPE paper you reference. That was on my mind too (I went to the 
talk at SIGMOD). A materialized view is created only if the same query is used 
*textually identically* in different parts of the ETL process, so it is mainly 
for optimizing batch jobs that are largely the same night after night. Lattices 
are a better approach for optimizing interactive BI workloads.

Julian


> On Aug 6, 2018, at 4:57 PM, Jesus Camacho Rodriguez 
> <jcamachorodrig...@hortonworks.com> wrote:
> 
> You can find an overview of the work that has been done in Hive for
> materialized view integration at the following link:
> https://cwiki.apache.org/confluence/display/Hive/Materialized+views
> Materialized views can also be stored in external tables, such as Druid-backed
> tables. The Druid rules in Calcite are then used to push computation from Hive
> to Druid.
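> 
> For illustration, a Druid-backed materialized view is declared with the Druid
> storage handler, along these lines (a sketch; the sales table and its columns
> are hypothetical):
> 
>     CREATE MATERIALIZED VIEW mv_sales_daily
>     STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
>     AS
>     SELECT CAST(sale_ts AS timestamp) AS `__time`,  -- Druid tables need a __time column
>            store_id,
>            SUM(amount) AS total_amount
>     FROM sales
>     GROUP BY CAST(sale_ts AS timestamp), store_id;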
> 
> The rewriting algorithm itself is in Calcite. The algorithm can take advantage
> of constraints (PK-FK relationships between tables) to produce additional
> correct rewritings, can execute rollups, etc. However, it does not assume any
> specific schema layout, which may make it useful for multiple ETL workloads.
> http://calcite.apache.org/docs/materialized_views#rewriting-using-plan-structural-information
> The most recent addition is support for partitioned materialized views,
> including an extension of the cost model to take partition pruning into
> account during the planning phase.
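> 
> To give a flavor of the rollup rewritings (with a hypothetical schema): given
> a materialized view
> 
>     CREATE MATERIALIZED VIEW mv_sales_by_day AS
>     SELECT store_id, sale_date, SUM(amount) AS s, COUNT(*) AS c
>     FROM sales
>     GROUP BY store_id, sale_date;
> 
> a query at a coarser grain, e.g., SELECT store_id, SUM(amount) FROM sales
> GROUP BY store_id, can be answered by rolling up the materialization:
> 
>     SELECT store_id, SUM(s)
>     FROM mv_sales_by_day
>     GROUP BY store_id;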
> 
> Incremental maintenance is supported. Most of that code lives in Hive, but it
> relies on the rewriting algorithm too. It only works for materialized views that
> use Hive transactional tables, either full ACID or insert-only. Basically, Hive
> explicitly exposes the data contained in the materialization via a filter
> condition, e.g., mv1 contains data for transactions (x, y, z), and then lets the
> rewriting algorithm trigger a partial rewriting that reads new contents from the
> source tables and already-processed contents from mv1.
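> 
> Such a partial rewriting has roughly this shape (a sketch; Hive expresses the
> transaction filter internally on its write IDs, shown here as a hypothetical
> txn_id column, and mv1 is assumed to be a sum over store_id):
> 
>     SELECT store_id, SUM(s) AS s
>     FROM (
>       SELECT store_id, s FROM mv1           -- contents already processed
>       UNION ALL
>       SELECT store_id, SUM(amount) AS s     -- contents added after (x, y, z)
>       FROM sales
>       WHERE txn_id > z
>       GROUP BY store_id
>     ) t
>     GROUP BY store_id;
> 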
> Finally, an additional step transforms the rewritten expression into an INSERT
> or MERGE statement, depending on the materialized view expression (MERGE for
> materialized views containing aggregations). Since not all tables in Hive
> support the UPDATE needed by MERGE, we were thinking about allowing some target
> materialized views whose definitions include aggregates to use INSERT and then
> force the rollup at runtime, e.g., for Druid.
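> 
> For an aggregate materialized view like mv1 above, the generated statement is
> roughly of this shape (again with the hypothetical txn_id standing in for the
> transaction filter, and assuming mv1 also stores a count column c):
> 
>     MERGE INTO mv1 AS dst
>     USING (
>       SELECT store_id, SUM(amount) AS s, COUNT(*) AS c
>       FROM sales
>       WHERE txn_id > z
>       GROUP BY store_id
>     ) AS src
>     ON dst.store_id = src.store_id
>     WHEN MATCHED THEN
>       UPDATE SET s = dst.s + src.s, c = dst.c + src.c
>     WHEN NOT MATCHED THEN
>       INSERT VALUES (src.store_id, src.s, src.c);
> 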
> > Maybe it depends on the aggregation functions that are used?
> 
> The result of some aggregate functions, e.g., min and max, cannot always be
> incrementally maintained in the presence of UPDATE/DELETE operations on the
> source tables, though some rewritings that minimize full rebuilds can be used
> if count is added as an additional column to the materialized view. Incremental
> maintenance in the presence of UPDATE/DELETE operations on source tables is not
> supported in Hive yet, hence this is not implemented.
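> 
> A sketch of such a definition (hypothetical schema): the extra count column
> lets the maintenance logic detect which groups are actually affected by a
> DELETE and recompute only those from the source table, instead of rebuilding
> the whole materialization:
> 
>     CREATE MATERIALIZED VIEW mv_minmax AS
>     SELECT store_id, MIN(amount) AS mn, MAX(amount) AS mx, COUNT(*) AS c
>     FROM sales
>     GROUP BY store_id;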
> 
> 
> Of the problems described below, I would like to think that we are getting to
> the 'more interesting stuff' in the Hive project, though some consolidation of
> existing work is needed too. That is why we are also interested in any effort
> related to materialization recommendation. I believe the most powerful
> abstraction to use would be RelNode, which can be useful for any system that
> represents its queries internally using that representation, instead of relying
> on SQL nodes, which are more closely tied to the parser.
> 
> Concerning the 'feedback loop', this recent paper by MSFT describes a system
> that does something similar to what James was describing (for SCOPE):
> https://www.microsoft.com/en-us/research/uploads/prod/2018/03/cloudviews-sigmod2018.pdf
> 
> -Jesús
> 
> 
> 
> On 8/6/18, 3:32 PM, "Julian Hyde" <jh...@apache.org> wrote:
> 
>    It’s hard to automatically recommend a set of MVs from past queries. The 
> design space is just too large. But if you are designing MVs for interactive 
> BI, you can use the “lattice” model. This works because many queries will be 
> filter-join-aggregate queries on a star schema (i.e. a central fact table and 
> dimension tables joined over many-to-one relationships). (Or perhaps a join 
> between two or more such queries.) 
> 
>    Do the queries you are trying to optimize have that pattern?
> 
>    If so, you might start by creating a lattice for each such star schema. 
> Then the lattice can suggest MVs that are summary tables.
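> 
>    For example, for a hypothetical sales star schema, the lattice would be
>    declared over the join of the fact table to its dimensions (in a Calcite
>    JSON model, this query goes in the lattice’s “sql” attribute):
> 
>        SELECT *
>        FROM sales AS s
>        JOIN product AS p ON s.product_id = p.product_id
>        JOIN store AS st ON s.store_id = st.store_id
> 
>    and one summary table it might suggest is
> 
>        SELECT p.category, st.region, SUM(s.amount) AS m, COUNT(*) AS c
>        FROM sales AS s
>        JOIN product AS p ON s.product_id = p.product_id
>        JOIN store AS st ON s.store_id = st.store_id
>        GROUP BY p.category, st.region;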
> 
>    (Lattice suggester is one step more meta - it recommends lattices - but 
> given where you are, I would suggest hand-writing one or two lattices.)
> 
>    Calcite is a framework, and this unfortunately means that you have to 
> write Java code to use these features. It might be easier if you use the new 
> “server” module, which supports CREATE MATERIALIZED VIEW as a DDL statement. 
> Then you can create some demos for your colleagues that are wholly or mostly 
> SQL.
> 
>    The simplest way to populate a materialized view is the CREATE 
> MATERIALIZED VIEW statement. It basically does the same as CREATE TABLE AS 
> SELECT (executes a query, stores the results in a table) but it leaves behind 
> the metadata about where that data came from.
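> 
>    A minimal, mostly-SQL demo might look like this (hypothetical table):
> 
>        CREATE MATERIALIZED VIEW sales_by_store AS
>        SELECT store_id, SUM(amount) AS total_amount
>        FROM sales
>        GROUP BY store_id;
> 
>    After that, a query such as “SELECT store_id, SUM(amount) FROM sales GROUP
>    BY store_id” can be answered from sales_by_store without scanning sales.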
> 
>    Materialized views can in principle be maintained incrementally, but how 
> you do it depends upon what changes are allowed (append only? Replace rows 
> and write the old rows to an audit table?). We’ve not done a lot of work on 
> it. I believe the Hive folks have given this more thought than I have.
> 
>    Julian
> 
> 
>> On Aug 3, 2018, at 11:11 PM, James Taylor <jamestay...@apache.org> wrote:
>> 
>> Both the Lattice Suggester and Quark sound like what I need for an
>> automated solution, but I have some more basic follow-up questions first.
>> Here's our basic use case (very similar to Zheng Shao's, I believe):
>> - Our company has stood up Presto for data analysts
>> - Nightly ETL jobs populate Hive tables with lots of data
>> - Analysts run adhoc queries over data using Presto
>> - The top CPU-consuming queries are pretty complex (2-3 pages of complex SQL,
>> lots of joins and aggregation)
>> 
>> There is some basic/obvious stuff that can be done manually first:
>> - Provide better visibility into which queries are expensive
>> - Ask query owners to produce their own materialized views and manually
>> change their queries to use them (I believe there's some amount of this
>> already)
>> 
>> Then there's kind of a middle ground:
>> - Ask query owners to identify what they think are the top few materialized
>> views to build
>> - Manually build these materialized views in the daily ETL job.
>> - Use Calcite to rewrite the query to use the materialized views. Can
>> Calcite do this and would it be a problem if the queries are Presto
>> queries? I'd need to make sure I provided Calcite with the cost information
>> it needs, right?
>> - Dark launch to test that the rewritten query returns the same results as
>> the original query (and measure the perf improvement)
>> 
>> But the more interesting stuff is:
>> - Automatically identifying the materialized views that should be built.
>> Sounds like both the Lattice Suggester and Quark are potentially a good
>> fit. I'm not clear on what is output by the Suggester. Would it spit out a
>> CREATE VIEW statement (or could what it outputs produce that)? How does the
>> Suggester compare with Quark?
>> - Automatically build the materialized views. Would the Lattice framework
>> or Quark help me with that? Would it be possible to incrementally build the
>> materialized views or would it be necessary to build the materialized views
>> from the beginning of time again and again (clearly not feasible given the
>> size of the tables)? Maybe it depends on the aggregation functions that are
>> used?
>> - And the nirvana is a kind of feedback loop - based on the top N expensive
>> queries, identify and build the materialized views, use them transparently
>> during querying, and then retire them when they're infrequently used.
>> 
>> Would it be a better choice to build the materialized views as Druid
>> tables? That'd require a Druid connector to Presto, though. This reminds me
>> of the work you already did, Julian, with Hive+Druid (i.e. CALCITE-1731)
>> but for Presto instead of Hive. Do you think any of that would transfer
>> over in some way?
>> 
>> WDYT? Huge amount of work? Any advice is much appreciated.
>> 
>> Thanks,
>> James
>> 
>> On Thu, Jul 26, 2018 at 11:29 AM, Julian Hyde <jh...@apache.org> wrote:
>> 
>>> PS
>>> 
>>> +1 for Babel.
>>> 
>>> If you are analyzing a set of queries, it is very likely that these
>>> queries were written to be executed against another database. Babel aims to
>>> take such queries and convert them into Calcite relational algebra. The
>>> process might occasionally be lossy, if Calcite's algebra does not support
>>> a feature, but the algebra is still useful.
>>> 
>>>> On Jul 26, 2018, at 11:25 AM, Julian Hyde <jh...@apache.org> wrote:
>>>> 
>>>> There are many possible analyzers, but Lattice Suggester is one that I
>>>> am working on and is relatively mature. It looks at lots of queries and
>>>> builds lattices (star schemas with measures) from those queries. It finds
>>>> commonality by “growing” lattices - adding measures, adding derived
>>>> expressions, adding many-to-one joins. Lattice Suggester takes a set of SQL
>>>> query strings, then it parses them (to SqlNode), validates, and converts to
>>>> relational algebra (RelNode). Then it looks for patterns in the relational
>>>> algebra. Working at the algebra level as opposed to the SQL parse tree is a
>>>> net benefit, but some things (e.g. figuring out the original column alias
>>>> for an expression) are a bit more difficult.
>>>> 
>>>> Lattice Suggester is under development in my
>>>> https://github.com/julianhyde/calcite/tree/1870-lattice-suggester
>>>> branch. The tests pass, and I hope to have it merged into master in the
>>>> next month or two.
>>>> 
>>>> I agree with Devjyoti that it’s difficult to gather together all
>>>> possible analyses in one tool. Lattice Suggester is a good base for
>>>> analyses that model queries as a filter/project/aggregate of a pre-joined
>>>> star schema — a very common model in BI; for example, it could model which
>>>> combinations of columns are commonly used as filters.
>>>> 
>>>> For analyses that are not tied to star schemas, feel free to create new
>>>> tools. The tools would benefit from collaborative development, and I think
>>>> that Calcite would be a good home for them.
>>>> 
>>>> Julian
>>>> 
>>>> 
>>>> 
>>>> 
>>>>> On Jul 25, 2018, at 10:28 PM, Devjyoti Patra <devjyo...@qubole.com> wrote:
>>>>> 
>>>>> Hi Zheng,
>>>>> 
>>>>> At Qubole, we are building something very similar to what you are looking
>>>>> for. And from experience, I can tell you that it is a lot easier to build
>>>>> than one may think.
>>>>> We use the Calcite parser to parse the SQL into SqlNode and then use
>>>>> different tree visitors to extract query attributes like tables, filter
>>>>> columns, joins, subqueries, etc.
>>>>> 
>>>>> Our approach is very similar to Uber's QueryParser project
>>>>> (https://github.com/uber/queryparser), but we go deeper in our analysis,
>>>>> finding queries that are semantically similar to some canonicalized form.
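>>>>> 
>>>>> For example, these two queries are textually different but reduce to the
>>>>> same canonical form (a made-up illustration):
>>>>> 
>>>>>   SELECT a.id FROM accounts a WHERE a.status = 'OPEN' AND a.balance > 0;
>>>>>   SELECT id FROM accounts WHERE balance > 0 AND status = 'OPEN';
>>>>> 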
>>>>> If you intend to begin from scratch, I can give you some pointers to get
>>>>> started.
>>>>> 
>>>>> Thanks,
>>>>> Devjyoti
>>>>> 
>>>>> 
>>>>> On Thu, Jul 26, 2018 at 9:37 AM, Zheng Shao <zsh...@gmail.com> wrote:
>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> We are thinking about starting a project to analyze a huge number of SQL
>>>>>> queries (think millions) to identify common patterns:
>>>>>> * Common sub queries
>>>>>> * Common filtering conditions (columns) for a table
>>>>>> * Common join keys for table pairs
>>>>>> 
>>>>>> Are there any existing projects in that direction using Calcite?  Would
>>>>>> love to leverage one instead of building from scratch.
>>>>>> 
>>>>>> Zheng
