> We have many HQL scripts that select from nested sub-selects. In many
>cases the nesting can be a few levels deep:
...
> Such queries are difficult to test as they are complex monoliths. While
>we can decompose and modularise them at the column level with UDFs and
>macros, it is not as evident to me how best to break up the nested
>queries into separate components. I consider the options to be:
...
> * What is considered the best practice for modularising this type of
>query?
> * Is there a penalty for using views over a fully inlines query?
> * Are there any other options that I haven't considered?

I prefer using the CTE expressions, which are a SQL standard way to do
this.

The approaches really depend on your hive version & hadoop version.

Just like views, CTEs aren't materialized, but are rolled into the query
and duplicated for each invocation which might not be ideal for you.

But that is no different from repeating the sub-query manually and is much
cleaner.

So for a CTE referred to in many queries, I prefer prefixing my query
fragments with

set hive.exec.temporary.table.storage=memory;


create temporary table if not exists q_monthly_aggregate_2014 ...

when running a large number of them in the same session, that works much
better (ughly cubing).

The CTEs and views are duplicated wherever they are referred, because that
lets you
column prune or push in table filters into the ORC layers. A 10 column
view where you read 1 column out
will only read that one column in the stage reading it.

But the hive CBO team is working on fixing the general case of that, so
that repetitions can identified by filters & spooled (via Apache Calcite) -
https://issues.apache.org/jira/browse/CALCITE-481

There's also a patch from Navis recently to work around this temporarily -
https://issues.apache.org/jira/browse/HIVE-11752

Cheers,
Gopal


Reply via email to