> We have many HQL scripts that select from nested sub-selects. In many >cases the nesting can be a few levels deep: ... > Such queries are difficult to test as they are complex monoliths. While >we can decompose and modularise them at the column level with UDFs and >macros, it is not as evident to me how best to break up the nested >queries into separate components. I consider the options to be: ... > * What is considered the best practice for modularising this type of >query? > * Is there a penalty for using views over a fully inlines query? > * Are there any other options that I haven't considered?
I prefer using the CTE expressions, which are a SQL standard way to do this. The approaches really depend on your hive version & hadoop version. Just like views, CTEs aren't materialized, but are rolled into the query and duplicated for each invocation which might not be ideal for you. But that is no different from repeating the sub-query manually and is much cleaner. So for a CTE referred to in many queries, I prefer prefixing my query fragments with set hive.exec.temporary.table.storage=memory; create temporary table if not exists q_monthly_aggregate_2014 ... when running a large number of them in the same session, that works much better (ughly cubing). The CTEs and views are duplicated wherever they are referred, because that lets you column prune or push in table filters into the ORC layers. A 10 column view where you read 1 column out will only read that one column in the stage reading it. But the hive CBO team is working on fixing the general case of that, so that repetitions can identified by filters & spooled (via Apache Calcite) - https://issues.apache.org/jira/browse/CALCITE-481 There's also a patch from Navis recently to work around this temporarily - https://issues.apache.org/jira/browse/HIVE-11752 Cheers, Gopal