Hello,

We have many HQL scripts that select from nested sub-selects. In many cases
the nesting can be a few levels deep:

SELECT ... FROM (
    SELECT ... FROM (
        SELECT ... FROM (
            SELECT ... FROM a WHERE ...
        ) A LEFT JOIN (
            SELECT ... FROM b
        ) B ON (...)
    ) ab FULL OUTER JOIN (
        SELECT ... FROM c WHERE ...
  ) C ON (...)
) abc LEFT JOIN (
  SELECT ... FROM d WHERE ...
) D ON (...)
GROUP BY ...;


Such queries are difficult to test as they are complex monoliths. While we
can decompose and modularise them at the column level with UDFs and macros,
it is not as evident to me how best to break up the nested queries into
separate components. I consider the options to be:

1. Make each statement a top level query that has an associated table

CREATE TABLE Y AS
SELECT ... FROM X ...

CREATE TABLE Z AS
SELECT ... FROM Y ...


The problems with this approach as I see it are that:

   - The resultant process will generate far more HDFS I/O as each query
   will write a table which is then read by the parent query.
   - As each query results in an isolated job, Hive will not be able to
   optimise 'across queries'. I do not know how Hive's optimiser functions so
   this is no more than a naïve assumption.

2. Encapsulate each query as a view. In this way the statements are defined
in isolation and merely describe a processing step rather than the
processed data itself.

CREATE VIEW Y AS
SELECT ... FROM X ...

CREATE VIEW Z AS
SELECT ... FROM Y ...


I'd expect that a given query would result in the same execution plan
regardless of whether it was delivered as a single inlined statement or as
a composition of views. However, for some time I've heard that views
perform poorly and should be avoided.

Clearly I should try out a few queries concretely, but with respect to this
issue can someone illuminate me on:

   - What is considered the best practice for modularising this type of
   query?
   - Is there a penalty for using views over a fully inlines query?
   - Are there any other options that I haven't considered?

Many thanks - Elliot.

Reply via email to