Re: Maintaining big and complex Hive queries

2016-12-21 Thread Edward Capriolo
I have been contemplating attaching meta data for the query lineage to each
table such that I can know where the data came from and have a 1 click
regenerate button.

On Wed, Dec 21, 2016 at 3:02 PM, Stephen Sprague  wrote:

> my 2 cents. :)
>
> as soon as you say "complex query" i would submit you've lost the
> upperhand and you're behind the eight-ball right off the bat.  And you know
> this too otherwise you wouldn't have posted here. ha!
>
> i use cascading CTAS statements so that i can examine the intermediate
> tables.  Another approach is to use CTE's but while that makes things
> easier to read it's still one big query and you don't get insight to the
> "work" tables.
>
> yes, it could take longer execution time if those intermediate tables
> can't be run in parallel but small price to pay compared to human debug
> time in my book anyway.
>
> thoughts?
>
> Cheers,
> Stephen.
>
>
>
>
>
> On Wed, Dec 21, 2016 at 10:07 AM, Saumitra Shahapure <
> saumitra.offic...@gmail.com> wrote:
>
>> Hi Elliot,
>>
>> Thanks for letting me know. HPL-SQL sounded particularly interesting. But
>> in the documentation I could not see any way to pass output generated by
>> one Hive query to the next one. The tool looks good as a homogeneous PL-SQL
>> platform for multiple big-data systems (http://www.hplsql.org/about).
>>
>> However in order to break single complex hive query, DDLs look to be only
>> way in HPL-SQL too. Or is there any alternate way that I might have missed?
>>
>> -- Saumitra S. Shahapure
>>
>> On Thu, Dec 15, 2016 at 6:21 PM, Elliot West  wrote:
>>
>>> I notice that HPL/SQL is not mentioned on the page I referenced, however
>>> I expect that is another approach that you could use to modularise:
>>>
>>> https://cwiki.apache.org/confluence/pages/viewpage.action?pa
>>> geId=59690156
>>> http://www.hplsql.org/doc
>>>
>>> On 15 December 2016 at 17:17, Elliot West  wrote:
>>>
 Some options are covered here, although there is no definitive guidance
 as far as I know:

 https://cwiki.apache.org/confluence/display/Hive/Unit+Testin
 g+Hive+SQL#UnitTestingHiveSQL-Modularisation

 On 15 December 2016 at 17:08, Saumitra Shahapure <
 saumitra.offic...@gmail.com> wrote:

> Hello,
>
> We are running and maintaining quite big and complex Hive SELECT query
> right now. It's basically a single SELECT query which performs JOIN of
> about ten other SELECT query outputs.
>
> A simplest way to refactor that we can think of is to break this query
> down into multiple views and then join the views. There is similar
> possibility to create intermediate tables.
>
> However creating multiple DDLs in order to maintain a single DML is
> not very smooth. We would end up polluting metadata database by creating
> views / intermediate tables which are used in just this ETL.
>
> What are the other efficient ways to maintain complex SQL queries
> written in Hive? Are there better ways to break Hive query into multiple
> modules?
>
> -- Saumitra S. Shahapure
>


>>>
>>
>


Re: Maintaining big and complex Hive queries

2016-12-21 Thread Stephen Sprague
my 2 cents. :)

as soon as you say "complex query" i would submit you've lost the upperhand
and you're behind the eight-ball right off the bat.  And you know this too
otherwise you wouldn't have posted here. ha!

i use cascading CTAS statements so that i can examine the intermediate
tables.  Another approach is to use CTE's but while that makes things
easier to read it's still one big query and you don't get insight to the
"work" tables.

yes, it could take longer execution time if those intermediate tables can't
be run in parallel but small price to pay compared to human debug time in
my book anyway.

thoughts?

Cheers,
Stephen.





On Wed, Dec 21, 2016 at 10:07 AM, Saumitra Shahapure <
saumitra.offic...@gmail.com> wrote:

> Hi Elliot,
>
> Thanks for letting me know. HPL-SQL sounded particularly interesting. But
> in the documentation I could not see any way to pass output generated by
> one Hive query to the next one. The tool looks good as a homogeneous PL-SQL
> platform for multiple big-data systems (http://www.hplsql.org/about).
>
> However in order to break single complex hive query, DDLs look to be only
> way in HPL-SQL too. Or is there any alternate way that I might have missed?
>
> -- Saumitra S. Shahapure
>
> On Thu, Dec 15, 2016 at 6:21 PM, Elliot West  wrote:
>
>> I notice that HPL/SQL is not mentioned on the page I referenced, however
>> I expect that is another approach that you could use to modularise:
>>
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=59690156
>> http://www.hplsql.org/doc
>>
>> On 15 December 2016 at 17:17, Elliot West  wrote:
>>
>>> Some options are covered here, although there is no definitive guidance
>>> as far as I know:
>>>
>>> https://cwiki.apache.org/confluence/display/Hive/Unit+Testin
>>> g+Hive+SQL#UnitTestingHiveSQL-Modularisation
>>>
>>> On 15 December 2016 at 17:08, Saumitra Shahapure <
>>> saumitra.offic...@gmail.com> wrote:
>>>
 Hello,

 We are running and maintaining quite big and complex Hive SELECT query
 right now. It's basically a single SELECT query which performs JOIN of
 about ten other SELECT query outputs.

 A simplest way to refactor that we can think of is to break this query
 down into multiple views and then join the views. There is similar
 possibility to create intermediate tables.

 However creating multiple DDLs in order to maintain a single DML is not
 very smooth. We would end up polluting metadata database by creating views
 / intermediate tables which are used in just this ETL.

 What are the other efficient ways to maintain complex SQL queries
 written in Hive? Are there better ways to break Hive query into multiple
 modules?

 -- Saumitra S. Shahapure

>>>
>>>
>>
>


Re: Maintaining big and complex Hive queries

2016-12-21 Thread Saumitra Shahapure
Hi Elliot,

Thanks for letting me know. HPL-SQL sounded particularly interesting. But
in the documentation I could not see any way to pass output generated by
one Hive query to the next one. The tool looks good as a homogeneous PL-SQL
platform for multiple big-data systems (http://www.hplsql.org/about).

However in order to break single complex hive query, DDLs look to be only
way in HPL-SQL too. Or is there any alternate way that I might have missed?

-- Saumitra S. Shahapure

On Thu, Dec 15, 2016 at 6:21 PM, Elliot West  wrote:

> I notice that HPL/SQL is not mentioned on the page I referenced, however I
> expect that is another approach that you could use to modularise:
>
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=59690156
> http://www.hplsql.org/doc
>
> On 15 December 2016 at 17:17, Elliot West  wrote:
>
>> Some options are covered here, although there is no definitive guidance
>> as far as I know:
>>
>> https://cwiki.apache.org/confluence/display/Hive/Unit+Testin
>> g+Hive+SQL#UnitTestingHiveSQL-Modularisation
>>
>> On 15 December 2016 at 17:08, Saumitra Shahapure <
>> saumitra.offic...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> We are running and maintaining quite big and complex Hive SELECT query
>>> right now. It's basically a single SELECT query which performs JOIN of
>>> about ten other SELECT query outputs.
>>>
>>> A simplest way to refactor that we can think of is to break this query
>>> down into multiple views and then join the views. There is similar
>>> possibility to create intermediate tables.
>>>
>>> However creating multiple DDLs in order to maintain a single DML is not
>>> very smooth. We would end up polluting metadata database by creating views
>>> / intermediate tables which are used in just this ETL.
>>>
>>> What are the other efficient ways to maintain complex SQL queries
>>> written in Hive? Are there better ways to break Hive query into multiple
>>> modules?
>>>
>>> -- Saumitra S. Shahapure
>>>
>>
>>
>


Re: Maintaining big and complex Hive queries

2016-12-15 Thread Elliot West
I notice that HPL/SQL is not mentioned on the page I referenced, however I
expect that is another approach that you could use to modularise:

https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=59690156
http://www.hplsql.org/doc

On 15 December 2016 at 17:17, Elliot West  wrote:

> Some options are covered here, although there is no definitive guidance as
> far as I know:
>
> https://cwiki.apache.org/confluence/display/Hive/Unit+Testing+Hive+SQL#
> UnitTestingHiveSQL-Modularisation
>
> On 15 December 2016 at 17:08, Saumitra Shahapure <
> saumitra.offic...@gmail.com> wrote:
>
>> Hello,
>>
>> We are running and maintaining quite big and complex Hive SELECT query
>> right now. It's basically a single SELECT query which performs JOIN of
>> about ten other SELECT query outputs.
>>
>> A simplest way to refactor that we can think of is to break this query
>> down into multiple views and then join the views. There is similar
>> possibility to create intermediate tables.
>>
>> However creating multiple DDLs in order to maintain a single DML is not
>> very smooth. We would end up polluting metadata database by creating views
>> / intermediate tables which are used in just this ETL.
>>
>> What are the other efficient ways to maintain complex SQL queries written
>> in Hive? Are there better ways to break Hive query into multiple modules?
>>
>> -- Saumitra S. Shahapure
>>
>
>


Re: Maintaining big and complex Hive queries

2016-12-15 Thread Elliot West
Some options are covered here, although there is no definitive guidance as
far as I know:

https://cwiki.apache.org/confluence/display/Hive/Unit+Testing+Hive+SQL#UnitTestingHiveSQL-Modularisation

On 15 December 2016 at 17:08, Saumitra Shahapure <
saumitra.offic...@gmail.com> wrote:

> Hello,
>
> We are running and maintaining quite big and complex Hive SELECT query
> right now. It's basically a single SELECT query which performs JOIN of
> about ten other SELECT query outputs.
>
> A simplest way to refactor that we can think of is to break this query
> down into multiple views and then join the views. There is similar
> possibility to create intermediate tables.
>
> However creating multiple DDLs in order to maintain a single DML is not
> very smooth. We would end up polluting metadata database by creating views
> / intermediate tables which are used in just this ETL.
>
> What are the other efficient ways to maintain complex SQL queries written
> in Hive? Are there better ways to break Hive query into multiple modules?
>
> -- Saumitra S. Shahapure
>


Maintaining big and complex Hive queries

2016-12-15 Thread Saumitra Shahapure
Hello,

We are running and maintaining quite big and complex Hive SELECT query
right now. It's basically a single SELECT query which performs JOIN of
about ten other SELECT query outputs.

A simplest way to refactor that we can think of is to break this query down
into multiple views and then join the views. There is similar possibility
to create intermediate tables.

However creating multiple DDLs in order to maintain a single DML is not
very smooth. We would end up polluting metadata database by creating views
/ intermediate tables which are used in just this ETL.

What are the other efficient ways to maintain complex SQL queries written
in Hive? Are there better ways to break Hive query into multiple modules?

-- Saumitra S. Shahapure