> I am interested in working on a project that takes a large number of Hive 
> queries (as well as their meta data like amount of resources used etc) and 
> find out common sub queries and expensive query groups etc.

This was roughly the central research topic of one of the Hive CBO devs, except 
it was implemented for Pig (not Hive).

https://hal.inria.fr/hal-01353891
+
https://github.com/jcamachor/pigreuse

I think there's a lot of interest in this topic for ETL workloads and the goal 
is to pick this up as ETL becomes the target problem.

There's a recent SIGMOD paper which talks about the same sort of reuse.

https://www.microsoft.com/en-us/research/uploads/prod/2018/03/cloudviews-sigmod2018.pdf

If you are interested in looking into this using existing infra in Hive, I 
recommend looking at Zoltan's recent work, which tracks query plans + runtime 
statistics in the RUNTIME_STATS table in the metastore.

You can debug through what this does by running

"explain reoptimization <query>;"
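
As a rough sketch of what that session might look like (the table and column 
names below are made up for illustration, and the three settings are the 
re-execution properties as I remember them from Hive 3.x, so please check the 
names and defaults against your version):

```sql
-- Sketch only: orders/customers and their columns are hypothetical.
-- Re-execution has to be enabled for runtime statistics to be collected.
set hive.query.reexecution.enabled=true;
set hive.query.reexecution.strategies=overlay,reoptimize;
-- Assumption: "metastore" scope is what persists the stats to RUNTIME_STATS;
-- the other scopes I know of are "query" and "hiveserver".
set hive.query.reexecution.stats.persist.scope=metastore;

explain reoptimization
select c.region, count(*)
from orders o
join customers c on o.customer_id = c.id
group by c.region;
```

The plan that comes back should carry the runtime row counts next to the 
optimizer's estimates, which is the sort of per-operator data you would want 
when grouping expensive queries.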

Cheers,
Gopal

