[
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13101952#comment-13101952
]
Yin Huai commented on HIVE-2206:
--------------------------------
Almost finish the patch. I did a preliminary test based on TPC-H Q17 and Q18.
My machine has a quad-core Intel Xeon X3220 processor (2.4 GHz), 4GB of RAM, a
500GB hard disk and Ubuntu 11.04. With scale factor 10, the execution time of
Q17 is 1216.94s without the patch versus 713.581s with the patch, and that of
Q18 is 1737.18s without the patch versus 867.334s with the patch.
I am facing a issue which I have not found a good way to solve. Suppose that we
have a query "SELECT * FROM (SELECT L.c1 as c11, R.c2 as c12 FROM L JOIN R ON
L.c1=R.C2) t1 JOIN (SELECT R.c1 as c21, count(distinct R.c2) as c22 FROM R
GROUP BY R.c1) ON t1.c11=t2.c21". In this query, only one MapReduce job is
necessary. However, because Hive will use R.c1 and R.c2 as the key columns of
the original ReduceSinkOperator for the sub-query involving distinct count
function, it is impossible to merged MapReduce jobs of two sub-queries into
one. To optimize this kind of query, I write a new UDF function
count_distinct(...), e.g. count_distinct(R.c2). This count_distinct function
use a HashSet to get the number of distinct records. Is there any better
solution for optimizing this kind of queries? Thanks.
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
> Key: HIVE-2206
> URL: https://issues.apache.org/jira/browse/HIVE-2206
> Project: Hive
> Issue Type: New Feature
> Reporter: He Yongqiang
> Assignee: Yin Huai
> Attachments: Queries, YSmartPatchForHive.patch
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira