[jira] [Created] (HIVE-7296) big data approximate processing at a very low cost based on hive sql

wangmeng (JIRA) Wed, 25 Jun 2014 21:50:23 -0700

wangmeng created HIVE-7296:
------------------------------

             Summary: big data approximate processing at a very low cost based on hive sql
                 Key: HIVE-7296
                 URL: https://issues.apache.org/jira/browse/HIVE-7296
             Project: Hive
          Issue Type: New Feature
            Reporter: wangmeng



For big data analysis, we often need to run the following kinds of queries and 
statistics:

1. Cardinality Estimation: count the number of distinct elements in a 
collection, such as unique visitors (UV).

Hive query: select count(distinct id) from TestTable;

2. Frequency Estimation: estimate how many times an element occurs, such as 
the number of site visits by a user.

Hive query: select count(1) from TestTable where name = 'wangmeng';

3. Heavy Hitters (top-k elements): such as the top 100 shops.

Hive query: select count(1), name from TestTable group by name; (this also 
needs a UDF ...)

4. Range Query: for example, find the number of users aged between 20 and 30.

Hive query: select count(1) from TestTable where age > 20 and age < 30;

5. Membership Query: for example, is a given user name already registered?
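
Item 5 is given without a query. One well-known approximate structure for 
membership tests is a Bloom filter. The following is only a minimal Python 
sketch of the idea, not Hive code and not part of this proposal; the class 
name, the sizes, and the sample user names are assumptions made for 
illustration.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array with several hash positions per item (illustrative)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes positions by salting one hash function with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means "definitely absent"; True means "probably present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


# Hypothetical registered user names.
registered = BloomFilter()
for name in ["wangmeng", "alice", "bob"]:
    registered.add(name)

print(registered.might_contain("wangmeng"))
```

A filter like this never reports a registered name as absent, but it can 
occasionally report an unregistered name as present; a larger bit array 
lowers that error rate at the cost of memory.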

Because of how Hive executes these queries, they consume a large amount of 
memory and take a long time to run.

However, in many cases we do not need exact results, and a small error can be 
tolerated. In such cases, we can use approximate processing to greatly improve 
time and space efficiency.
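
To make the trade-off concrete, here is a minimal Python sketch of one 
approximate technique for cardinality estimation, a K-Minimum-Values (KMV) 
estimator. It is an assumption chosen for illustration, not an algorithm 
named in this issue and not Hive code; the function names and the default 
k=256 are hypothetical.

```python
import hashlib
import heapq


def _hash01(item):
    """Map an item to a pseudo-uniform float in [0, 1]."""
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def approx_distinct(items, k=256):
    """Estimate the distinct count while keeping only the k smallest hashes."""
    kept = set()   # hash values currently retained
    heap = []      # negated values, so heap[0] is the largest retained hash
    for item in items:
        h = _hash01(item)
        if h in kept:
            continue  # this value is already retained
        if len(heap) < k:
            kept.add(h)
            heapq.heappush(heap, -h)
        elif h < -heap[0]:
            # New hash is smaller than the largest retained one: swap it in.
            evicted = -heapq.heappushpop(heap, -h)
            kept.discard(evicted)
            kept.add(h)
    if len(heap) < k:
        return len(heap)  # fewer than k distinct values seen: count is exact
    kth_smallest = -heap[0]
    return int((k - 1) / kth_smallest)


# Hypothetical visit log: 50,000 rows over 10,000 distinct user ids.
visits = (f"user{i % 10000}" for i in range(50000))
print(approx_distinct(visits))
```

The memory used is bounded by k no matter how many rows are scanned, and 
the relative error shrinks roughly as 1/sqrt(k), which is the kind of 
accuracy-for-space trade described above.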

Based on some theoretical analysis material I have studied, I would very much 
like to do some work on these new features.

I am familiar with Hive and Hadoop, and I have implemented an efficient 
storage format based on Hive 
(https://github.com/sjtufighter/----Data---Storage--).

So, is there anything I can do? Many thanks.




--
This message was sent by Atlassian JIRA
(v6.2#6252)
