[jira] [Created] (HIVE-12231) StorageBasedAuthorization requires write permission on the default warehouse path when executing "CREATE DATABASE $Name LOCATION '$ExternalPath'"

2015-10-22 Thread WangMeng (JIRA)
WangMeng created HIVE-12231:
---

 Summary: StorageBasedAuthorization requires write permission on the 
default warehouse path when executing "CREATE DATABASE $Name LOCATION 
'$ExternalPath'"
 Key: HIVE-12231
 URL: https://issues.apache.org/jira/browse/HIVE-12231
 Project: Hive
  Issue Type: Bug
Affects Versions: 1.2.1
Reporter: WangMeng


Please look at the stack trace. With StorageBasedAuthorization enabled, when I 
create a database with an external LOCATION, Hive still checks write permission 
on the default warehouse "/user/hive/warehouse":
> create database test location '/tmp/wangmeng/test';
Error: Error while compiling statement: FAILED: HiveException 
java.security.AccessControlException: Permission denied: user=wangmeng, 
access=WRITE, inode="/user/hive/warehouse":hive:hive:drwxr-x--t
at 
org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkFsPermission(DefaultAuthorizationProvider.java:255)
at 
org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.check(DefaultAuthorizationProvider.java:236)
at 
org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkPermission(DefaultAuthorizationProvider.java:151)
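A sketch of the expected authorization split, assuming user wangmeng can write 
under /tmp/wangmeng but not under the default warehouse (test2 is an 
illustrative name for the contrast case):

{code}
-- With an explicit LOCATION, only the external parent directory should be
-- checked for WRITE; the default warehouse should not be involved.
CREATE DATABASE test LOCATION '/tmp/wangmeng/test';   -- expected to succeed

-- Without LOCATION the database lands under /user/hive/warehouse, where a
-- warehouse WRITE check is legitimately required.
CREATE DATABASE test2;                                -- expected to fail for this user
{code}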



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-12232) Creating an external table fails when StorageBasedAuthorization is enabled

2015-10-22 Thread WangMeng (JIRA)
WangMeng created HIVE-12232:
---

 Summary: Creating an external table fails when 
StorageBasedAuthorization is enabled
 Key: HIVE-12232
 URL: https://issues.apache.org/jira/browse/HIVE-12232
 Project: Hive
  Issue Type: Bug
Affects Versions: 1.2.1
Reporter: WangMeng
Assignee: WangMeng


Please look at the stack trace. With StorageBasedAuthorization enabled, 
creating an external table fails with a write-permission error on the default 
warehouse path "/user/hive/warehouse": 

> CREATE EXTERNAL TABLE test(id int) LOCATION '/tmp/wangmeng/test';
Error: Error while compiling statement: FAILED: HiveException 
java.security.AccessControlException: Permission denied: user=wangmeng, 
access=WRITE, inode="/user/hive/warehouse":hive:hive:drwxr-x--t.
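Likewise for tables, a sketch of the expected split (managed_test is an 
illustrative name):

{code}
-- An external table writes nothing under the warehouse, so only the parent of
-- '/tmp/wangmeng/test' should need WRITE permission.
CREATE EXTERNAL TABLE test(id int) LOCATION '/tmp/wangmeng/test';  -- expected to succeed

-- A managed table is created under the warehouse, so a warehouse WRITE check
-- is legitimately required there.
CREATE TABLE managed_test(id int);
{code}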



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-12085) When using HiveServer2 JDBC, creating a Hive DB fails if a DB location is set

2015-10-09 Thread WangMeng (JIRA)
WangMeng created HIVE-12085:
---

 Summary: When using HiveServer2 JDBC, creating a Hive DB fails 
if a DB location is set 
 Key: HIVE-12085
 URL: https://issues.apache.org/jira/browse/HIVE-12085
 Project: Hive
  Issue Type: Bug
Affects Versions: 1.2.1
Reporter: WangMeng






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-11880) IndexOutOfBoundsException when executing a query with a filter condition on a type-incompatible column (A) on data (composed by UNION ALL when a union column is constant and it

2015-09-18 Thread WangMeng (JIRA)
WangMeng created HIVE-11880:
---

 Summary: IndexOutOfBoundsException when executing a query with a 
filter condition on a type-incompatible column (A) over data composed by 
UNION ALL, when a union column is constant and its type is incompatible with 
the corresponding column 
 Key: HIVE-11880
 URL: https://issues.apache.org/jira/browse/HIVE-11880
 Project: Hive
  Issue Type: Bug
  Components: Query Processor
Affects Versions: 1.2.1
Reporter: WangMeng
Assignee: WangMeng


In Hive UNION ALL, when a union column is a constant (column a) whose type is 
incompatible with the corresponding column A, a query with a filter condition 
on the type-incompatible column a over the UNION ALL result throws an 
IndexOutOfBoundsException.

For example, with TPC-H table orders:
CREATE VIEW `view_orders` AS select `oo`.`o_orderkey`, `oo`.`o_custkey` from 
( select `rcfileorders`.`o_orderkey`, `rcfileorders`.`o_custkey` from 
`tpch270g`.`rcfileorders` union all select `textfileorders`.`o_orderkey`, 
0L as `o_custkey` from `tpch270g`.`textfileorders` ) `oo`;

The type of "o_custkey" is INT, while the type of the corresponding constant 
column 0L is BIGINT. The following query (filtering on the incompatible column 
o_custkey) then fails with java.lang.IndexOutOfBoundsException:
select count(1) from view_orders where o_custkey < 10;
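A possible workaround, sketched on the view above rather than a committed fix: 
give the constant branch the real column's type explicitly, so both UNION ALL 
branches agree and no incompatible constant is left for the filter to trip over.

{code}
CREATE VIEW `view_orders` AS
SELECT `oo`.`o_orderkey`, `oo`.`o_custkey`
FROM (
  SELECT `rcfileorders`.`o_orderkey`, `rcfileorders`.`o_custkey`
  FROM `tpch270g`.`rcfileorders`
  UNION ALL
  SELECT `textfileorders`.`o_orderkey`,
         CAST(0 AS INT) AS `o_custkey`   -- was 0L, i.e. a BIGINT constant
  FROM `tpch270g`.`textfileorders`
) `oo`;

-- With both branches typed INT, the filter compiles and runs:
SELECT count(1) FROM view_orders WHERE o_custkey < 10;
{code}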



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-11695) HQL "write to LOCAL DIRECTORY" does not throw an exception when the Hive user lacks write permission on the DIRECTORY

2015-08-31 Thread WangMeng (JIRA)
WangMeng created HIVE-11695:
---

 Summary: HQL "write to LOCAL DIRECTORY" does not throw an exception 
when the Hive user lacks write permission on the DIRECTORY
 Key: HIVE-11695
 URL: https://issues.apache.org/jira/browse/HIVE-11695
 Project: Hive
  Issue Type: Bug
  Components: Query Processor
Affects Versions: 1.2.1, 1.1.0, 1.2.0, 1.0.0, 0.14.0, 0.13.0
Reporter: WangMeng
Assignee: WangMeng


For a Hive user who does not have write permission on a LOCAL DIRECTORY such as 
"/data/wangmeng/", executing the HQL "insert overwrite LOCAL DIRECTORY 
'/data/wangmeng/hiveserver2' ..." does not throw any exception and pretends to 
have finished successfully.
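A minimal repro sketch under the same assumption; src stands in for any 
readable source table, and the user running it must not own /data/wangmeng:

{code}
-- Expected: a permission error. Observed: the statement reports success and
-- silently writes nothing to the directory.
INSERT OVERWRITE LOCAL DIRECTORY '/data/wangmeng/hiveserver2'
SELECT * FROM src LIMIT 10;
{code}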



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-11149) Fix thread-unsafe HashMap in PerfLogger.java that hangs in a multi-threaded environment

2015-06-30 Thread WangMeng (JIRA)
WangMeng created HIVE-11149:
---

 Summary: Fix thread-unsafe HashMap in PerfLogger.java that hangs 
in a multi-threaded environment
 Key: HIVE-11149
 URL: https://issues.apache.org/jira/browse/HIVE-11149
 Project: Hive
  Issue Type: Bug
  Components: Logging
Affects Versions: 1.2.0
Reporter: WangMeng
Assignee: WangMeng
 Fix For: 1.2.0


In a multi-threaded environment, the thread-unsafe HashMap in PerfLogger.java 
can hang and waste large amounts of CPU and memory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-10971) count(*) with count(distinct) gives wrong results when hive.groupby.skewindata=true

2015-06-09 Thread wangmeng (JIRA)
wangmeng created HIVE-10971:
---

 Summary: count(*) with count(distinct) gives wrong results when 
hive.groupby.skewindata=true
 Key: HIVE-10971
 URL: https://issues.apache.org/jira/browse/HIVE-10971
 Project: Hive
  Issue Type: Bug
  Components: Hive
Affects Versions: 1.2.0
Reporter: wangmeng
Assignee: wangmeng


When hive.groupby.skewindata=true, the following query based on TPC-H gives 
wrong results:

{code}
set hive.groupby.skewindata=true;

select l_returnflag, count(*), count(distinct l_linestatus)
from lineitem
group by l_returnflag
limit 10;
{code}

The query plan shows that it generates only one MapReduce job instead of the 
two dictated by hive.groupby.skewindata=true.

The problem arises only when {noformat}count(*){noformat} and 
{noformat}count(distinct){noformat} appear together.
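One workaround sketch, assuming the wrong results come from collapsing both 
aggregates into the single-job plan: compute the two aggregates in separate 
subqueries, each of which gets the proper skew-handling plan, and join the 
partial results.

{code}
set hive.groupby.skewindata=true;

select t1.l_returnflag, t1.cnt_all, t2.cnt_distinct
from (select l_returnflag, count(*) as cnt_all
      from lineitem
      group by l_returnflag) t1
join (select l_returnflag, count(distinct l_linestatus) as cnt_distinct
      from lineitem
      group by l_returnflag) t2
  on t1.l_returnflag = t2.l_returnflag
limit 10;
{code}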



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-7822) how to merge two hive metastores' metadata stored in different databases (such as mysql)

2014-09-13 Thread wangmeng (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133095#comment-14133095
 ] 

wangmeng commented on HIVE-7822:


Thanks for your advice! (Sent from NetEase Mail for mobile.)

On 2014-09-14 11:21, Xuefu Zhang (JIRA) wrote:

[ 
https://issues.apache.org/jira/browse/HIVE-7822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133064#comment-14133064
 ] 

Xuefu Zhang commented on HIVE-7822:
---

[~wangmeng] JIRA is for reporting issues or requesting features. Questions like 
the one you presented are better sent to the user list. I'm closing this JIRA.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-3421) Column Level Top K Values Statistics

2014-08-22 Thread wangmeng (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106624#comment-14106624
 ] 

wangmeng commented on HIVE-3421:


This is very useful! I am waiting for the coming version.

 Column Level Top K Values Statistics
 

 Key: HIVE-3421
 URL: https://issues.apache.org/jira/browse/HIVE-3421
 Project: Hive
  Issue Type: New Feature
Reporter: Feng Lu
Assignee: Feng Lu
 Attachments: HIVE-3421.patch.1.txt, HIVE-3421.patch.2.txt, 
 HIVE-3421.patch.3.txt, HIVE-3421.patch.4.txt, HIVE-3421.patch.5.txt, 
 HIVE-3421.patch.6.txt, HIVE-3421.patch.7.txt, HIVE-3421.patch.8.txt, 
 HIVE-3421.patch.9.txt, HIVE-3421.patch.txt


 Compute (estimate) top-k value statistics for each column, and put the most 
 skewed column into the skewed info if the user hasn't specified skew.
 This feature depends on ListBucketing (CREATE TABLE ... SKEWED ON): 
 https://cwiki.apache.org/Hive/listbucketing.html.
 All columns' top-k values could be added to the skewed info if, in the future, 
 skewed info supports multiple independent columns.
 The top-k algorithm is based on this paper:
 http://www.cs.ucsb.edu/research/tech_reports/reports/2005-23.pdf
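For context, a minimal sketch of the ListBucketing DDL this feature builds on 
(table name and skewed values are illustrative):

{code}
-- Skewed info is declared by hand today; this issue proposes deriving it from
-- computed top-k column statistics when the user has not specified skew.
CREATE TABLE shops (shop_id STRING, sales INT)
SKEWED BY (shop_id) ON ('shop_1', 'shop_42')
STORED AS DIRECTORIES;
{code}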



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (HIVE-7822) how to merge two hive metastores' metadata stored in different databases (such as mysql)

2014-08-20 Thread wangmeng (JIRA)
wangmeng created HIVE-7822:
--

 Summary: How to merge two Hive metastores' metadata stored in 
different databases (such as MySQL)
 Key: HIVE-7822
 URL: https://issues.apache.org/jira/browse/HIVE-7822
 Project: Hive
  Issue Type: Improvement
Reporter: wangmeng


Hi, what is a good way to merge two Hive metastores' metadata stored in 
different databases (such as MySQL)?

Is there any way to get all historical HQL statements from the metastore? I 
think I would need to run those statements again against the other Hive 
metastore database.

Thanks
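One possible route, sketched with illustrative database and table names: 
instead of replaying historical HQL, recreate each object in the target 
warehouse with Hive's EXPORT/IMPORT statements.

{code}
-- On the cluster backed by the source metastore:
USE db1;
EXPORT TABLE orders TO '/tmp/export/orders';

-- On the cluster backed by the target metastore, after copying the export
-- directory across (e.g. with distcp):
IMPORT TABLE orders FROM '/tmp/export/orders';
{code}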



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-964) handle skewed keys for a join in a separate job

2014-07-22 Thread wangmeng (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069997#comment-14069997
 ] 

wangmeng commented on HIVE-964:
---

If the two join tables have the same big skew key on one value (for example, in 
select * from table A join B on A.id = B.id, both table A and table B have a 
lot of rows with id = 1, so a map join will OOM), how is this case fixed? Will 
it fall back to a common join? 
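For reference, a sketch of the runtime skew-join settings this issue added 
(values illustrative); the question above is exactly whether the follow-up map 
join survives a key that is hot on both sides:

{code}
set hive.optimize.skewjoin=true;   -- handle skewed keys in a follow-up job
set hive.skewjoin.key=100000;      -- row threshold before a key counts as skewed
-- The skewed rows are then joined by a conditional follow-up task (a map join);
-- if id = 1 is huge on both A and B, that map join itself can run out of memory.
{code}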

 handle skewed keys for a join in a separate job
 ---

 Key: HIVE-964
 URL: https://issues.apache.org/jira/browse/HIVE-964
 Project: Hive
  Issue Type: Improvement
  Components: Query Processor
Reporter: Namit Jain
Assignee: He Yongqiang
 Fix For: 0.6.0

 Attachments: hive-964-2009-12-17.txt, hive-964-2009-12-28-2.patch, 
 hive-964-2009-12-29-4.patch, hive-964-2010-01-08.patch, 
 hive-964-2010-01-13-2.patch, hive-964-2010-01-14-3.patch, 
 hive-964-2010-01-15-4.patch


 The skewed keys can be written to a temporary table or file, and a follow-up 
 conditional task can be used to perform the join on those keys.
 As a first step, JDBM can be used for those keys.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (HIVE-7469) skew join keys when two join tables have the same big skew key

2014-07-22 Thread wangmeng (JIRA)
wangmeng created HIVE-7469:
--

 Summary: Skew join keys when two join tables have the same big 
skew key
 Key: HIVE-7469
 URL: https://issues.apache.org/jira/browse/HIVE-7469
 Project: Hive
  Issue Type: Improvement
Reporter: wangmeng


In https://issues.apache.org/jira/browse/HIVE-964 I got a general idea of how 
skewed join keys are dealt with, but one case troubles me: what if the two join 
tables have the same big skew key on one value? For example, in select * from 
table A join B on A.id = B.id, both table A and table B have a lot of rows with 
id = 1. If we use a map join to deal with the skewed key id = 1, it may OOM. So 
how can this case be fixed? Will it fall back to a common join? Thanks.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HIVE-7469) skew join keys when two join tables have the same big skew key

2014-07-22 Thread wangmeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangmeng updated HIVE-7469:
---

Description: 
In https://issues.apache.org/jira/browse/HIVE-964 I got a general idea of how 
skewed join keys are dealt with, but one case troubles me: what if the two join 
tables have the same big skew key on one value? For example, in select * from 
table A join B on A.id = B.id, both table A and table B have a lot of rows with 
id = 1. If we use a map join to deal with the skewed key id = 1, it may OOM. So 
how can this case be fixed? Will it fall back to a common join? Thanks.

  was:
In https://issues.apache.org/jira/browse/HIVE-964 I got a general idea of how 
skewed join keys are dealt with, but one case troubles me: what if the two join 
tables have the same big skew key on one value? For example, in select * from 
table A join B on A.id = B.id, both table A and table B have a lot of rows with 
id = 1. If we use a map join to deal with the skewed key id = 1, it may OOM. So 
how can this case be fixed? Will it fall back to a common join? Thanks.





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HIVE-7469) skew join keys when two join tables have the same big skew key

2014-07-22 Thread wangmeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangmeng updated HIVE-7469:
---

Description: 
In https://issues.apache.org/jira/browse/HIVE-964 I got a general idea of how 
skewed join keys are dealt with; the key point is to use a map join for the 
skewed keys. But one case troubles me: what if the two join tables have the 
same big skew key on one value? For example, in select * from table A join B 
on A.id = B.id, both table A and table B have a lot of rows with id = 1. If we 
use a map join to deal with the skewed key id = 1, it may OOM. So how can this 
case be fixed? Will it fall back to a common join? Thanks.

  was:
In https://issues.apache.org/jira/browse/HIVE-964 I got a general idea of how 
skewed join keys are dealt with; the key is to use a map join for the skewed 
keys. But one case troubles me: what if the two join tables have the same big 
skew key on one value? For example, in select * from table A join B on 
A.id = B.id, both table A and table B have a lot of rows with id = 1. If we 
use a map join to deal with the skewed key id = 1, it may OOM. So how can this 
case be fixed? Will it fall back to a common join? Thanks.





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HIVE-7469) skew join keys when two join tables have the same big skew key

2014-07-22 Thread wangmeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangmeng updated HIVE-7469:
---

Description: 
In https://issues.apache.org/jira/browse/HIVE-964 I got a general idea of how 
skewed join keys are dealt with; the key is to use a map join for the skewed 
keys. But one case troubles me: what if the two join tables have the same big 
skew key on one value? For example, in select * from table A join B on 
A.id = B.id, both table A and table B have a lot of rows with id = 1. If we 
use a map join to deal with the skewed key id = 1, it may OOM. So how can this 
case be fixed? Will it fall back to a common join? Thanks.

  was:
In https://issues.apache.org/jira/browse/HIVE-964 I got a general idea of how 
skewed join keys are dealt with, but one case troubles me: what if the two join 
tables have the same big skew key on one value? For example, in select * from 
table A join B on A.id = B.id, both table A and table B have a lot of rows with 
id = 1. If we use a map join to deal with the skewed key id = 1, it may OOM. So 
how can this case be fixed? Will it fall back to a common join? Thanks.





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-7292) Hive on Spark

2014-07-22 Thread wangmeng (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070044#comment-14070044
 ] 

wangmeng commented on HIVE-7292:


This is a very valuable project!

 Hive on Spark
 -

 Key: HIVE-7292
 URL: https://issues.apache.org/jira/browse/HIVE-7292
 Project: Hive
  Issue Type: Improvement
  Components: Spark
Reporter: Xuefu Zhang
Assignee: Xuefu Zhang
 Attachments: Hive-on-Spark.pdf


 Spark as an open-source data analytics cluster computing framework has gained 
 significant momentum recently. Many Hive users already have Spark installed 
 as their computing backbone. To take advantage of Hive, they still need to 
 have either MapReduce or Tez on their cluster. This initiative will give 
 users a new alternative, so those users can consolidate their backend. 
 Secondly, providing such an alternative further increases Hive's adoption, as 
 it exposes Spark users to a viable, feature-rich, de facto standard SQL tool 
 on Hadoop.
 Finally, allowing Hive to run on Spark also has performance benefits. Hive 
 queries, especially those involving multiple reducer stages, will run faster, 
 thus improving the user experience, as Tez does.
 This is an umbrella JIRA which will cover many coming subtasks. The design 
 doc will be attached here shortly, and will be on the wiki as well. Feedback 
 from the community is greatly appreciated!



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-7296) big data approximate processing at a very low cost based on hive sql

2014-07-05 Thread wangmeng (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14052834#comment-14052834
 ] 

wangmeng commented on HIVE-7296:


Yes, I like it.





--




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HIVE-7296) big data approximate processing at a very low cost based on hive sql

2014-06-27 Thread wangmeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangmeng updated HIVE-7296:
---

Description: 
For big data analysis, we often need the following queries and statistics:

1. Cardinality estimation: count the number of distinct elements in a 
collection, such as unique visitors (UV).
Hive query today: select count(distinct id) from TestTable;

2. Frequency estimation: estimate how many times an element repeats, such as 
the site visits of one user.
Hive query: select count(1) from TestTable where name = 'wangmeng';

3. Heavy hitters (top-k elements): for example, the top 100 shops.
Hive query: select count(1), name from TestTable group by name; (a UDF is 
also needed)

4. Range query: for example, find the number of users aged between 20 and 30.
Hive query: select count(1) from TestTable where age > 20 and age < 30;

5. Membership query: for example, is a user name already registered?

With Hive's execution mechanism, these queries cost a large amount of memory 
and a long query time. However, in many cases we do not need very accurate 
results, and a small error can be tolerated. In such cases, approximate 
processing can greatly improve time and space efficiency.

Based on some theoretical analysis materials, I would very much like to work 
on these new features if possible.

So, is there anything I can do? Many thanks.


  was:
For big data analysis, we often need the following queries and statistics:

1. Cardinality estimation: count the number of distinct elements in a 
collection, such as unique visitors (UV).
Hive query today: select count(distinct id) from TestTable;

2. Frequency estimation: estimate how many times an element repeats, such as 
the site visits of one user.
Hive query: select count(1) from TestTable where name = 'wangmeng';

3. Heavy hitters (top-k elements): for example, the top 100 shops.
Hive query: select count(1), name from TestTable group by name; (a UDF is 
also needed)

4. Range query: for example, find the number of users aged between 20 and 30.
Hive query: select count(1) from TestTable where age > 20 and age < 30;

5. Membership query: for example, is a user name already registered?

With Hive's execution mechanism, these queries cost a large amount of memory 
and a long query time. However, in many cases we do not need very accurate 
results, and a small error can be tolerated. In such cases, approximate 
processing can greatly improve time and space efficiency.

Based on some theoretical analysis materials, I would very much like to work 
on these new features if possible.

I am familiar with Hive and Hadoop, and I have implemented an efficient 
storage format based on Hive (https://github.com/sjtufighter/Data---Storage--).

So, is there anything I can do? Many thanks.






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-7296) big data approximate processing at a very low cost based on hive sql

2014-06-27 Thread wangmeng (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14045604#comment-14045604
 ] 

wangmeng commented on HIVE-7296:


Sorry, they are different features.




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HIVE-7296) big data approximate processing at a very low cost based on hive sql

2014-06-26 Thread wangmeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangmeng updated HIVE-7296:
---

Description: 
For big data analysis, we often need the following queries and statistics:

1. Cardinality estimation: count the number of distinct elements in a 
collection, such as unique visitors (UV).
Hive query today: select count(distinct id) from TestTable;

2. Frequency estimation: estimate how many times an element repeats, such as 
the site visits of one user.
Hive query: select count(1) from TestTable where name = 'wangmeng';

3. Heavy hitters (top-k elements): for example, the top 100 shops.
Hive query: select count(1), name from TestTable group by name; (a UDF is 
also needed)

4. Range query: for example, find the number of users aged between 20 and 30.
Hive query: select count(1) from TestTable where age > 20 and age < 30;

5. Membership query: for example, is a user name already registered?

With Hive's execution mechanism, these queries cost a large amount of memory 
and a long query time. However, in many cases we do not need very accurate 
results, and a small error can be tolerated. In such cases, approximate 
processing can greatly improve time and space efficiency.

Based on some theoretical analysis materials, I would very much like to work 
on these new features.

I am familiar with Hive and Hadoop, and I have implemented an efficient 
storage format based on Hive (https://github.com/sjtufighter/Data---Storage--).

So, is there anything I can do? Many thanks.


  was:
For big data analysis, we often need the following queries and statistics:

1. Cardinality estimation: count the number of distinct elements in a 
collection, such as unique visitors (UV).
Hive query today: select count(distinct id) from TestTable;

2. Frequency estimation: estimate how many times an element repeats, such as 
the site visits of one user.
Hive query: select count(1) from TestTable where name = 'wangmeng';

3. Heavy hitters (top-k elements): for example, the top 100 shops.
Hive query: select count(1), name from TestTable group by name; (a UDF is 
also needed)

4. Range query: for example, find the number of users aged between 20 and 30.
Hive query: select count(1) from TestTable where age > 20 and age < 30;

5. Membership query: for example, is a user name already registered?

With Hive's execution mechanism, these queries cost a large amount of memory 
and a long query time. However, in many cases we do not need very accurate 
results, and a small error can be tolerated. In such cases, approximate 
processing can greatly improve time and space efficiency.

Based on some theoretical analysis materials, I would very much like to work 
on these new features.

I am familiar with Hive and Hadoop, and I have implemented an efficient 
storage format based on Hive (https://github.com/sjtufighter/Data---Storage--).

So, is there anything I can do? Many thanks.






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HIVE-7296) big data approximate processing at a very low cost based on hive sql

2014-06-26 Thread wangmeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangmeng updated HIVE-7296:
---

Description: 
For big data analysis, we often need the following queries and statistics:

1. Cardinality estimation: count the number of distinct elements in a 
collection, such as unique visitors (UV).
Hive query today: select count(distinct id) from TestTable;

2. Frequency estimation: estimate how many times an element repeats, such as 
the site visits of one user.
Hive query: select count(1) from TestTable where name = 'wangmeng';

3. Heavy hitters (top-k elements): for example, the top 100 shops.
Hive query: select count(1), name from TestTable group by name; (a UDF is 
also needed)

4. Range query: for example, find the number of users aged between 20 and 30.
Hive query: select count(1) from TestTable where age > 20 and age < 30;

5. Membership query: for example, is a user name already registered?

With Hive's execution mechanism, these queries cost a large amount of memory 
and a long query time. However, in many cases we do not need very accurate 
results, and a small error can be tolerated. In such cases, approximate 
processing can greatly improve time and space efficiency.

Based on some theoretical analysis materials, I would very much like to work 
on these new features if possible.

I am familiar with Hive and Hadoop, and I have implemented an efficient 
storage format based on Hive (https://github.com/sjtufighter/Data---Storage--).

So, is there anything I can do? Many thanks.


  was:
For big data analysis, we often need the following queries and statistics:

1. Cardinality estimation: count the number of distinct elements in a 
collection, such as unique visitors (UV).
Hive query today: select count(distinct id) from TestTable;

2. Frequency estimation: estimate how many times an element repeats, such as 
the site visits of one user.
Hive query: select count(1) from TestTable where name = 'wangmeng';

3. Heavy hitters (top-k elements): for example, the top 100 shops.
Hive query: select count(1), name from TestTable group by name; (a UDF is 
also needed)

4. Range query: for example, find the number of users aged between 20 and 30.
Hive query: select count(1) from TestTable where age > 20 and age < 30;

5. Membership query: for example, is a user name already registered?

With Hive's execution mechanism, these queries cost a large amount of memory 
and a long query time. However, in many cases we do not need very accurate 
results, and a small error can be tolerated. In such cases, approximate 
processing can greatly improve time and space efficiency.

Based on some theoretical analysis materials, I would very much like to work 
on these new features.

I am familiar with Hive and Hadoop, and I have implemented an efficient 
storage format based on Hive (https://github.com/sjtufighter/Data---Storage--).

So, is there anything I can do? Many thanks.






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (HIVE-7296) big data approximate processing at a very low cost based on hive sql

2014-06-25 Thread wangmeng (JIRA)
wangmeng created HIVE-7296:
--

 Summary: big data approximate processing  at a very  low cost  
based on hive sql 
 Key: HIVE-7296
 URL: https://issues.apache.org/jira/browse/HIVE-7296
 Project: Hive
  Issue Type: New Feature
Reporter: wangmeng


For big data analysis, we often need the following queries and statistics:

1. Cardinality estimation: count the number of distinct elements in a 
collection, such as unique visitors (UV).
Hive query today: select count(distinct id) from TestTable;

2. Frequency estimation: estimate how many times an element repeats, such as 
the site visits of one user.
Hive query: select count(1) from TestTable where name = 'wangmeng';

3. Heavy hitters (top-k elements): for example, the top 100 shops.
Hive query: select count(1), name from TestTable group by name; (a UDF is 
also needed)

4. Range query: for example, find the number of users aged between 20 and 30.
Hive query: select count(1) from TestTable where age > 20 and age < 30;

5. Membership query: for example, is a user name already registered?

With Hive's execution mechanism, these queries cost a large amount of memory 
and a long query time. However, in many cases we do not need very accurate 
results, and a small error can be tolerated. In such cases, approximate 
processing can greatly improve time and space efficiency.

Based on some theoretical analysis materials, I would very much like to work 
on these new features.

I am familiar with Hive and Hadoop, and I have implemented an efficient 
storage format based on Hive (https://github.com/sjtufighter/Data---Storage--).

So, is there anything I can do? Many thanks.
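A sketch of how far built-ins already go and where a new UDAF would be needed; 
percentile_approx exists in Hive, while approx_distinct below is a hypothetical 
name for the proposed estimator, not an existing function:

{code}
-- 4. Range query: approximate quantiles can bracket "age between 20 and 30"
--    without an exact scan (built-in UDAF):
select percentile_approx(age, array(0.25, 0.5, 0.75)) from TestTable;

-- 1. Cardinality estimation with the proposed HyperLogLog-style estimator
--    (hypothetical, would be added by this feature):
-- select approx_distinct(id) from TestTable;
{code}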




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-7277) how to decide reducer numbers according to the input size of the reduce stage rather than the input size of the map stage?

2014-06-24 Thread wangmeng (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14041741#comment-14041741
 ] 

wangmeng commented on HIVE-7277:


As I know, Tez is a new compute engine different from MapReduce; is there 
any solution based on the MapReduce engine?




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-7277) how to decide reducer numbers according to the input size of the reduce stage rather than the input size of the map stage?

2014-06-24 Thread wangmeng (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14041761#comment-14041761
 ] 

wangmeng commented on HIVE-7277:


Well, the MapReduce API cannot fit this logical plan when generating the 
physical plan.





--




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (HIVE-7277) how to decide reducer numbers according to the input size of the reduce stage rather than the input size of the map stage?

2014-06-23 Thread wangmeng (JIRA)
wangmeng created HIVE-7277:
--

 Summary: How to decide reducer numbers according to the input size 
of the reduce stage rather than the input size of the map stage?
 Key: HIVE-7277
 URL: https://issues.apache.org/jira/browse/HIVE-7277
 Project: Hive
  Issue Type: New Feature
Reporter: wangmeng
 Fix For: 0.13.0


As we know, Hive now decides the number of reducers just from the input size 
of the map stage divided by hive.exec.reducers.bytes.per.reducer (default 1G).

But the output size of the map stage may differ greatly from the original 
input size, so I think this strategy for deciding reducer numbers may be 
improper.

So, is there any feature that can decide the reducer number from the output of 
the map stage instead?

As I know, the reduce stage actually begins after some map tasks have 
finished, rather than waiting until the whole map stage has finished, so I 
also think it is improper to decide reducer numbers only when the whole map 
stage has finished.

As someone pointed out, we could estimate the total reducer number from the 
output size of the earliest finished map tasks. However, Hive now uses filter 
pushdown (where clauses), which may result in a big difference between map 
tasks, so this estimation is improper too.

Thanks.
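For reference, a sketch of the settings behind today's map-input-based 
estimate (values illustrative):

{code}
set hive.exec.reducers.bytes.per.reducer=1000000000;  -- map-stage input bytes per reducer
set hive.exec.reducers.max=999;                       -- upper bound on the estimate
set mapred.reduce.tasks=32;                           -- or bypass the estimate entirely
{code}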




--
This message was sent by Atlassian JIRA
(v6.2#6252)