[ANNOUNCE] Hive 0.5.0 released

2010-02-24 Thread Zheng Shao
Hi folks,

We have released Hive 0.5.0.
It should appear on the download page within 24 hours (we are still waiting
for it to be mirrored):

http://hadoop.apache.org/hive/releases.html#Download

-- 
Yours,
Zheng


Build failed in Hudson: Hive-trunk-h0.20 #198

2010-02-24 Thread Apache Hudson Server
See http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.20/198/

--
[...truncated 13323 lines...]
[junit] OK
[junit] Loading data to table srcbucket2
[junit] POSTHOOK: Output: defa...@srcbucket2
[junit] OK
[junit] Loading data to table srcbucket2
[junit] POSTHOOK: Output: defa...@srcbucket2
[junit] OK
[junit] Loading data to table srcbucket2
[junit] POSTHOOK: Output: defa...@srcbucket2
[junit] OK
[junit] Loading data to table srcbucket2
[junit] POSTHOOK: Output: defa...@srcbucket2
[junit] OK
[junit] Loading data to table src
[junit] POSTHOOK: Output: defa...@src
[junit] OK
[junit] Loading data to table src1
[junit] POSTHOOK: Output: defa...@src1
[junit] OK
[junit] Loading data to table src_sequencefile
[junit] POSTHOOK: Output: defa...@src_sequencefile
[junit] OK
[junit] Loading data to table src_thrift
[junit] POSTHOOK: Output: defa...@src_thrift
[junit] OK
[junit] Loading data to table src_json
[junit] POSTHOOK: Output: defa...@src_json
[junit] OK
[junit] diff 
http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/build/ql/test/logs/negative/unknown_function4.q.out
 
http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/ql/src/test/results/compiler/errors/unknown_function4.q.out
[junit] Done query: unknown_function4.q
[junit] Begin query: unknown_table1.q
[junit] Loading data to table srcpart partition {ds=2008-04-08, hr=11}
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-08/hr=11
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-08, hr=12}
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-08/hr=12
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-09, hr=11}
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-09/hr=11
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-09, hr=12}
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-09/hr=12
[junit] OK
[junit] POSTHOOK: Output: defa...@srcbucket
[junit] OK
[junit] Loading data to table srcbucket
[junit] POSTHOOK: Output: defa...@srcbucket
[junit] OK
[junit] Loading data to table srcbucket
[junit] POSTHOOK: Output: defa...@srcbucket
[junit] OK
[junit] POSTHOOK: Output: defa...@srcbucket2
[junit] OK
[junit] Loading data to table srcbucket2
[junit] POSTHOOK: Output: defa...@srcbucket2
[junit] OK
[junit] Loading data to table srcbucket2
[junit] POSTHOOK: Output: defa...@srcbucket2
[junit] OK
[junit] Loading data to table srcbucket2
[junit] POSTHOOK: Output: defa...@srcbucket2
[junit] OK
[junit] Loading data to table srcbucket2
[junit] POSTHOOK: Output: defa...@srcbucket2
[junit] OK
[junit] Loading data to table src
[junit] POSTHOOK: Output: defa...@src
[junit] OK
[junit] Loading data to table src1
[junit] POSTHOOK: Output: defa...@src1
[junit] OK
[junit] Loading data to table src_sequencefile
[junit] POSTHOOK: Output: defa...@src_sequencefile
[junit] OK
[junit] Loading data to table src_thrift
[junit] POSTHOOK: Output: defa...@src_thrift
[junit] OK
[junit] Loading data to table src_json
[junit] POSTHOOK: Output: defa...@src_json
[junit] OK
[junit] diff 
http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/build/ql/test/logs/negative/unknown_table1.q.out
 
http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/ql/src/test/results/compiler/errors/unknown_table1.q.out
[junit] Done query: unknown_table1.q
[junit] Begin query: unknown_table2.q
[junit] Loading data to table srcpart partition {ds=2008-04-08, hr=11}
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-08/hr=11
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-08, hr=12}
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-08/hr=12
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-09, hr=11}
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-09/hr=11
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-09, hr=12}
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-09/hr=12
[junit] OK
[junit] POSTHOOK: Output: defa...@srcbucket
[junit] OK
[junit] Loading data to table srcbucket
[junit] POSTHOOK: Output: defa...@srcbucket
[junit] OK
[junit] Loading data to table srcbucket
[junit] POSTHOOK: Output: defa...@srcbucket
[junit] OK
[junit] POSTHOOK: Output: defa...@srcbucket2
[junit] OK
[junit] Loading data to table srcbucket2
[junit] POSTHOOK: Output: defa...@srcbucket2
[junit] OK
[junit] Loading data to table srcbucket2
[junit] POSTHOOK: Output: defa...@srcbucket2
[junit] OK
[junit] Loading data to table srcbucket2
[junit] POSTHOOK: 

[jira] Created: (HIVE-1193) ensure sorting properties for a table

2010-02-24 Thread Namit Jain (JIRA)
ensure sorting properties for a table
-

 Key: HIVE-1193
 URL: https://issues.apache.org/jira/browse/HIVE-1193
 Project: Hadoop Hive
  Issue Type: New Feature
  Components: Query Processor
Reporter: Namit Jain
 Fix For: 0.6.0


If a table is sorted and data is being inserted into it, we currently don't 
make sure that the data is sorted. Doing so might be useful for some downstream operations.
This cannot be made the default because of backward compatibility, but an option 
can be added for it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (HIVE-1194) sorted merge join

2010-02-24 Thread Namit Jain (JIRA)
sorted merge join
-

 Key: HIVE-1194
 URL: https://issues.apache.org/jira/browse/HIVE-1194
 Project: Hadoop Hive
  Issue Type: New Feature
  Components: Query Processor
Reporter: Namit Jain
Assignee: He Yongqiang
 Fix For: 0.6.0


If the input tables are sorted on the join key, and a mapjoin is being 
performed, it is useful to exploit the sorted properties of the table.
This can lead to substantial cpu savings - this needs to work across bucketed 
map joins also.

Since sorted properties of a table are not currently enforced, a new parameter 
can be added to request the sort-merge join.
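
For illustration only, the kind of merge this issue has in mind could look like the sketch below. It assumes both inputs are already sorted on an integer join key; the Row class and the output format are hypothetical stand-ins, not Hive classes, and this is not the proposed implementation.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sort-merge inner join over two inputs already sorted by key. */
class SortMergeJoinSketch {

  /** A (key, value) row; a hypothetical stand-in for a Hive row. */
  static final class Row {
    final int key;
    final String value;

    Row(int key, String value) {
      this.key = key;
      this.value = value;
    }
  }

  /** Joins two key-sorted lists in one forward pass, without building a hash table. */
  static List<String> join(List<Row> left, List<Row> right) {
    List<String> out = new ArrayList<>();
    int i = 0;
    int j = 0;
    while (i < left.size() && j < right.size()) {
      int lk = left.get(i).key;
      int rk = right.get(j).key;
      if (lk < rk) {
        i++;                       // left key too small, advance left
      } else if (lk > rk) {
        j++;                       // right key too small, advance right
      } else {
        // Find the run of rows sharing this key on each side.
        int iEnd = i;
        while (iEnd < left.size() && left.get(iEnd).key == lk) {
          iEnd++;
        }
        int jEnd = j;
        while (jEnd < right.size() && right.get(jEnd).key == lk) {
          jEnd++;
        }
        // Emit the cross product of the two runs.
        for (int a = i; a < iEnd; a++) {
          for (int b = j; b < jEnd; b++) {
            out.add(lk + ":" + left.get(a).value + "," + right.get(b).value);
          }
        }
        i = iEnd;
        j = jEnd;
      }
    }
    return out;
  }
}
```

Each input is scanned once and only the current run of equal keys is held, which is where the CPU savings over a hash-based map join would come from. The result is only correct if both inputs really are sorted, which is why enforcing the sort property matters.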





[jira] Created: (HIVE-1195) Increase ObjectInspector[] length on demand

2010-02-24 Thread Zheng Shao (JIRA)
Increase ObjectInspector[] length on demand
---

 Key: HIVE-1195
 URL: https://issues.apache.org/jira/browse/HIVE-1195
 Project: Hadoop Hive
  Issue Type: Improvement
Affects Versions: 0.5.0, 0.6.0
Reporter: Zheng Shao
Assignee: Zheng Shao


{code}
Operator.java
  protected transient ObjectInspector[] inputObjInspectors =
      new ObjectInspector[Short.MAX_VALUE];
{code}

An array of 32K elements takes 256KB memory under 64-bit Java.
We are seeing hive client going out of memory because of that.
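
A rough sketch of what "increase on demand" could mean is shown below. It is not the attached patch: the class and method names are made up for illustration, and Object[] stands in for ObjectInspector[]. The array starts small and is enlarged only when an index beyond its current length is written.

```java
import java.util.Arrays;

/** Hypothetical holder that grows its slot array only when a larger index is used. */
class GrowOnDemandSketch {
  // Object[] stands in for ObjectInspector[]; start with one slot instead of Short.MAX_VALUE.
  private Object[] slots = new Object[1];

  /** Stores a value at a non-negative index, enlarging the array if needed. */
  void set(int index, Object value) {
    if (index >= slots.length) {
      // At least double, so repeated growth stays cheap on average.
      int newLength = Math.max(index + 1, slots.length * 2);
      slots = Arrays.copyOf(slots, newLength);
    }
    slots[index] = value;
  }

  /** Returns the value at a non-negative index, or null if that slot was never allocated. */
  Object get(int index) {
    return index < slots.length ? slots[index] : null;
  }

  /** Current number of allocated slots. */
  int capacity() {
    return slots.length;
  }
}
```

With this shape, an operator with a handful of parents pays for a handful of references rather than 32K of them.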





[jira] Updated: (HIVE-1195) Increase ObjectInspector[] length on demand

2010-02-24 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated HIVE-1195:
-

Attachment: HIVE-1195.1.patch

 Increase ObjectInspector[] length on demand
 ---

 Key: HIVE-1195
 URL: https://issues.apache.org/jira/browse/HIVE-1195
 Project: Hadoop Hive
  Issue Type: Improvement
Affects Versions: 0.5.0, 0.6.0
Reporter: Zheng Shao
Assignee: Zheng Shao
 Attachments: HIVE-1195.1.patch


 {code}
 Operator.java
   protected transient ObjectInspector[] inputObjInspectors = new 
 ObjectInspector[Short.MAX_VALUE];
 {code}
 An array of 32K elements takes 256KB memory under 64-bit Java.
 We are seeing hive client going out of memory because of that.




[jira] Updated: (HIVE-1195) Increase ObjectInspector[] length on demand

2010-02-24 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated HIVE-1195:
-

Fix Version/s: 0.6.0
   0.5.1
   Status: Patch Available  (was: Open)

 Increase ObjectInspector[] length on demand
 ---

 Key: HIVE-1195
 URL: https://issues.apache.org/jira/browse/HIVE-1195
 Project: Hadoop Hive
  Issue Type: Improvement
Affects Versions: 0.5.0, 0.6.0
Reporter: Zheng Shao
Assignee: Zheng Shao
 Fix For: 0.5.1, 0.6.0

 Attachments: HIVE-1195.1.patch


 {code}
 Operator.java
   protected transient ObjectInspector[] inputObjInspectors = new 
 ObjectInspector[Short.MAX_VALUE];
 {code}
 An array of 32K elements takes 256KB memory under 64-bit Java.
 We are seeing hive client going out of memory because of that.




[jira] Commented: (HIVE-1195) Increase ObjectInspector[] length on demand

2010-02-24 Thread Ning Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12838080#action_12838080
 ] 

Ning Zhang commented on HIVE-1195:
--

+1 Will commit after tests. 

 Increase ObjectInspector[] length on demand
 ---

 Key: HIVE-1195
 URL: https://issues.apache.org/jira/browse/HIVE-1195
 Project: Hadoop Hive
  Issue Type: Improvement
Affects Versions: 0.5.0, 0.6.0
Reporter: Zheng Shao
Assignee: Zheng Shao
 Fix For: 0.5.1, 0.6.0

 Attachments: HIVE-1195.1.patch


 {code}
 Operator.java
   protected transient ObjectInspector[] inputObjInspectors = new 
 ObjectInspector[Short.MAX_VALUE];
 {code}
 An array of 32K elements takes 256KB memory under 64-bit Java.
 We are seeing hive client going out of memory because of that.




[jira] Updated: (HIVE-1195) Increase ObjectInspector[] length on demand

2010-02-24 Thread Ning Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ning Zhang updated HIVE-1195:
-

Attachment: HIVE-1195-branch-0.5.patch

Uploading a patch for branch 0.5. Zheng, can you double check?

 Increase ObjectInspector[] length on demand
 ---

 Key: HIVE-1195
 URL: https://issues.apache.org/jira/browse/HIVE-1195
 Project: Hadoop Hive
  Issue Type: Improvement
Affects Versions: 0.5.0, 0.6.0
Reporter: Zheng Shao
Assignee: Zheng Shao
 Fix For: 0.5.1, 0.6.0

 Attachments: HIVE-1195-branch-0.5.patch, HIVE-1195.1.patch


 {code}
 Operator.java
   protected transient ObjectInspector[] inputObjInspectors = new 
 ObjectInspector[Short.MAX_VALUE];
 {code}
 An array of 32K elements takes 256KB memory under 64-bit Java.
 We are seeing hive client going out of memory because of that.




[jira] Commented: (HIVE-535) Memory-efficient hash-based Aggregation

2010-02-24 Thread Carl Steinbach (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12838095#action_12838095
 ] 

Carl Steinbach commented on HIVE-535:
-

The folks working on Mahout seem to think the CERN license is compatible with 
Apache. They have
already imported cern.colt*, cern.jet* and cern.clhep into their source tree. 
See MAHOUT-222.

Check out the update to their LICENSE.txt file: 
http://svn.apache.org/repos/asf/lucene/mahout/trunk/LICENSE.txt

 Memory-efficient hash-based Aggregation
 ---

 Key: HIVE-535
 URL: https://issues.apache.org/jira/browse/HIVE-535
 Project: Hadoop Hive
  Issue Type: Improvement
Affects Versions: 0.4.0
Reporter: Zheng Shao

 Currently there is a lot of memory overhead in the hash-based aggregation in 
 GroupByOperator.
 The net result is that GroupByOperator cannot store many entries in 
 its HashTable, flushes frequently, and cannot achieve a very good 
 partial aggregation result.
 Here are some initial thoughts (some of them are from Joydeep a long time ago):
 A1. Serialize the key of the HashTable. This will eliminate the 16-byte 
 per-object overhead of Java in keys (depending on how many objects there are 
 in the key, the saving can be substantial).
 A2. Use more memory-efficient hash tables - java.util.HashMap has about 64 
 bytes of overhead per entry.
 A3. Use primitive arrays to store aggregation results. Basically, the UDAF 
 should manage the array of aggregation results, so UDAFCount should manage a 
 long[], UDAFAvg should manage a double[] and a long[]. The external code 
 should pass an index to iterate/merge/terminate an aggregation result. This 
 will eliminate the 16-byte per-object overhead of Java.
 More ideas are welcome.
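
 As a rough illustration of idea A3 only, an average aggregator that owns a double[] and a long[] and is driven by slot index might look like the sketch below. The class name and method signatures are assumptions made for this example and do not match the existing UDAF interfaces.

```java
import java.util.Arrays;

/** Sketch of idea A3: the aggregator owns primitive arrays; callers pass a slot index. */
class PrimitiveAvgSketch {
  private double[] sums = new double[4];
  private long[] counts = new long[4];
  private int used = 0;

  /** Allocates a new aggregation slot (one per group) and returns its index. */
  int newSlot() {
    if (used == sums.length) {
      // Grow both arrays together; new entries start at zero.
      sums = Arrays.copyOf(sums, used * 2);
      counts = Arrays.copyOf(counts, used * 2);
    }
    return used++;
  }

  /** Folds one input value into the given slot. */
  void iterate(int slot, double value) {
    sums[slot] += value;
    counts[slot]++;
  }

  /** Folds a partial aggregation (sum and count) into the given slot. */
  void merge(int slot, double partialSum, long partialCount) {
    sums[slot] += partialSum;
    counts[slot] += partialCount;
  }

  /** Returns the average for the slot, or NaN if the slot saw no rows. */
  double terminate(int slot) {
    return counts[slot] == 0 ? Double.NaN : sums[slot] / counts[slot];
  }
}
```

 No per-group object is allocated here, so the per-object header overhead mentioned above does not apply to the aggregation state; the hash table would only need to map a key to an int slot.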




[jira] Created: (HIVE-1196) Railroad Diagrams for Hive Language Manual

2010-02-24 Thread Carl Steinbach (JIRA)
Railroad Diagrams for Hive Language Manual
--

 Key: HIVE-1196
 URL: https://issues.apache.org/jira/browse/HIVE-1196
 Project: Hadoop Hive
  Issue Type: Task
  Components: Documentation
Reporter: Carl Steinbach
Priority: Minor


Add railroad diagrams (syntax diagrams) to the Hive Language Manual.

* The [ANTLRWorks IDE|http://www.antlr.org/works/index.html] generates railroad 
diagrams and allows you to export them as EPS.
* [Clapham|http://sourceforge.net/projects/clapham/] is another tool for 
generating railroad diagrams based on BNF style inputs.




[jira] Updated: (HIVE-1195) Increase ObjectInspector[] length on demand

2010-02-24 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated HIVE-1195:
-

Attachment: HIVE-1195.2.patch
HIVE-1195.2.branch-0.5.patch

Fixed an obvious bug which caused unit test failures.

 Increase ObjectInspector[] length on demand
 ---

 Key: HIVE-1195
 URL: https://issues.apache.org/jira/browse/HIVE-1195
 Project: Hadoop Hive
  Issue Type: Improvement
Affects Versions: 0.5.0, 0.6.0
Reporter: Zheng Shao
Assignee: Zheng Shao
 Fix For: 0.5.1, 0.6.0

 Attachments: HIVE-1195-branch-0.5.patch, HIVE-1195.1.patch, 
 HIVE-1195.2.branch-0.5.patch, HIVE-1195.2.patch


 {code}
 Operator.java
   protected transient ObjectInspector[] inputObjInspectors = new 
 ObjectInspector[Short.MAX_VALUE];
 {code}
 An array of 32K elements takes 256KB memory under 64-bit Java.
 We are seeing hive client going out of memory because of that.




[jira] Commented: (HIVE-1195) Increase ObjectInspector[] length on demand

2010-02-24 Thread Ning Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12838112#action_12838112
 ] 

Ning Zhang commented on HIVE-1195:
--

Zheng, join26.q , join_map_ppr.q , union16.q, union9.q, failed on trunk. Can 
you take a look?

 Increase ObjectInspector[] length on demand
 ---

 Key: HIVE-1195
 URL: https://issues.apache.org/jira/browse/HIVE-1195
 Project: Hadoop Hive
  Issue Type: Improvement
Affects Versions: 0.5.0, 0.6.0
Reporter: Zheng Shao
Assignee: Zheng Shao
 Fix For: 0.5.1, 0.6.0

 Attachments: HIVE-1195-branch-0.5.patch, HIVE-1195.1.patch, 
 HIVE-1195.2.branch-0.5.patch, HIVE-1195.2.patch


 {code}
 Operator.java
   protected transient ObjectInspector[] inputObjInspectors = new 
 ObjectInspector[Short.MAX_VALUE];
 {code}
 An array of 32K elements takes 256KB memory under 64-bit Java.
 We are seeing hive client going out of memory because of that.




[jira] Updated: (HIVE-1195) Increase ObjectInspector[] length on demand

2010-02-24 Thread Ning Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ning Zhang updated HIVE-1195:
-

Status: Open  (was: Patch Available)

 Increase ObjectInspector[] length on demand
 ---

 Key: HIVE-1195
 URL: https://issues.apache.org/jira/browse/HIVE-1195
 Project: Hadoop Hive
  Issue Type: Improvement
Affects Versions: 0.5.0, 0.6.0
Reporter: Zheng Shao
Assignee: Zheng Shao
 Fix For: 0.5.1, 0.6.0

 Attachments: HIVE-1195-branch-0.5.patch, HIVE-1195.1.patch, 
 HIVE-1195.2.branch-0.5.patch, HIVE-1195.2.patch


 {code}
 Operator.java
   protected transient ObjectInspector[] inputObjInspectors = new 
 ObjectInspector[Short.MAX_VALUE];
 {code}
 An array of 32K elements takes 256KB memory under 64-bit Java.
 We are seeing hive client going out of memory because of that.




[jira] Commented: (HIVE-1194) sorted merge join

2010-02-24 Thread Namit Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12838113#action_12838113
 ] 

Namit Jain commented on HIVE-1194:
--

Based on an offline discussion with Yongqiang, we were thinking of the following:


There will be a new mapping in MapredWork -
Operator - MapredLocalWork

This will be populated for SortMergeJoinOperator only.

SortMergeJoinOperator is a new operator which extends MapJoinOperator, and has 
the
same name as a MapJoinOperator.

MapJoinProcessor needs to create a SortMergeJoinOperator instead of a 
MapJoinOperator
when it sees the new configuration parameter.

MapJoinFactory methods need to change to create Operator-MapredLocalWork 
instead of
MapredLocalWork in MapredWork.

 sorted merge join
 -

 Key: HIVE-1194
 URL: https://issues.apache.org/jira/browse/HIVE-1194
 Project: Hadoop Hive
  Issue Type: New Feature
  Components: Query Processor
Reporter: Namit Jain
Assignee: He Yongqiang
 Fix For: 0.6.0


 If the input tables are sorted on the join key, and a mapjoin is being 
 performed, it is useful to exploit the sorted properties of the table.
 This can lead to substantial cpu savings - this needs to work across bucketed 
 map joins also.
 Since sorted properties of a table are not currently enforced, a new 
 parameter can be added to request the sort-merge join.




[jira] Commented: (HIVE-1195) Increase ObjectInspector[] length on demand

2010-02-24 Thread Ning Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12838114#action_12838114
 ] 

Ning Zhang commented on HIVE-1195:
--

Cool. I'll take the new patches to test. 

 Increase ObjectInspector[] length on demand
 ---

 Key: HIVE-1195
 URL: https://issues.apache.org/jira/browse/HIVE-1195
 Project: Hadoop Hive
  Issue Type: Improvement
Affects Versions: 0.5.0, 0.6.0
Reporter: Zheng Shao
Assignee: Zheng Shao
 Fix For: 0.5.1, 0.6.0

 Attachments: HIVE-1195-branch-0.5.patch, HIVE-1195.1.patch, 
 HIVE-1195.2.branch-0.5.patch, HIVE-1195.2.patch


 {code}
 Operator.java
   protected transient ObjectInspector[] inputObjInspectors = new 
 ObjectInspector[Short.MAX_VALUE];
 {code}
 An array of 32K elements takes 256KB memory under 64-bit Java.
 We are seeing hive client going out of memory because of that.




[jira] Created: (HIVE-1197) create a new input format where a mapper spans a file

2010-02-24 Thread Namit Jain (JIRA)
create a new input format where a mapper spans a file
-

 Key: HIVE-1197
 URL: https://issues.apache.org/jira/browse/HIVE-1197
 Project: Hadoop Hive
  Issue Type: New Feature
  Components: Query Processor
Reporter: Namit Jain
Assignee: Namit Jain
 Fix For: 0.6.0


This will be needed for Sort merge joins.






[jira] Commented: (HIVE-259) Add PERCENTILE aggregate function

2010-02-24 Thread Zheng Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12838118#action_12838118
 ] 

Zheng Shao commented on HIVE-259:
-

Also see http://wiki.apache.org/hadoop/Hive/HowToContribute#Coding_Convention

 Add PERCENTILE aggregate function
 -

 Key: HIVE-259
 URL: https://issues.apache.org/jira/browse/HIVE-259
 Project: Hadoop Hive
  Issue Type: New Feature
  Components: Query Processor
Reporter: Venky Iyer
Assignee: Jerome Boulon
 Attachments: HIVE-259-2.patch, HIVE-259.1.patch, HIVE-259.patch, 
 jb2.txt, Percentile.xlsx


 Compute at least the 25th, 50th, and 75th percentiles




[jira] Commented: (HIVE-259) Add PERCENTILE aggregate function

2010-02-24 Thread Zheng Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12838119#action_12838119
 ] 

Zheng Shao commented on HIVE-259:
-

The test cases look a bit too trivial, or do the results have problems? They 
always return the same number for the 3 different percentile values.
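
To show why the choice of test data matters, here is a small sketch that uses the nearest-rank definition of a percentile. That definition is an assumption for this example; the patch may define percentiles differently, and the class below is not taken from it.

```java
import java.util.Arrays;

/** Nearest-rank percentile over a small in-memory array (illustrative only). */
class PercentileSketch {

  /** Returns the smallest value such that at least p percent of the data is <= it. */
  static long percentile(long[] values, double p) {
    if (values.length == 0) {
      throw new IllegalArgumentException("no values");
    }
    if (p <= 0 || p > 100) {
      throw new IllegalArgumentException("p must be in (0, 100]");
    }
    long[] sorted = values.clone();   // leave the caller's array untouched
    Arrays.sort(sorted);
    int rank = (int) Math.ceil(p / 100.0 * sorted.length);   // 1-based rank
    return sorted[rank - 1];
  }
}
```

Under this definition, a column where every row holds the same value gives the same answer for the 25th, 50th, and 75th percentiles, so such a test cannot tell a correct implementation from a broken one. A column holding the values 1 through 8 gives 2, 4, and 6.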


 Add PERCENTILE aggregate function
 -

 Key: HIVE-259
 URL: https://issues.apache.org/jira/browse/HIVE-259
 Project: Hadoop Hive
  Issue Type: New Feature
  Components: Query Processor
Reporter: Venky Iyer
Assignee: Jerome Boulon
 Attachments: HIVE-259-2.patch, HIVE-259.1.patch, HIVE-259.patch, 
 jb2.txt, Percentile.xlsx


 Compute at least the 25th, 50th, and 75th percentiles




[jira] Commented: (HIVE-1194) sorted merge join

2010-02-24 Thread Zheng Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12838120#action_12838120
 ] 

Zheng Shao commented on HIVE-1194:
--

Why does SortMergeJoinOperator extend MapJoinOperator?
It seems to me that SortMergeJoinOperator does NOT need the 
in-memory/disk-backed HashMap that MapJoinOperator has, correct?


 sorted merge join
 -

 Key: HIVE-1194
 URL: https://issues.apache.org/jira/browse/HIVE-1194
 Project: Hadoop Hive
  Issue Type: New Feature
  Components: Query Processor
Reporter: Namit Jain
Assignee: He Yongqiang
 Fix For: 0.6.0


 If the input tables are sorted on the join key, and a mapjoin is being 
 performed, it is useful to exploit the sorted properties of the table.
 This can lead to substantial cpu savings - this needs to work across bucketed 
 map joins also.
 Since sorted properties of a table are not currently enforced, a new 
 parameter can be added to request the sort-merge join.




[jira] Commented: (HIVE-1194) sorted merge join

2010-02-24 Thread Namit Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12838121#action_12838121
 ] 

Namit Jain commented on HIVE-1194:
--

Yes, but it happens on the mapper. It is a special type of mapjoin.
It will end up overriding all the functions of map-join, but keeping it this 
way keeps the hierarchy correct.

 sorted merge join
 -

 Key: HIVE-1194
 URL: https://issues.apache.org/jira/browse/HIVE-1194
 Project: Hadoop Hive
  Issue Type: New Feature
  Components: Query Processor
Reporter: Namit Jain
Assignee: He Yongqiang
 Fix For: 0.6.0


 If the input tables are sorted on the join key, and a mapjoin is being 
 performed, it is useful to exploit the sorted properties of the table.
 This can lead to substantial cpu savings - this needs to work across bucketed 
 map joins also.
 Since sorted properties of a table are not currently enforced, a new 
 parameter can be added to request the sort-merge join.




[jira] Commented: (HIVE-1194) sorted merge join

2010-02-24 Thread He Yongqiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12838122#action_12838122
 ] 

He Yongqiang commented on HIVE-1194:


Yes. It does not need that storage. 
The main reason for letting it extend mapjoinop is that we can then reuse 
the mapjoinop code for optimization and task generation.

 sorted merge join
 -

 Key: HIVE-1194
 URL: https://issues.apache.org/jira/browse/HIVE-1194
 Project: Hadoop Hive
  Issue Type: New Feature
  Components: Query Processor
Reporter: Namit Jain
Assignee: He Yongqiang
 Fix For: 0.6.0


 If the input tables are sorted on the join key, and a mapjoin is being 
 performed, it is useful to exploit the sorted properties of the table.
 This can lead to substantial cpu savings - this needs to work across bucketed 
 map joins also.
 Since sorted properties of a table are not currently enforced, a new 
 parameter can be added to request the sort-merge join.




[jira] Commented: (HIVE-1194) sorted merge join

2010-02-24 Thread Namit Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12838130#action_12838130
 ] 

Namit Jain commented on HIVE-1194:
--

A new optimization step will be created which will convert the mapjoin to a 
sortmergejoin

 sorted merge join
 -

 Key: HIVE-1194
 URL: https://issues.apache.org/jira/browse/HIVE-1194
 Project: Hadoop Hive
  Issue Type: New Feature
  Components: Query Processor
Reporter: Namit Jain
Assignee: He Yongqiang
 Fix For: 0.6.0


 If the input tables are sorted on the join key, and a mapjoin is being 
 performed, it is useful to exploit the sorted properties of the table.
 This can lead to substantial cpu savings - this needs to work across bucketed 
 map joins also.
 Since sorted properties of a table are not currently enforced, a new 
 parameter can be added to request the sort-merge join.




[jira] Commented: (HIVE-1194) sorted merge join

2010-02-24 Thread Zheng Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12838132#action_12838132
 ] 

Zheng Shao commented on HIVE-1194:
--

If it does not inherit any methods, shall we add an AbstractMapJoinOperator as 
the common parent?
That AbstractMapJoinOperator can be converted to MapJoinOperator (or 
HashBasedMapJoinOperator, to be accurate) or SortMergeJoinOperator depending on 
the configuration/table properties.
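
To make the suggestion concrete, a bare-bones sketch of the hierarchy could look like the code below. Only the class names come from this thread; the method names, the shared operator name "MAPJOIN", and the boolean flag are assumptions for illustration.

```java
/** Common parent for map-side joins; holds whatever the optimizer code wants to reuse. */
abstract class AbstractMapJoinOperator {

  /** Shared operator name, so existing map-join handling can treat both kinds alike. */
  String getName() {
    return "MAPJOIN";   // assumed value, for illustration
  }

  /** Describes how the join is executed on the mapper. */
  abstract String strategy();

  /** Picks the concrete operator; the flag stands in for the configuration/table properties. */
  static AbstractMapJoinOperator create(boolean useSortMerge) {
    return useSortMerge ? new SortMergeJoinOperator() : new HashBasedMapJoinOperator();
  }
}

/** The existing hash-table based map join. */
class HashBasedMapJoinOperator extends AbstractMapJoinOperator {
  @Override
  String strategy() {
    return "hash";
  }
}

/** The new join that merges sorted inputs and needs no hash table. */
class SortMergeJoinOperator extends AbstractMapJoinOperator {
  @Override
  String strategy() {
    return "sort-merge";
  }
}
```

In this arrangement SortMergeJoinOperator does not inherit the HashMap state, while code written against the abstract parent still applies to both operators.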


 sorted merge join
 -

 Key: HIVE-1194
 URL: https://issues.apache.org/jira/browse/HIVE-1194
 Project: Hadoop Hive
  Issue Type: New Feature
  Components: Query Processor
Reporter: Namit Jain
Assignee: He Yongqiang
 Fix For: 0.6.0


 If the input tables are sorted on the join key, and a mapjoin is being 
 performed, it is useful to exploit the sorted properties of the table.
 This can lead to substantial cpu savings - this needs to work across bucketed 
 map joins also.
 Since sorted properties of a table are not currently enforced, a new 
 parameter can be added to request the sort-merge join.




[jira] Commented: (HIVE-1189) Add package-info.java to Hive

2010-02-24 Thread Zheng Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12838148#action_12838148
 ] 

Zheng Shao commented on HIVE-1189:
--

I am checking the BuildVersion which contains everything.
I need to think of a way to do a negative test.


 Add package-info.java to Hive
 -

 Key: HIVE-1189
 URL: https://issues.apache.org/jira/browse/HIVE-1189
 Project: Hadoop Hive
  Issue Type: New Feature
Affects Versions: 0.6.0
Reporter: Zheng Shao
Assignee: Zheng Shao
 Fix For: 0.6.0

 Attachments: HIVE-1189.1.patch


 Hadoop automatically generates build/src/org/apache/hadoop/package-info.java 
 with information like this:
 {code}
 /*
  * Generated by src/saveVersion.sh
  */
 @HadoopVersionAnnotation(version="0.20.2-dev", revision="826568",
  user="zshao", date="Sun Oct 18 17:46:56 PDT 2009",
  url="http://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20")
 package org.apache.hadoop;
 {code}
 Hive should do the same thing so that we can easily know the version of the 
 code at runtime.
 This will help us identify whether we are still running the same version of 
 Hive, if we serialize the plan and later continue the execution (See 
 HIVE-1100).
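
 A minimal sketch of reading such an annotation at runtime is below. The annotation name, its fields, and the values are placeholders invented for this example. To keep it in one file the annotation sits on a class; a generated package-info.java would put it on the package, and the lookup would then go through getPackage().getAnnotation(...) instead.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

/** Hypothetical version annotation; RUNTIME retention makes it visible via reflection. */
@Retention(RetentionPolicy.RUNTIME)
@Target({ElementType.PACKAGE, ElementType.TYPE})
@interface HiveVersionAnnotation {
  String version();
  String revision();
}

/** Carries placeholder build information and exposes it at runtime. */
@HiveVersionAnnotation(version = "0.6.0-dev", revision = "12345")
class VersionInfoSketch {

  /** Returns "version (rREVISION)", or "unknown" if the annotation is missing. */
  static String describe() {
    HiveVersionAnnotation a =
        VersionInfoSketch.class.getAnnotation(HiveVersionAnnotation.class);
    return a == null ? "unknown" : a.version() + " (r" + a.revision() + ")";
  }
}
```

 A serialized plan could store this string and compare it with the running code's value before resuming execution.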





[jira] Commented: (HIVE-1032) Better Error Messages for Execution Errors

2010-02-24 Thread Paul Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12838149#action_12838149
 ] 

Paul Yang commented on HIVE-1032:
-

Because this patch uses features of HIVE-873, this will not work with hadoop 
0.17. If you want, I can send you the broken queries I used to test on 0.20.

 Better Error Messages for Execution Errors
 --

 Key: HIVE-1032
 URL: https://issues.apache.org/jira/browse/HIVE-1032
 Project: Hadoop Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.6.0
Reporter: Paul Yang
Assignee: Paul Yang
 Attachments: HIVE-1032.1.patch, HIVE-1032.2.patch, HIVE-1032.3.patch, 
 HIVE-1032.4.patch, HIVE-1032.5.patch


 Three common errors that occur during execution are:
 1. Map-side group-by causing an out of memory exception due to large 
 aggregation hash tables
 2. ScriptOperator failing due to the user's script throwing an exception or 
 otherwise returning a non-zero error code
 3. Incorrectly specifying the join order of small and large tables, causing 
 the large table to be loaded into memory and producing an out of memory 
 exception.
 These errors are typically discovered by manually examining the error log 
 files of the failed task. This task proposes to create a feature that would 
 automatically read the error logs and output a probable cause and solution to 
 the command line.
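
 The general shape of such a feature might be a list of patterns, each paired with a probable cause, as in the sketch below. This is not the attached patch. The first pattern is the standard JVM error name; the script-failure message and all of the advice strings are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

/** Matches task log text against known patterns and suggests a probable cause. */
class ErrorHeuristicSketch {
  // Checked in insertion order; the first matching pattern wins.
  private static final Map<Pattern, String> HEURISTICS = new LinkedHashMap<>();

  static {
    HEURISTICS.put(
        Pattern.compile("java\\.lang\\.OutOfMemoryError"),
        "Task ran out of memory; check map-side aggregation and join order.");
    // Hypothetical log message, used only to illustrate a second heuristic.
    HEURISTICS.put(
        Pattern.compile("script exited with code [1-9]\\d*"),
        "User script failed; check the script for errors.");
  }

  /** Returns the advice for the first matching pattern, or a fallback message. */
  static String diagnose(String log) {
    for (Map.Entry<Pattern, String> e : HEURISTICS.entrySet()) {
      if (e.getKey().matcher(log).find()) {
        return e.getValue();
      }
    }
    return "No known cause matched; inspect the task log manually.";
  }
}
```

 Keeping the heuristics in a table makes it easy to add a new pattern when another common failure shows up.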




[jira] Created: (HIVE-1198) When checkstyle is activated for Hive in Eclipse environment, it shows all checkstyle problems as errors.

2010-02-24 Thread Arvind Prabhakar (JIRA)
When checkstyle is activated for Hive in Eclipse environment, it shows all 
checkstyle problems as errors.
-

 Key: HIVE-1198
 URL: https://issues.apache.org/jira/browse/HIVE-1198
 Project: Hadoop Hive
  Issue Type: Improvement
  Components: Build Infrastructure
 Environment: Mac OS X (10.6.2), Eclipse 3.5.1.R35, Checkstyle Plugin 
5.1.0.201002232103 (latest eclipse and checkstyle build as of 02/2010)
Reporter: Arvind Prabhakar
Priority: Minor


Currently, the checkstyle plugin reports all problems as errors. This causes an 
overwhelming number of errors to show up (3000+), which masks any real errors 
that might be there. Since the checkstyle violations are not all going to be 
fixed in one shot, it is desirable to lower their severity to warnings so that 
the plugin can be kept enabled. This will encourage developers to spot 
checkstyle violations in the files they touch and fix them as they go along, 
and the plugin will point out new violations as they code.




[jira] Commented: (HIVE-1032) Better Error Messages for Execution Errors

2010-02-24 Thread Zheng Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838156#action_12838156
 ] 

Zheng Shao commented on HIVE-1032:
--

That makes sense to me. As long as it compiles with 0.17, it should be OK.

Sorry, there is one last thing :) Can you run ant checkstyle and fix the 
checkstyle warnings introduced by this patch (especially in the new files)?





[jira] Commented: (HIVE-259) Add PERCENTILE aggregate function

2010-02-24 Thread Jerome Boulon (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838173#action_12838173
 ] 

Jerome Boulon commented on HIVE-259:


- From my point of view, changing variable access to private in the state 
object will not make the code more readable ...
- I'll change all variables to be lowerCase to match Java style; the current 
variable names are based on the Oracle definition.

@Zheng - I'm not using an ArrayList<Integer> but a String to avoid unnecessary 
object creation (for every single row) ... It would be even better if the 
constructor could have been used, but I haven't found out how to do that. If we 
care about 1 extra empty ArrayList per mapper/spill in memory, then we should 
also care about creating (1 ArrayList + 1 Integer object per percentile) per row.

@Zheng - Regarding the test case, that's what I had in mind when I asked you 
how to create my own table, and that's exactly the reason why I posted the 
Jb2.* files.


 Add PERCENTILE aggregate function
 -

 Key: HIVE-259
 URL: https://issues.apache.org/jira/browse/HIVE-259
 Project: Hadoop Hive
  Issue Type: New Feature
  Components: Query Processor
Reporter: Venky Iyer
Assignee: Jerome Boulon
 Attachments: HIVE-259-2.patch, HIVE-259.1.patch, HIVE-259.patch, 
 jb2.txt, Percentile.xlsx


 Compute at least the 25th, 50th, and 75th percentiles
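
 As a worked illustration of what such a function returns, here is a minimal 
 sketch of an exact percentile over an in-memory array. It assumes linear 
 interpolation between the two nearest ranks; that rule, the class name, and 
 the method signature are assumptions made for the example and do not 
 describe the attached patches.

```java
import java.util.Arrays;

// Hypothetical sketch: exact percentile with linear interpolation (assumed rule).
final class PercentileSketch {
  // Returns the p-th percentile of values, with p given as a fraction in [0, 1].
  static double percentile(double[] values, double p) {
    if (values.length == 0) {
      throw new IllegalArgumentException("values must not be empty");
    }
    if (p < 0.0 || p > 1.0) {
      throw new IllegalArgumentException("p must be in [0, 1]");
    }
    double[] sorted = values.clone(); // leave the caller's array untouched
    Arrays.sort(sorted);
    double pos = p * (sorted.length - 1); // fractional 0-based rank
    int lo = (int) Math.floor(pos);
    int hi = (int) Math.ceil(pos);
    double frac = pos - lo;
    // Interpolate between the two neighbouring ranks (lo == hi when pos is whole).
    return sorted[lo] + frac * (sorted[hi] - sorted[lo]);
  }
}
```

 For the values 1, 2, 3, 4, 5 this yields 2.0, 3.0, and 4.0 for the 25th, 
 50th, and 75th percentiles respectively.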




[jira] Commented: (HIVE-1137) build references IVY_HOME incorrectly

2010-02-24 Thread John Sichi (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838175#action_12838175
 ] 

John Sichi commented on HIVE-1137:
--

+1


 build references IVY_HOME incorrectly
 -

 Key: HIVE-1137
 URL: https://issues.apache.org/jira/browse/HIVE-1137
 Project: Hadoop Hive
  Issue Type: Bug
  Components: Build Infrastructure
Affects Versions: 0.6.0
Reporter: John Sichi
Assignee: Carl Steinbach
 Fix For: 0.6.0

 Attachments: HIVE-1137.patch


 The build references env.IVY_HOME, but doesn't actually import env as it 
 should (via <property environment="env"/>).
 It's not clear what the IVY_HOME reference is for since the build doesn't 
 even use ivy.home (instead, it installs under the build/ivy directory).
 It looks like someone copied bits and pieces from the "Automatically" section 
 here:
 http://ant.apache.org/ivy/history/latest-milestone/install.html




[jira] Commented: (HIVE-990) Incorporate CheckStyle into Hive's build.xml

2010-02-24 Thread Paul Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838177#action_12838177
 ] 

Paul Yang commented on HIVE-990:


By default, the VisibilityModifier check flags protected variables 
(http://checkstyle.sf.net/config_design.html). Is the use of 'protected' 
discouraged? If so, what's the reason?

 Incorporate CheckStyle into Hive's build.xml
 

 Key: HIVE-990
 URL: https://issues.apache.org/jira/browse/HIVE-990
 Project: Hadoop Hive
  Issue Type: Improvement
  Components: Build Infrastructure
Reporter: Carl Steinbach
Assignee: Carl Steinbach
 Fix For: 0.6.0

 Attachments: checkstyle-errors.html, HIVE-990.patch


 Hadoop and Pig both have CheckStyle integrated into their build. This is 
 useful for catching
 a variety of errors as well as for enforcing a specific coding style and 
 maintaining good code hygiene.
 We just need to snatch Hadoop's checkstyle.xml and integrate it into Hive's 
 build.xml file.




[jira] Resolved: (HIVE-1195) Increase ObjectInspector[] length on demand

2010-02-24 Thread Ning Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ning Zhang resolved HIVE-1195.
--

Resolution: Fixed

Committed to 0.5.1 and trunk. Thanks Zheng!

 Increase ObjectInspector[] length on demand
 ---

 Key: HIVE-1195
 URL: https://issues.apache.org/jira/browse/HIVE-1195
 Project: Hadoop Hive
  Issue Type: Improvement
Affects Versions: 0.5.0, 0.6.0
Reporter: Zheng Shao
Assignee: Zheng Shao
 Fix For: 0.5.1, 0.6.0

 Attachments: HIVE-1195-branch-0.5.patch, HIVE-1195.1.patch, 
 HIVE-1195.2.branch-0.5.patch, HIVE-1195.2.patch


 {code}
 Operator.java
   protected transient ObjectInspector[] inputObjInspectors = new ObjectInspector[Short.MAX_VALUE];
 {code}
 An array of 32K references takes 256 KB of memory under 64-bit Java (at 8 
 bytes per reference).
 We are seeing the Hive client run out of memory because of that.
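
 One way to grow the array on demand can be sketched as follows. This is a 
 minimal illustration, not the committed patch: GrowableSlots is a 
 hypothetical stand-in for the ObjectInspector[] field, and the doubling 
 policy is an assumption.

```java
import java.util.Arrays;

// Hypothetical stand-in for an array field that is grown only when needed.
final class GrowableSlots<T> {
  // Start with a single slot instead of preallocating Short.MAX_VALUE entries.
  private Object[] slots = new Object[1];

  // Stores value at index, enlarging the backing array if index is out of range.
  void set(int index, T value) {
    if (index >= slots.length) {
      // Assumed policy: at least double, and always large enough for index.
      int newLength = Math.max(index + 1, slots.length * 2);
      slots = Arrays.copyOf(slots, newLength);
    }
    slots[index] = value;
  }

  // Returns the stored value, or null if nothing was ever stored at index.
  @SuppressWarnings("unchecked")
  T get(int index) {
    return index < slots.length ? (T) slots[index] : null;
  }

  int capacity() {
    return slots.length;
  }
}
```

 With this approach an operator that only ever uses a handful of slots pays 
 for a handful of references rather than for 32K of them.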




[jira] Commented: (HIVE-1194) sorted merge join

2010-02-24 Thread He Yongqiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838191#action_12838191
 ] 

He Yongqiang commented on HIVE-1194:


Thanks Zheng. Yes, we should do that.

 sorted merge join
 -

 Key: HIVE-1194
 URL: https://issues.apache.org/jira/browse/HIVE-1194
 Project: Hadoop Hive
  Issue Type: New Feature
  Components: Query Processor
Reporter: Namit Jain
Assignee: He Yongqiang
 Fix For: 0.6.0


 If the input tables are sorted on the join key and a mapjoin is being 
 performed, it is useful to exploit the sorted properties of the tables.
 This can lead to substantial CPU savings. It also needs to work across 
 bucketed map joins.
 Since the sorted properties of a table are not currently enforced, a new 
 parameter can be added to request the sort-merge join.
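
 The general sort-merge technique can be sketched as follows: because both 
 inputs are already sorted on the join key, one forward pass over each side 
 finds all matches without building a hash table. This is a minimal 
 illustration over in-memory arrays; the class, the method signature, and the 
 output format are hypothetical and do not describe Hive's operators.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of an inner sort-merge join over two key-sorted inputs.
final class SortMergeJoinSketch {
  // leftKeys and rightKeys must be sorted ascending; values are parallel arrays.
  static List<String> join(int[] leftKeys, String[] leftValues,
                           int[] rightKeys, String[] rightValues) {
    List<String> out = new ArrayList<String>();
    int i = 0;
    int j = 0;
    while (i < leftKeys.length && j < rightKeys.length) {
      if (leftKeys[i] < rightKeys[j]) {
        i++; // left key has no partner yet, advance left
      } else if (leftKeys[i] > rightKeys[j]) {
        j++; // right key has no partner yet, advance right
      } else {
        int key = leftKeys[i];
        // Find the run of equal keys on each side.
        int iEnd = i;
        while (iEnd < leftKeys.length && leftKeys[iEnd] == key) {
          iEnd++;
        }
        int jEnd = j;
        while (jEnd < rightKeys.length && rightKeys[jEnd] == key) {
          jEnd++;
        }
        // Emit the cross product of the two runs.
        for (int a = i; a < iEnd; a++) {
          for (int b = j; b < jEnd; b++) {
            out.add(key + ":" + leftValues[a] + "," + rightValues[b]);
          }
        }
        i = iEnd;
        j = jEnd;
      }
    }
    return out;
  }
}
```

 If an input is not actually sorted, this pass silently misses matches, which 
 is why an explicit opt-in parameter is a sensible safeguard while sortedness 
 is not enforced.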




[jira] Commented: (HIVE-990) Incorporate CheckStyle into Hive's build.xml

2010-02-24 Thread Carl Steinbach (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838193#action_12838193
 ] 

Carl Steinbach commented on HIVE-990:
-

Quoting from http://g.oswego.edu/dl/html/javaCodingStd.html:

??Minimize direct internal access to instance variables inside methods. Use 
protected access and update methods instead (or sometimes public ones if they 
exist anyway).??

??Rationale: While inconvenient and sometimes overkill, this allows you to vary 
synchronization and notification policies associated with variable access and 
change in the class and/or its subclasses, which is otherwise a serious 
impediment to extensibility in concurrent OO programming.??

This advice is just as applicable in single-threaded situations. Declaring 
instance variables as protected allows subclasses and classes within the same 
package to become tightly coupled to the specifics of your class's 
implementation. This violates the whole point of encapsulation.
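
A small sketch of the difference, using hypothetical classes that are not 
from the Hive code base: the field stays private, and subclasses go through 
protected access and update methods, so a subclass can add a notification 
policy without depending on how the value is stored.

```java
// Hypothetical classes for illustration only.
class Counter {
  private int count; // private: subclasses cannot depend on the representation

  protected int getCount() {
    return count;
  }

  // Single update point: validation lives here and subclasses may extend it.
  protected void setCount(int newCount) {
    if (newCount < 0) {
      throw new IllegalArgumentException("count must be non-negative");
    }
    count = newCount;
  }

  public void increment() {
    setCount(getCount() + 1);
  }

  public int value() {
    return getCount();
  }
}

class LoggingCounter extends Counter {
  private final StringBuilder log = new StringBuilder();

  // Adds a notification policy without touching the superclass field.
  @Override
  protected void setCount(int newCount) {
    log.append(getCount()).append("->").append(newCount).append(';');
    super.setCount(newCount);
  }

  String history() {
    return log.toString();
  }
}
```

Had count been a protected field, increment() could have written to it 
directly and LoggingCounter would have had no reliable place to observe the 
change.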

For other problems associated with protected instance variables, read this: 
http://java.sys-con.com/node/46344


