[ANNOUNCE] Hive 0.5.0 released
Hi folks, We have released Hive 0.5.0. It should be available from the download page within 24 hours (the release is still being mirrored): http://hadoop.apache.org/hive/releases.html#Download -- Yours, Zheng
Build failed in Hudson: Hive-trunk-h0.20 #198
See http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.20/198/ -- [...truncated 13323 lines...] [junit] OK [junit] Loading data to table srcbucket2 [junit] POSTHOOK: Output: defa...@srcbucket2 [junit] OK [junit] Loading data to table srcbucket2 [junit] POSTHOOK: Output: defa...@srcbucket2 [junit] OK [junit] Loading data to table srcbucket2 [junit] POSTHOOK: Output: defa...@srcbucket2 [junit] OK [junit] Loading data to table srcbucket2 [junit] POSTHOOK: Output: defa...@srcbucket2 [junit] OK [junit] Loading data to table src [junit] POSTHOOK: Output: defa...@src [junit] OK [junit] Loading data to table src1 [junit] POSTHOOK: Output: defa...@src1 [junit] OK [junit] Loading data to table src_sequencefile [junit] POSTHOOK: Output: defa...@src_sequencefile [junit] OK [junit] Loading data to table src_thrift [junit] POSTHOOK: Output: defa...@src_thrift [junit] OK [junit] Loading data to table src_json [junit] POSTHOOK: Output: defa...@src_json [junit] OK [junit] diff http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/build/ql/test/logs/negative/unknown_function4.q.out http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/ql/src/test/results/compiler/errors/unknown_function4.q.out [junit] Done query: unknown_function4.q [junit] Begin query: unknown_table1.q [junit] Loading data to table srcpart partition {ds=2008-04-08, hr=11} [junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-08/hr=11 [junit] OK [junit] Loading data to table srcpart partition {ds=2008-04-08, hr=12} [junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-08/hr=12 [junit] OK [junit] Loading data to table srcpart partition {ds=2008-04-09, hr=11} [junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-09/hr=11 [junit] OK [junit] Loading data to table srcpart partition {ds=2008-04-09, hr=12} [junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-09/hr=12 [junit] OK [junit] POSTHOOK: Output: defa...@srcbucket [junit] OK [junit] Loading data to table srcbucket [junit] POSTHOOK: 
Output: defa...@srcbucket [junit] OK [junit] Loading data to table srcbucket [junit] POSTHOOK: Output: defa...@srcbucket [junit] OK [junit] POSTHOOK: Output: defa...@srcbucket2 [junit] OK [junit] Loading data to table srcbucket2 [junit] POSTHOOK: Output: defa...@srcbucket2 [junit] OK [junit] Loading data to table srcbucket2 [junit] POSTHOOK: Output: defa...@srcbucket2 [junit] OK [junit] Loading data to table srcbucket2 [junit] POSTHOOK: Output: defa...@srcbucket2 [junit] OK [junit] Loading data to table srcbucket2 [junit] POSTHOOK: Output: defa...@srcbucket2 [junit] OK [junit] Loading data to table src [junit] POSTHOOK: Output: defa...@src [junit] OK [junit] Loading data to table src1 [junit] POSTHOOK: Output: defa...@src1 [junit] OK [junit] Loading data to table src_sequencefile [junit] POSTHOOK: Output: defa...@src_sequencefile [junit] OK [junit] Loading data to table src_thrift [junit] POSTHOOK: Output: defa...@src_thrift [junit] OK [junit] Loading data to table src_json [junit] POSTHOOK: Output: defa...@src_json [junit] OK [junit] diff http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/build/ql/test/logs/negative/unknown_table1.q.out http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/ql/src/test/results/compiler/errors/unknown_table1.q.out [junit] Done query: unknown_table1.q [junit] Begin query: unknown_table2.q [junit] Loading data to table srcpart partition {ds=2008-04-08, hr=11} [junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-08/hr=11 [junit] OK [junit] Loading data to table srcpart partition {ds=2008-04-08, hr=12} [junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-08/hr=12 [junit] OK [junit] Loading data to table srcpart partition {ds=2008-04-09, hr=11} [junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-09/hr=11 [junit] OK [junit] Loading data to table srcpart partition {ds=2008-04-09, hr=12} [junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-09/hr=12 [junit] OK [junit] POSTHOOK: Output: defa...@srcbucket [junit] 
OK [junit] Loading data to table srcbucket [junit] POSTHOOK: Output: defa...@srcbucket [junit] OK [junit] Loading data to table srcbucket [junit] POSTHOOK: Output: defa...@srcbucket [junit] OK [junit] POSTHOOK: Output: defa...@srcbucket2 [junit] OK [junit] Loading data to table srcbucket2 [junit] POSTHOOK: Output: defa...@srcbucket2 [junit] OK [junit] Loading data to table srcbucket2 [junit] POSTHOOK: Output: defa...@srcbucket2 [junit] OK [junit] Loading data to table srcbucket2 [junit] POSTHOOK:
[jira] Created: (HIVE-1193) ensure sorting properties for a table
ensure sorting properties for a table - Key: HIVE-1193 URL: https://issues.apache.org/jira/browse/HIVE-1193 Project: Hadoop Hive Issue Type: New Feature Components: Query Processor Reporter: Namit Jain Fix For: 0.6.0 If a table is sorted and data is being inserted into it, we currently don't make sure that the inserted data is sorted. Enforcing this might be useful for some downstream operations. It cannot be made the default because of backward compatibility, but an option can be added for it. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (HIVE-1194) sorted merge join
sorted merge join - Key: HIVE-1194 URL: https://issues.apache.org/jira/browse/HIVE-1194 Project: Hadoop Hive Issue Type: New Feature Components: Query Processor Reporter: Namit Jain Assignee: He Yongqiang Fix For: 0.6.0 If the input tables are sorted on the join key and a mapjoin is being performed, it is useful to exploit the sorted properties of the tables. This can lead to substantial CPU savings. It needs to work across bucketed map joins as well. Since the sorted properties of a table are not currently enforced, a new parameter can be added to request the sort-merge join.
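To illustrate why sorted inputs save CPU, here is a minimal sketch of the merge step of a sort-merge join over two arrays of join keys that are already sorted in ascending order. It is not Hive code: the class and method names are hypothetical, and real operators work on rows rather than bare integer keys. Each input is scanned once, with no hash table to build or probe.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: names are hypothetical and not part of Hive.
class SortedKeyMergeJoin {
    /**
     * Inner-joins two key arrays that are both sorted ascending.
     * Returns one {leftIndex, rightIndex} pair per matching combination.
     */
    static List<int[]> join(int[] left, int[] right) {
        List<int[]> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < left.length && j < right.length) {
            if (left[i] < right[j]) {
                i++;                       // left key too small, advance left
            } else if (left[i] > right[j]) {
                j++;                       // right key too small, advance right
            } else {
                int key = left[i];
                int runStart = j;
                int runEnd = j;
                // Find the run of equal keys on the right side.
                while (runEnd < right.length && right[runEnd] == key) {
                    runEnd++;
                }
                // Pair every left row with this key against the whole run.
                while (i < left.length && left[i] == key) {
                    for (int k = runStart; k < runEnd; k++) {
                        out.add(new int[] {i, k});
                    }
                    i++;
                }
                j = runEnd;
            }
        }
        return out;
    }
}
```

If either input is not actually sorted, this loop silently misses matches, which is why the issue ties the feature to an explicit parameter while sortedness is not enforced.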
[jira] Created: (HIVE-1195) Increase ObjectInspector[] length on demand
Increase ObjectInspector[] length on demand --- Key: HIVE-1195 URL: https://issues.apache.org/jira/browse/HIVE-1195 Project: Hadoop Hive Issue Type: Improvement Affects Versions: 0.5.0, 0.6.0 Reporter: Zheng Shao Assignee: Zheng Shao {code} Operator.java protected transient ObjectInspector[] inputObjInspectors = new ObjectInspector[Short.MAX_VALUE]; {code} An array of 32K elements takes about 256KB of memory under 64-bit Java, and each Operator instance allocates its own. We are seeing the Hive client run out of memory because of that.
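For reference, Short.MAX_VALUE is 32767, so assuming 8-byte references (JVMs that compress references use fewer) the array alone is 32767 x 8 = 262,136 bytes, roughly 256KB. A minimal sketch of the "grow on demand" idea follows. It is not the attached patch: the class name, the initial size, and the doubling policy are all assumptions made for illustration.

```java
import java.util.Arrays;

// Illustrative only: not the HIVE-1195 patch; names and growth policy are assumptions.
class GrowableSlots<T> {
    // Start small instead of pre-allocating Short.MAX_VALUE entries.
    private Object[] slots = new Object[1];

    /** Stores a value, enlarging the backing array only when the index requires it. */
    void set(int index, T value) {
        if (index >= slots.length) {
            // Grow to at least index + 1, doubling to keep repeated growth cheap.
            int newLength = Math.max(index + 1, slots.length * 2);
            slots = Arrays.copyOf(slots, newLength);
        }
        slots[index] = value;
    }

    /** Returns the stored value, or null if the slot was never set. */
    @SuppressWarnings("unchecked")
    T get(int index) {
        return index < slots.length ? (T) slots[index] : null;
    }

    int capacity() {
        return slots.length;
    }
}
```

With this shape, an operator that only ever sees one or two parents pays for one or two references rather than 32767.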
[jira] Updated: (HIVE-1195) Increase ObjectInspector[] length on demand
[ https://issues.apache.org/jira/browse/HIVE-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zheng Shao updated HIVE-1195: - Attachment: HIVE-1195.1.patch
[jira] Updated: (HIVE-1195) Increase ObjectInspector[] length on demand
[ https://issues.apache.org/jira/browse/HIVE-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zheng Shao updated HIVE-1195: - Fix Version/s: 0.6.0 0.5.1 Status: Patch Available (was: Open)
[jira] Commented: (HIVE-1195) Increase ObjectInspector[] length on demand
[ https://issues.apache.org/jira/browse/HIVE-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838080#action_12838080 ] Ning Zhang commented on HIVE-1195: -- +1 Will commit after tests.
[jira] Updated: (HIVE-1195) Increase ObjectInspector[] length on demand
[ https://issues.apache.org/jira/browse/HIVE-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ning Zhang updated HIVE-1195: - Attachment: HIVE-1195-branch-0.5.patch Uploading a patch for branch 0.5. Zheng, can you double check?
[jira] Commented: (HIVE-535) Memory-efficient hash-based Aggregation
[ https://issues.apache.org/jira/browse/HIVE-535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838095#action_12838095 ] Carl Steinbach commented on HIVE-535: - The folks working on Mahout seem to think the CERN license is compatible with Apache. They have already imported cern.colt*, cern.jet* and cern.clhep into their source tree. See MAHOUT-222. Check out the update to their LICENSE.txt file: http://svn.apache.org/repos/asf/lucene/mahout/trunk/LICENSE.txt Memory-efficient hash-based Aggregation --- Key: HIVE-535 URL: https://issues.apache.org/jira/browse/HIVE-535 Project: Hadoop Hive Issue Type: Improvement Affects Versions: 0.4.0 Reporter: Zheng Shao Currently there is a lot of memory overhead in the hash-based aggregation in GroupByOperator. The net result is that GroupByOperator cannot store many entries in its HashTable, flushes frequently, and cannot achieve a very good partial aggregation result. Here are some initial thoughts (some of them are from Joydeep a long time ago): A1. Serialize the key of the HashTable. This will eliminate the 16-byte per-object overhead of Java in keys (depending on how many objects there are in the key, the saving can be substantial). A2. Use more memory-efficient hash tables - java.util.HashMap has about 64 bytes of overhead per entry. A3. Use primitive arrays to store aggregation results. Basically, the UDAF should manage the array of aggregation results, so UDAFCount should manage a long[], and UDAFAvg should manage a double[] and a long[]. The external code should pass an index to iterate/merge/terminate an aggregation result. This will eliminate the 16-byte per-object overhead of Java. More ideas are welcome.
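Idea A3 above can be sketched roughly as follows. This is an illustration under assumptions, not the Hive UDAF interface: the class name, the allocate() method, and the initial capacity are hypothetical. The point is that an average keeps its state in one double[] and one long[], and callers address a group by an integer slot instead of holding a per-group object.

```java
import java.util.Arrays;

// Illustrative only: not the Hive UDAF API; all names are hypothetical.
class AvgSlots {
    private double[] sums = new double[16];
    private long[] counts = new long[16];
    private int used = 0;

    /** Reserves state for a new group and returns its slot index. */
    int allocate() {
        if (used == sums.length) {
            sums = Arrays.copyOf(sums, used * 2);
            counts = Arrays.copyOf(counts, used * 2);
        }
        return used++;
    }

    /** Folds one input value into the group's running state. */
    void iterate(int slot, double value) {
        sums[slot] += value;
        counts[slot]++;
    }

    /** Folds a partial aggregation (for example from another mapper) into the group. */
    void merge(int slot, double partialSum, long partialCount) {
        sums[slot] += partialSum;
        counts[slot] += partialCount;
    }

    /** Returns the final average, or null if the group saw no rows. */
    Double terminate(int slot) {
        return counts[slot] == 0 ? null : sums[slot] / counts[slot];
    }
}
```

The hash table would then map a (possibly serialized, per A1) key to an int slot, so the per-entry cost is one index rather than one aggregation object per function.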
[jira] Created: (HIVE-1196) Railroad Diagrams for Hive Language Manual
Railroad Diagrams for Hive Language Manual -- Key: HIVE-1196 URL: https://issues.apache.org/jira/browse/HIVE-1196 Project: Hadoop Hive Issue Type: Task Components: Documentation Reporter: Carl Steinbach Priority: Minor Add railroad diagrams (syntax diagrams) to the Hive Language Manual. * The [ANTLRWorks IDE|http://www.antlr.org/works/index.html] generates railroad diagrams and allows you to export them as EPS. * [Clapham|http://sourceforge.net/projects/clapham/] is another tool for generating railroad diagrams based on BNF style inputs.
[jira] Updated: (HIVE-1195) Increase ObjectInspector[] length on demand
[ https://issues.apache.org/jira/browse/HIVE-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zheng Shao updated HIVE-1195: - Attachment: HIVE-1195.2.patch HIVE-1195.2.branch-0.5.patch Fixed an obvious bug which caused unit test failures.
[jira] Commented: (HIVE-1195) Increase ObjectInspector[] length on demand
[ https://issues.apache.org/jira/browse/HIVE-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838112#action_12838112 ] Ning Zhang commented on HIVE-1195: -- Zheng, join26.q, join_map_ppr.q, union16.q, and union9.q failed on trunk. Can you take a look?
[jira] Updated: (HIVE-1195) Increase ObjectInspector[] length on demand
[ https://issues.apache.org/jira/browse/HIVE-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ning Zhang updated HIVE-1195: - Status: Open (was: Patch Available)
[jira] Commented: (HIVE-1194) sorted merge join
[ https://issues.apache.org/jira/browse/HIVE-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838113#action_12838113 ] Namit Jain commented on HIVE-1194: -- Based on an offline discussion with Yongqiang, we were thinking of the following: There will be a new mapping in MapredWork, from Operator to MapredLocalWork. This will be populated for SortMergeJoinOperator only. SortMergeJoinOperator is a new operator which extends MapJoinOperator, and has the same name as a MapJoinOperator. MapJoinProcessor needs to create a SortMergeJoinOperator instead of a MapJoinOperator when it sees the new configuration parameter. MapJoinFactory methods need to change to create the Operator-to-MapredLocalWork mapping instead of MapredLocalWork in MapredWork.
[jira] Commented: (HIVE-1195) Increase ObjectInspector[] length on demand
[ https://issues.apache.org/jira/browse/HIVE-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838114#action_12838114 ] Ning Zhang commented on HIVE-1195: -- Cool. I'll take the new patches to test.
[jira] Created: (HIVE-1197) create a new input format where a mapper spans a file
create a new input format where a mapper spans a file - Key: HIVE-1197 URL: https://issues.apache.org/jira/browse/HIVE-1197 Project: Hadoop Hive Issue Type: New Feature Components: Query Processor Reporter: Namit Jain Assignee: Namit Jain Fix For: 0.6.0 This will be needed for Sort merge joins.
[jira] Commented: (HIVE-259) Add PERCENTILE aggregate function
[ https://issues.apache.org/jira/browse/HIVE-259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838118#action_12838118 ] Zheng Shao commented on HIVE-259: - Also see http://wiki.apache.org/hadoop/Hive/HowToContribute#Coding_Convention Add PERCENTILE aggregate function - Key: HIVE-259 URL: https://issues.apache.org/jira/browse/HIVE-259 Project: Hadoop Hive Issue Type: New Feature Components: Query Processor Reporter: Venky Iyer Assignee: Jerome Boulon Attachments: HIVE-259-2.patch, HIVE-259.1.patch, HIVE-259.patch, jb2.txt, Percentile.xlsx Compute at least the 25th, 50th, and 75th percentiles.
[jira] Commented: (HIVE-259) Add PERCENTILE aggregate function
[ https://issues.apache.org/jira/browse/HIVE-259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838119#action_12838119 ] Zheng Shao commented on HIVE-259: - The test cases look a bit too trivial, or do the results have problems? They always return the same number for the 3 different percentile values.
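As a point of comparison for test data, here is a small sketch of a percentile computed with the nearest-rank definition. This definition is an assumption chosen for simplicity; the patch under review follows Oracle's definition and may interpolate differently. The class and method names are hypothetical. On an input such as 1 through 8, the 25th, 50th, and 75th percentiles come out as three different values, which is the kind of dataset a non-trivial test needs.

```java
import java.util.Arrays;

// Illustrative only: nearest-rank percentile, not the HIVE-259 implementation.
class NearestRankPercentile {
    /** Returns the smallest value such that at least p percent of the data is at or below it. */
    static long percentile(long[] values, int p) {
        if (values.length == 0 || p < 1 || p > 100) {
            throw new IllegalArgumentException("need non-empty input and 1 <= p <= 100");
        }
        long[] sorted = values.clone();
        Arrays.sort(sorted);
        // 1-based rank = ceil(p / 100 * n), computed in integer arithmetic.
        int rank = (int) (((long) p * sorted.length + 99) / 100);
        return sorted[rank - 1];
    }
}
```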
[jira] Commented: (HIVE-1194) sorted merge join
[ https://issues.apache.org/jira/browse/HIVE-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838120#action_12838120 ] Zheng Shao commented on HIVE-1194: -- Why does SortMergeJoinOperator extend MapJoinOperator? It seems to me that SortMergeJoinOperator does NOT need the in-memory/disk-backed HashMap that MapJoinOperator has, correct?
[jira] Commented: (HIVE-1194) sorted merge join
[ https://issues.apache.org/jira/browse/HIVE-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838121#action_12838121 ] Namit Jain commented on HIVE-1194: -- Yes, but it happens on the mapper. It is a special type of mapjoin. It will end up overriding all the functions of map-join, but keeping it this way keeps the hierarchy correct.
[jira] Commented: (HIVE-1194) sorted merge join
[ https://issues.apache.org/jira/browse/HIVE-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838122#action_12838122 ] He Yongqiang commented on HIVE-1194: Yes. It does not need that storage. The main reason for letting it extend mapjoinop is that we can then reuse the mapjoinop code for optimization and task generation.
[jira] Commented: (HIVE-1194) sorted merge join
[ https://issues.apache.org/jira/browse/HIVE-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838130#action_12838130 ] Namit Jain commented on HIVE-1194: -- A new optimization step will be created which will convert the mapjoin to a sortmergejoin.
[jira] Commented: (HIVE-1194) sorted merge join
[ https://issues.apache.org/jira/browse/HIVE-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838132#action_12838132 ] Zheng Shao commented on HIVE-1194: -- If it does not inherit any methods, shall we add an AbstractMapJoinOperator as the common parent? That AbstractMapJoinOperator can be converted to MapJoinOperator (or HashBasedMapJoinOperator, to be accurate) or SortMergeJoinOperator depending on the configuration/table properties.
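The common-parent proposal can be pictured with a small sketch. None of these classes exist in Hive under these names; they are hypothetical stand-ins meant only to show the shape: shared map-side behaviour lives in the abstract parent, and a flag (standing in for the configuration or table properties) selects the concrete variant.

```java
// Illustrative only: hypothetical stand-ins for the proposed operator hierarchy.
abstract class AbstractMapJoinVariant {
    /** Shared by both variants: the join runs on the map side. */
    boolean runsOnMapper() {
        return true;
    }

    /** Each variant supplies its own join strategy. */
    abstract String strategy();

    /** Picks the concrete variant; the flag stands in for the new configuration parameter. */
    static AbstractMapJoinVariant create(boolean useSortMerge) {
        return useSortMerge ? new SortMergeVariant() : new HashBasedVariant();
    }
}

class HashBasedVariant extends AbstractMapJoinVariant {
    @Override
    String strategy() {
        return "hash"; // needs the in-memory or disk-backed hash table
    }
}

class SortMergeVariant extends AbstractMapJoinVariant {
    @Override
    String strategy() {
        return "sort-merge"; // relies on sorted inputs, no hash table
    }
}
```

Compared with having the sort-merge operator extend the hash-based one, this keeps optimizer code written against the parent type reusable while the sort-merge variant carries no unused hash-table state.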
[jira] Commented: (HIVE-1189) Add package-info.java to Hive
[ https://issues.apache.org/jira/browse/HIVE-1189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838148#action_12838148 ] Zheng Shao commented on HIVE-1189: -- I am checking the BuildVersion which contains everything. I need to think of a way to do a negative test. Add package-info.java to Hive - Key: HIVE-1189 URL: https://issues.apache.org/jira/browse/HIVE-1189 Project: Hadoop Hive Issue Type: New Feature Affects Versions: 0.6.0 Reporter: Zheng Shao Assignee: Zheng Shao Fix For: 0.6.0 Attachments: HIVE-1189.1.patch Hadoop automatically generates build/src/org/apache/hadoop/package-info.java with information like this: {code} /* * Generated by src/saveVersion.sh */ @HadoopVersionAnnotation(version="0.20.2-dev", revision="826568", user="zshao", date="Sun Oct 18 17:46:56 PDT 2009", url="http://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20") package org.apache.hadoop; {code} Hive should do the same thing so that we can easily know the version of the code at runtime. This will help us identify whether we are still running the same version of Hive, if we serialize the plan and later continue the execution (See HIVE-1100).
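To show how such version information can be read back at runtime, here is a minimal sketch using a runtime-retained annotation. It is a simplification under assumptions: the annotation and class names are hypothetical, the version values are placeholders, and the annotation is placed on a class so that the example fits in one file. The Hadoop approach annotates the package in package-info.java and would read it through the package rather than the class.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Illustrative only: hypothetical annotation, not Hadoop's or Hive's.
@Retention(RetentionPolicy.RUNTIME)
@Target({ElementType.TYPE, ElementType.PACKAGE})
@interface BuildVersionSketch {
    String version();
    String revision();
}

// Placeholder values; a build script would normally generate them.
@BuildVersionSketch(version = "0.6.0-dev", revision = "12345")
class BuildInfoHolder {
    /** Reads the annotation via reflection and formats it for logging or comparison. */
    static String describe() {
        BuildVersionSketch info = BuildInfoHolder.class.getAnnotation(BuildVersionSketch.class);
        return info == null ? "unknown" : info.version() + " (r" + info.revision() + ")";
    }
}
```

A serialized plan could store this string and compare it on resume, which is the use case the issue links to HIVE-1100.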
[jira] Commented: (HIVE-1032) Better Error Messages for Execution Errors
[ https://issues.apache.org/jira/browse/HIVE-1032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838149#action_12838149 ] Paul Yang commented on HIVE-1032: - Because this patch uses features of HIVE-873, this will not work with hadoop 0.17. If you want, I can send you the broken queries I used to test on 0.20. Better Error Messages for Execution Errors -- Key: HIVE-1032 URL: https://issues.apache.org/jira/browse/HIVE-1032 Project: Hadoop Hive Issue Type: New Feature Components: Query Processor Affects Versions: 0.6.0 Reporter: Paul Yang Assignee: Paul Yang Attachments: HIVE-1032.1.patch, HIVE-1032.2.patch, HIVE-1032.3.patch, HIVE-1032.4.patch, HIVE-1032.5.patch Three common errors that occur during execution are: 1. Map-side group-by causing an out of memory exception due to large aggregation hash tables 2. ScriptOperator failing due to the user's script throwing an exception or otherwise returning a non-zero error code 3. Incorrectly specifying the join order of small and large tables, causing the large table to be loaded into memory and producing an out of memory exception. These errors are typically discovered by manually examining the error log files of the failed task. This task proposes to create a feature that would automatically read the error logs and output a probable cause and solution to the command line.
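The proposed feature can be pictured as a table of log patterns paired with probable causes. The sketch below is not the attached patch: the class name, the hint wording, and the second pattern are invented for illustration, and real task logs may phrase these failures differently. Only java.lang.OutOfMemoryError is a name taken from the JDK.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Pattern;

// Illustrative only: hypothetical heuristics, not the HIVE-1032 implementation.
class ErrorHintSketch {
    // Insertion order decides which hint wins when several patterns match.
    private static final Map<Pattern, String> HINTS = new LinkedHashMap<>();

    static {
        HINTS.put(Pattern.compile("java\\.lang\\.OutOfMemoryError"),
                "Possible cause: out of memory, e.g. a large map-side aggregation hash table "
                + "or a large table loaded into memory by a map join.");
        // Assumed log wording for a failing user script.
        HINTS.put(Pattern.compile("script exited with (non-zero )?code"),
                "Possible cause: the user script failed or returned a non-zero exit code.");
    }

    /** Returns the hint for the first log line that matches a known pattern, if any. */
    static Optional<String> diagnose(List<String> logLines) {
        for (String line : logLines) {
            for (Map.Entry<Pattern, String> hint : HINTS.entrySet()) {
                if (hint.getKey().matcher(line).find()) {
                    return Optional.of(hint.getValue());
                }
            }
        }
        return Optional.empty();
    }
}
```

Keeping the patterns in a data table rather than in code paths makes it cheap to add a heuristic whenever a new recurring failure is identified.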
[jira] Created: (HIVE-1198) When checkstyle is activated for Hive in Eclipse environment, it shows all checkstyle problems as errors.
When checkstyle is activated for Hive in Eclipse environment, it shows all checkstyle problems as errors. - Key: HIVE-1198 URL: https://issues.apache.org/jira/browse/HIVE-1198 Project: Hadoop Hive Issue Type: Improvement Components: Build Infrastructure Environment: Mac OS X (10.6.2), Eclipse 3.5.1.R35, Checkstyle Plugin 5.1.0.201002232103 (latest eclipse and checkstyle build as of 02/2010) Reporter: Arvind Prabhakar Priority: Minor As of now, checkstyle plugin reports all problems as errors. This causes an overwhelming number of errors to show up (3000+) which masks real errors that might be there. Since all the checkstyle violations are not going to be fixed in one shot, it is desirable to lower the severity of checkstyle violations to warnings so that the plugin can be kept enabled. This will encourage developers to spot checkstyle violations in the files they touch and potentially fix them as they go along, along with pointing out violations as they code.
[jira] Commented: (HIVE-1032) Better Error Messages for Execution Errors
[ https://issues.apache.org/jira/browse/HIVE-1032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838156#action_12838156 ] Zheng Shao commented on HIVE-1032: -- That makes sense to me. As long as it's compilable with 0.17 it should be OK. Sorry, there is one last thing :) Can you run ant checkstyle and fix the checkstyle warnings introduced by this patch (especially in the new files)?
[jira] Commented: (HIVE-259) Add PERCENTILE aggregate function
[ https://issues.apache.org/jira/browse/HIVE-259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838173#action_12838173 ]

Jerome Boulon commented on HIVE-259:

- From my point of view, changing variable access to private in the state object will not make the code more readable ...
- I'll change all variables to lowerCase to match Java style; the current variable names are based on the Oracle definition.

@Zheng - I'm not using an ArrayList<Integer> but a String to avoid unnecessary object creation (for every single row) ... it would be even better if the constructor could have been used, but I haven't found how to do that. If we care about 1 extra empty ArrayList per mapper/spill in memory, then we should care about creating (1 ArrayList + 1 Integer object per percentile) per row.

@Zheng - Regarding the test case, that's what I had in mind when I asked you how to create my own table, and that's exactly the reason why I posted the Jb2.* files.

Add PERCENTILE aggregate function

Key: HIVE-259
URL: https://issues.apache.org/jira/browse/HIVE-259
Project: Hadoop Hive
Issue Type: New Feature
Components: Query Processor
Reporter: Venky Iyer
Assignee: Jerome Boulon
Attachments: HIVE-259-2.patch, HIVE-259.1.patch, HIVE-259.patch, jb2.txt, Percentile.xlsx

Compute at least the 25th, 50th, and 75th percentiles
[jira] Commented: (HIVE-1137) build references IVY_HOME incorrectly
[ https://issues.apache.org/jira/browse/HIVE-1137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838175#action_12838175 ]

John Sichi commented on HIVE-1137:

+1

build references IVY_HOME incorrectly

Key: HIVE-1137
URL: https://issues.apache.org/jira/browse/HIVE-1137
Project: Hadoop Hive
Issue Type: Bug
Components: Build Infrastructure
Affects Versions: 0.6.0
Reporter: John Sichi
Assignee: Carl Steinbach
Fix For: 0.6.0
Attachments: HIVE-1137.patch

The build references env.IVY_HOME, but doesn't actually import env as it should (via <property environment="env"/>). It's not clear what the IVY_HOME reference is for since the build doesn't even use ivy.home (instead, it installs under the build/ivy directory). It looks like someone copied bits and pieces from the "Automatically" section here:
http://ant.apache.org/ivy/history/latest-milestone/install.html
[jira] Commented: (HIVE-990) Incorporate CheckStyle into Hive's build.xml
[ https://issues.apache.org/jira/browse/HIVE-990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838177#action_12838177 ]

Paul Yang commented on HIVE-990:

By default, the VisibilityModifier check catches protected variables (http://checkstyle.sf.net/config_design.html). Is the use of 'protected' discouraged? If so, what's the reason?

Incorporate CheckStyle into Hive's build.xml

Key: HIVE-990
URL: https://issues.apache.org/jira/browse/HIVE-990
Project: Hadoop Hive
Issue Type: Improvement
Components: Build Infrastructure
Reporter: Carl Steinbach
Assignee: Carl Steinbach
Fix For: 0.6.0
Attachments: checkstyle-errors.html, HIVE-990.patch

Hadoop and Pig both have CheckStyle integrated into their build. This is useful for catching a variety of errors as well as for enforcing a specific coding style and maintaining good code hygiene. We just need to snatch Hadoop's checkstyle.xml and integrate it into Hive's build.xml file.
[jira] Resolved: (HIVE-1195) Increase ObjectInspector[] length on demand
[ https://issues.apache.org/jira/browse/HIVE-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ning Zhang resolved HIVE-1195.

Resolution: Fixed

Committed to 0.5.1 and trunk. Thanks Zheng!

Increase ObjectInspector[] length on demand

Key: HIVE-1195
URL: https://issues.apache.org/jira/browse/HIVE-1195
Project: Hadoop Hive
Issue Type: Improvement
Affects Versions: 0.5.0, 0.6.0
Reporter: Zheng Shao
Assignee: Zheng Shao
Fix For: 0.5.1, 0.6.0
Attachments: HIVE-1195-branch-0.5.patch, HIVE-1195.1.patch, HIVE-1195.2.branch-0.5.patch, HIVE-1195.2.patch

{code}
Operator.java
  protected transient ObjectInspector[] inputObjInspectors = new ObjectInspector[Short.MAX_VALUE];
{code}

An array of 32K elements takes 256KB of memory under 64-bit Java. We are seeing the Hive client run out of memory because of that.
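For readers unfamiliar with the "grow on demand" idea in the HIVE-1195 title, here is a minimal generic sketch. It is not the committed patch: GrowableSlots and its methods are hypothetical names, and a plain Object[] stands in for ObjectInspector[]. It starts with a single slot and enlarges the backing array only when a higher index is written.

```java
import java.util.Arrays;

// Hypothetical stand-in for an array that is sized lazily instead of up front.
class GrowableSlots<T> {
  // Start small rather than reserving Short.MAX_VALUE slots per instance.
  private Object[] slots = new Object[1];

  // Stores value at a non-negative index, enlarging the array if needed.
  void set(int index, T value) {
    if (index >= slots.length) {
      // Grow to at least index + 1; doubling keeps repeated growth cheap.
      int newLength = Math.max(index + 1, slots.length * 2);
      slots = Arrays.copyOf(slots, newLength);
    }
    slots[index] = value;
  }

  // Returns the stored value, or null for slots never written or beyond capacity.
  @SuppressWarnings("unchecked")
  T get(int index) {
    return index < slots.length ? (T) slots[index] : null;
  }

  int capacity() {
    return slots.length;
  }
}
```

With this shape, an instance that only ever uses a handful of inputs holds a handful of references, at the cost of an occasional array copy when a larger index first appears.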
[jira] Commented: (HIVE-1194) sorted merge join
[ https://issues.apache.org/jira/browse/HIVE-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838191#action_12838191 ]

He Yongqiang commented on HIVE-1194:

Thanks Zheng. Yes, we should do that.

sorted merge join

Key: HIVE-1194
URL: https://issues.apache.org/jira/browse/HIVE-1194
Project: Hadoop Hive
Issue Type: New Feature
Components: Query Processor
Reporter: Namit Jain
Assignee: He Yongqiang
Fix For: 0.6.0

If the input tables are sorted on the join key, and a mapjoin is being performed, it is useful to exploit the sorted properties of the tables. This can lead to substantial CPU savings. It also needs to work across bucketed map joins. Since the sorted properties of a table are not currently enforced, a new parameter can be added to request the sort-merge join.
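As background for the HIVE-1194 description, a textbook sort-merge inner join over two inputs already sorted on the join key can be sketched as follows. This is a generic in-memory illustration, not Hive's operator code: SortMergeJoinSketch, Row, and the output string format are hypothetical. Each input is scanned once, which is where the CPU savings over building and probing a hash table come from.

```java
import java.util.ArrayList;
import java.util.List;

// Generic sort-merge inner join; both inputs must be sorted ascending by key.
class SortMergeJoinSketch {
  static final class Row {
    final int key;
    final String value;

    Row(int key, String value) {
      this.key = key;
      this.value = value;
    }
  }

  static List<String> join(List<Row> left, List<Row> right) {
    List<String> out = new ArrayList<>();
    int i = 0;
    int j = 0;
    while (i < left.size() && j < right.size()) {
      int lk = left.get(i).key;
      int rk = right.get(j).key;
      if (lk < rk) {
        i++; // left key has no partner yet, advance left
      } else if (lk > rk) {
        j++; // right key has no partner yet, advance right
      } else {
        // Find the end of the run of equal keys on the right side.
        int jEnd = j;
        while (jEnd < right.size() && right.get(jEnd).key == lk) {
          jEnd++;
        }
        // Pair every left row in the run with every right row in the run.
        while (i < left.size() && left.get(i).key == lk) {
          for (int k = j; k < jEnd; k++) {
            out.add(lk + ":" + left.get(i).value + "," + right.get(k).value);
          }
          i++;
        }
        j = jEnd;
      }
    }
    return out;
  }
}
```

The sketch silently produces wrong results if either input is not actually sorted, which is the concern behind making the feature opt-in through a parameter while sortedness is not enforced.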
[jira] Commented: (HIVE-990) Incorporate CheckStyle into Hive's build.xml
[ https://issues.apache.org/jira/browse/HIVE-990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838193#action_12838193 ]

Carl Steinbach commented on HIVE-990:

Quoting from http://g.oswego.edu/dl/html/javaCodingStd.html:

??Minimize direct internal access to instance variables inside methods. Use protected access and update methods instead (or sometimes public ones if they exist anyway).??

??Rationale: While inconvenient and sometimes overkill, this allows you to vary synchronization and notification policies associated with variable access and change in the class and/or its subclasses, which is otherwise a serious impediment to extensibility in concurrent OO programming.??

This advice is just as applicable in single-threaded situations. Declaring instance variables as protected allows subclasses and classes within the same package to become tightly coupled to the specifics of your class's implementation. This violates the whole point of encapsulation.

For other problems associated with protected instance variables, read this: http://java.sys-con.com/node/46344

Incorporate CheckStyle into Hive's build.xml

Key: HIVE-990
URL: https://issues.apache.org/jira/browse/HIVE-990
Project: Hadoop Hive
Issue Type: Improvement
Components: Build Infrastructure
Reporter: Carl Steinbach
Assignee: Carl Steinbach
Fix For: 0.6.0
Attachments: checkstyle-errors.html, HIVE-990.patch

Hadoop and Pig both have CheckStyle integrated into their build. This is useful for catching a variety of errors as well as for enforcing a specific coding style and maintaining good code hygiene. We just need to snatch Hadoop's checkstyle.xml and integrate it into Hive's build.xml file.
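To make the quoted guideline concrete, here is a small sketch of a private field exposed through protected access and update methods. Counter and LoggingCounter are hypothetical classes written for this illustration, not Hive code. Because the field itself is private, the subclass can change the update policy (here, counting writes) only by overriding the update method.

```java
// The field is private; subclasses go through protected access/update methods.
class Counter {
  private int count;

  protected synchronized int getCount() {
    return count;
  }

  protected synchronized void setCount(int newCount) {
    count = newCount;
  }

  // Internal code also uses the accessors, so overrides are always honored.
  public synchronized void increment() {
    setCount(getCount() + 1);
  }

  public synchronized int value() {
    return getCount();
  }
}

// A subclass varies the notification policy without touching the field.
class LoggingCounter extends Counter {
  int updates; // number of times the counter was written

  @Override
  protected synchronized void setCount(int newCount) {
    updates++;
    super.setCount(newCount);
  }
}
```

Had count been declared protected, LoggingCounter (or any class in the same package) could assign to it directly and bypass both the synchronization and the update bookkeeping.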