[jira] Commented: (HIVE-1750) Remove Partition Filtering Conditions when Possible
[ https://issues.apache.org/jira/browse/HIVE-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12927312#action_12927312 ] Namit Jain commented on HIVE-1750: -- OpProcFactory: for (Partition p: prunedPartList.getConfirmedPartns()) { if (!p.getTable().isPartitioned()) { return null; } } for (Partition p: prunedPartList.getUnknownPartns()) { if (!p.getTable().isPartitioned()) { return null; } } Why are the above changes needed ? The overall approach looks good - still looking in detail. Remove Partition Filtering Conditions when Possible --- Key: HIVE-1750 URL: https://issues.apache.org/jira/browse/HIVE-1750 Project: Hive Issue Type: Improvement Reporter: Siying Dong Assignee: Siying Dong Attachments: HIVE-1750.1.patch, HIVE-1750.2.patch For some simple queries, partition filtering constraints take 8% of CPU time (now 16% since we filter twice) even if the result is always true. When possible, we should remove these constraints to save CPU times. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Anyway in hive to measure query performance.
Hi, I was wondering if there is any way in Hive to measure the performance of the various components/operations of a single query run. E.g., a query typically involves various operations such as table scan, joins, aggregation, order by, etc. Can I find out how much time each of these required? Also, how do you measure Hadoop cluster performance as far as Hive query/load runs are concerned? -- Best Regards, Prafulla V Tekawade
[jira] Commented: (HIVE-1750) Remove Partition Filtering Conditions when Possible
[ https://issues.apache.org/jira/browse/HIVE-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12927334#action_12927334 ] Amareshwari Sriramadasu commented on HIVE-1750: --- A couple of minor comments: * The javadoc for isOpNot and isOpOr in FunctionRegistry is wrong. Do you want to correct it? * For the code change below, made in many optimizers: {code} + prunedParts = pGraphContext.getOpToPartList().get(tso); + if (prunedParts == null) { + prunedParts = PartitionPruner.prune(); + } {code} I was expecting a pGraphContext.getOpToPartList().put(). Is the PartitionPruner.prune call really needed in all those places, given that PartitionConditionRemover already does a put if it is not null? Correct me if I'm wrong. Remove Partition Filtering Conditions when Possible --- Key: HIVE-1750 URL: https://issues.apache.org/jira/browse/HIVE-1750 Project: Hive Issue Type: Improvement Reporter: Siying Dong Assignee: Siying Dong Attachments: HIVE-1750.1.patch, HIVE-1750.2.patch For some simple queries, partition filtering constraints take 8% of CPU time (now 16% since we filter twice) even if the result is always true. When possible, we should remove these constraints to save CPU times. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
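The lookup-then-store pattern raised in this comment can be sketched roughly as follows. This is only an illustration, not Hive's code: `PrunedPartitionCache`, its string keys and the `pruner` function are hypothetical stand-ins for `pGraphContext.getOpToPartList()`, the table scan operator and `PartitionPruner.prune()`. The point is that a `put` after a cache miss lets later optimizers reuse the pruned list instead of pruning again.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for the operator-to-partition-list map; not Hive's API.
final class PrunedPartitionCache {
    private final Map<String, List<String>> opToPartList = new HashMap<>();
    private int pruneCalls = 0;

    // Returns the cached list for a table scan, pruning and storing it on a miss.
    List<String> getOrPrune(String tableScan, Function<String, List<String>> pruner) {
        List<String> prunedParts = opToPartList.get(tableScan);
        if (prunedParts == null) {
            prunedParts = pruner.apply(tableScan);
            pruneCalls++;
            // The put discussed in the review: without it every caller prunes again.
            opToPartList.put(tableScan, prunedParts);
        }
        return prunedParts;
    }

    int pruneCalls() {
        return pruneCalls;
    }

    public static void main(String[] args) {
        PrunedPartitionCache cache = new PrunedPartitionCache();
        // Illustrative pruner that always returns a single made-up partition name.
        Function<String, List<String>> pruner =
            ts -> Collections.singletonList("ds=2008-04-08/hr=11");
        cache.getOrPrune("tso1", pruner);
        cache.getOrPrune("tso1", pruner);
        System.out.println("prune calls: " + cache.pruneCalls());
    }
}
```

With the `put` in place, the second call for the same table scan is served from the map and the pruner runs only once.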
[jira] Updated: (HIVE-1761) Support show locks for a particular table
[ https://issues.apache.org/jira/browse/HIVE-1761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Yongqiang updated HIVE-1761: --- Resolution: Fixed Status: Resolved (was: Patch Available) I just committed! Thanks Namit! Support show locks for a particular table - Key: HIVE-1761 URL: https://issues.apache.org/jira/browse/HIVE-1761 Project: Hive Issue Type: Improvement Components: Query Processor Reporter: Namit Jain Assignee: Namit Jain Attachments: hive.1761.1.patch Currently, only show locks is supported - it would be very useful to show locks for a particular table -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (HIVE-1721) use bloom filters to improve the performance of joins
[ https://issues.apache.org/jira/browse/HIVE-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Namit Jain reassigned HIVE-1721: Assignee: Siying Dong (was: Liyin Tang) use bloom filters to improve the performance of joins - Key: HIVE-1721 URL: https://issues.apache.org/jira/browse/HIVE-1721 Project: Hive Issue Type: New Feature Components: Query Processor Reporter: Namit Jain Assignee: Siying Dong In case of map-joins, it is likely that the big table will not find many matching rows from the small table. Currently, we perform a hash-map lookup for every row in the big table, which can be pretty expensive. It might be useful to try out a bloom-filter containing all the elements in the small table. Each element from the big table is first searched in the bloom filter, and only in case of a positive match, the small table hash table is explored. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
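The probe order described in this issue (bloom filter first, hash table only on a positive match) could look roughly like the sketch below. It is a toy written for illustration under stated assumptions, not a proposed Hive implementation: the class `BloomJoinSketch`, the `BitSet`-backed filter, the double-hashing scheme and the string keys are all hypothetical choices.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; class and method names are not from Hive.
final class BloomJoinSketch {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;
    private final Map<String, String> smallTable = new HashMap<>();
    private int hashLookups = 0;

    BloomJoinSketch(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // Derives the i-th bit position from two base hashes (double hashing).
    private int position(String key, int i) {
        int h1 = key.hashCode();
        int h2 = (h1 >>> 16) | 1; // forced odd so that successive probes differ
        return Math.floorMod(h1 + i * h2, numBits);
    }

    // Loads one small-table row into both the hash table and the filter.
    void addSmallTableRow(String key, String value) {
        smallTable.put(key, value);
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(key, i));
        }
    }

    // False means "definitely absent"; true means "possibly present".
    boolean mightContain(String key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(key, i))) {
                return false;
            }
        }
        return true;
    }

    // Probes with a big-table key; the hash table is consulted only on a filter hit.
    String probe(String bigTableKey) {
        if (!mightContain(bigTableKey)) {
            return null;
        }
        hashLookups++;
        return smallTable.get(bigTableKey); // still null on a false positive
    }

    int hashLookups() {
        return hashLookups;
    }

    public static void main(String[] args) {
        BloomJoinSketch join = new BloomJoinSketch(1 << 16, 3);
        join.addSmallTableRow("user42", "interesting");
        System.out.println(join.probe("user42"));
        System.out.println(join.probe("user7"));
        System.out.println("hash lookups: " + join.hashLookups());
    }
}
```

A bloom filter never reports a stored key as absent, so no matching row is lost; a false positive only costs one extra hash table lookup.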
[ANNOUNCE] New Committer - Carl Steinbach
Hi Folks, The Hive PMC has passed the vote to make Carl Steinbach a new committer on the Apache Hive project. Carl has made a lot of contributions to Hive with the latest being him serving as the release manager for 0.6.0 release. Following is a list of some of the contributions that he has made to the project: http://bit.ly/bu5rHq Congratulations Carl!! Please send over your CLA to Apache. Thanks, Ashish
Re: [ANNOUNCE] New Committer - Carl Steinbach
On Tue, Nov 2, 2010 at 2:23 PM, Ashish Thusoo athu...@facebook.com wrote: Hi Folks, The Hive PMC has passed the vote to make Carl Steinbach a new committer on the Apache Hive project. Carl has made a lot of contributions to Hive with the latest being him serving as the release manager for 0.6.0 release. Following is a list of some of the contributions that he has made to the project: http://bit.ly/bu5rHq Congratulations Carl!! Please send over your CLA to Apache. Thanks, Ashish Carl, Congrats. Nice to have you aboard. Edward
[jira] Commented: (HIVE-1750) Remove Partition Filtering Conditions when Possible
[ https://issues.apache.org/jira/browse/HIVE-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12927533#action_12927533 ] Siying Dong commented on HIVE-1750: --- Amareshwari, it's a good catch. I'll make a put there. Will submit a patch later. Remove Partition Filtering Conditions when Possible --- Key: HIVE-1750 URL: https://issues.apache.org/jira/browse/HIVE-1750 Project: Hive Issue Type: Improvement Reporter: Siying Dong Assignee: Siying Dong Attachments: HIVE-1750.1.patch, HIVE-1750.2.patch For some simple queries, partition filtering constraints take 8% of CPU time (now 16% since we filter twice) even if the result is always true. When possible, we should remove these constraints to save CPU times. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1750) Remove Partition Filtering Conditions when Possible
[ https://issues.apache.org/jira/browse/HIVE-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12927535#action_12927535 ] Siying Dong commented on HIVE-1750: --- In the case that at least one partition is actually an unpartitioned table, the result can be unpredictable. I return null for this corner case since I think it is safer. Remove Partition Filtering Conditions when Possible --- Key: HIVE-1750 URL: https://issues.apache.org/jira/browse/HIVE-1750 Project: Hive Issue Type: Improvement Reporter: Siying Dong Assignee: Siying Dong Attachments: HIVE-1750.1.patch, HIVE-1750.2.patch For some simple queries, partition filtering constraints take 8% of CPU time (now 16% since we filter twice) even if the result is always true. When possible, we should remove these constraints to save CPU times. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Build failed in Hudson: Hive-trunk-h0.20 #410
See https://hudson.apache.org/hudson/job/Hive-trunk-h0.20/410/ -- [...truncated 15162 lines...] [junit] POSTHOOK: Output: defa...@src1 [junit] OK [junit] Copying data from https://hudson.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/data/files/kv1.seq [junit] Loading data to table src_sequencefile [junit] POSTHOOK: Output: defa...@src_sequencefile [junit] OK [junit] Copying data from https://hudson.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/data/files/complex.seq [junit] Loading data to table src_thrift [junit] POSTHOOK: Output: defa...@src_thrift [junit] OK [junit] Copying data from https://hudson.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/data/files/json.txt [junit] Loading data to table src_json [junit] POSTHOOK: Output: defa...@src_json [junit] OK [junit] diff https://hudson.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/build/ql/test/logs/negative/unknown_table1.q.out https://hudson.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/ql/src/test/results/compiler/errors/unknown_table1.q.out [junit] Done query: unknown_table1.q [junit] Begin query: unknown_table2.q [junit] Copying data from https://hudson.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/data/files/kv1.txt [junit] Loading data to table srcpart partition (ds=2008-04-08, hr=11) [junit] rmr: cannot remove phttps://hudson.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/build/ql/test/data/warehouse/srcpart/ds=2008-04-08/hr=11: No such file or directory. [junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-08/hr=11 [junit] OK [junit] Copying data from https://hudson.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/data/files/kv1.txt [junit] Loading data to table srcpart partition (ds=2008-04-08, hr=12) [junit] rmr: cannot remove phttps://hudson.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/build/ql/test/data/warehouse/srcpart/ds=2008-04-08/hr=12: No such file or directory. 
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-08/hr=12 [junit] OK [junit] Copying data from https://hudson.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/data/files/kv1.txt [junit] Loading data to table srcpart partition (ds=2008-04-09, hr=11) [junit] rmr: cannot remove phttps://hudson.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/build/ql/test/data/warehouse/srcpart/ds=2008-04-09/hr=11: No such file or directory. [junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-09/hr=11 [junit] OK [junit] Copying data from https://hudson.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/data/files/kv1.txt [junit] Loading data to table srcpart partition (ds=2008-04-09, hr=12) [junit] rmr: cannot remove phttps://hudson.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/build/ql/test/data/warehouse/srcpart/ds=2008-04-09/hr=12: No such file or directory. [junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-09/hr=12 [junit] OK [junit] POSTHOOK: Output: defa...@srcbucket [junit] OK [junit] Copying data from https://hudson.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/data/files/srcbucket0.txt [junit] Loading data to table srcbucket [junit] POSTHOOK: Output: defa...@srcbucket [junit] OK [junit] Copying data from https://hudson.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/data/files/srcbucket1.txt [junit] Loading data to table srcbucket [junit] POSTHOOK: Output: defa...@srcbucket [junit] OK [junit] POSTHOOK: Output: defa...@srcbucket2 [junit] OK [junit] Copying data from https://hudson.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/data/files/srcbucket20.txt [junit] Loading data to table srcbucket2 [junit] POSTHOOK: Output: defa...@srcbucket2 [junit] OK [junit] Copying data from https://hudson.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/data/files/srcbucket21.txt [junit] Loading data to table srcbucket2 [junit] POSTHOOK: Output: defa...@srcbucket2 [junit] OK [junit] Copying data from https://hudson.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/data/files/srcbucket22.txt 
[junit] Loading data to table srcbucket2 [junit] POSTHOOK: Output: defa...@srcbucket2 [junit] OK [junit] Copying data from https://hudson.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/data/files/srcbucket23.txt [junit] Loading data to table srcbucket2 [junit] POSTHOOK: Output: defa...@srcbucket2 [junit] OK [junit] Copying data from https://hudson.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/data/files/kv1.txt [junit] Loading data to table src [junit] POSTHOOK: Output: defa...@src [junit] OK [junit] Copying data from https://hudson.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/data/files/kv3.txt [junit] Loading data to table src1 [junit] POSTHOOK: Output: defa...@src1 [junit] OK [junit] Copying data from
[jira] Commented: (HIVE-1721) use bloom filters to improve the performance of joins
[ https://issues.apache.org/jira/browse/HIVE-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12927545#action_12927545 ] Namit Jain commented on HIVE-1721: -- T2 does not fit in memory completely. We create a bloom filter for T2, which fits in memory - the assumption here is that by filtering out a lot of rows from T1, we are reducing the number of rows that go to the reducer substantially, which helps the join performance use bloom filters to improve the performance of joins - Key: HIVE-1721 URL: https://issues.apache.org/jira/browse/HIVE-1721 Project: Hive Issue Type: New Feature Components: Query Processor Reporter: Namit Jain Assignee: Siying Dong In case of map-joins, it is likely that the big table will not find many matching rows from the small table. Currently, we perform a hash-map lookup for every row in the big table, which can be pretty expensive. It might be useful to try out a bloom-filter containing all the elements in the small table. Each element from the big table is first searched in the bloom filter, and only in case of a positive match, the small table hash table is explored. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1526) Hive should depend on a release version of Thrift
[ https://issues.apache.org/jira/browse/HIVE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12927549#action_12927549 ] Pradeep Kamath commented on HIVE-1526: -- Hi Carl - just wondering if you have had a chance to look at this - a new patch for this issue will help me create a patch for HIVE-1696 (I suspect we will need to redo HIVE-842 as well - I can take a stab at that once this patch is ready). Hive should depend on a release version of Thrift - Key: HIVE-1526 URL: https://issues.apache.org/jira/browse/HIVE-1526 Project: Hive Issue Type: Task Components: Build Infrastructure, Clients Reporter: Carl Steinbach Assignee: Todd Lipcon Fix For: 0.7.0 Attachments: HIVE-1526.2.patch.txt, hive-1526.txt, libfb303.jar, libthrift.jar Hive should depend on a release version of Thrift, and ideally it should use Ivy to resolve this dependency. The Thrift folks are working on adding Thrift artifacts to a maven repository here: https://issues.apache.org/jira/browse/THRIFT-363 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1721) use bloom filters to improve the performance of joins
[ https://issues.apache.org/jira/browse/HIVE-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12927551#action_12927551 ] Siying Dong commented on HIVE-1721: --- Is it a common use case? The small table is so big that it doesn't even fit in memory, but most rows in the big table don't match any of those keys. use bloom filters to improve the performance of joins - Key: HIVE-1721 URL: https://issues.apache.org/jira/browse/HIVE-1721 Project: Hive Issue Type: New Feature Components: Query Processor Reporter: Namit Jain Assignee: Siying Dong In case of map-joins, it is likely that the big table will not find many matching rows from the small table. Currently, we perform a hash-map lookup for every row in the big table, which can be pretty expensive. It might be useful to try out a bloom-filter containing all the elements in the small table. Each element from the big table is first searched in the bloom filter, and only in case of a positive match, the small table hash table is explored. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1721) use bloom filters to improve the performance of joins
[ https://issues.apache.org/jira/browse/HIVE-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12927553#action_12927553 ] Namit Jain commented on HIVE-1721: -- Yes, even after all the optimizations, map-join is restricted to tables ~25M. There are lots of scenarios when the small table is ~100M use bloom filters to improve the performance of joins - Key: HIVE-1721 URL: https://issues.apache.org/jira/browse/HIVE-1721 Project: Hive Issue Type: New Feature Components: Query Processor Reporter: Namit Jain Assignee: Siying Dong In case of map-joins, it is likely that the big table will not find many matching rows from the small table. Currently, we perform a hash-map lookup for every row in the big table, which can be pretty expensive. It might be useful to try out a bloom-filter containing all the elements in the small table. Each element from the big table is first searched in the bloom filter, and only in case of a positive match, the small table hash table is explored. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
RE: Anyway in hive to measure query performance.
We are still building infrastructure to make performance optimization easier, but for now all the measurements are fairly manual. At the component/operation level in particular, we don't have a good tool yet. What we do now is select some typical benchmark queries that cover some simple use cases. We have baseline performance numbers for them (we focus on CPU cycles since they are relatively stable), and then we run a simple Java profiler to see which components can be optimized, implement the improvement, run it against the same set of benchmark queries (in the same environment) and verify whether we see the improvements we expect. We try to isolate Hive's execution performance from factors introduced by Hadoop. We do care about Hadoop cluster performance in the context of Hive queries, and we optimize it separately. -Original Message- From: prafulla.tekaw...@gmail.com [mailto:prafulla.tekaw...@gmail.com] On Behalf Of Prafulla Tekawade Sent: Monday, November 01, 2010 11:06 PM To: hive-...@hadoop.apache.org Subject: Anyway in hive to measure query performance. Hi, I was wondering if there is any way in Hive to measure the performance of the various components/operations of a single query run. E.g., a query typically involves various operations such as table scan, joins, aggregation, order by, etc. Can I find out how much time each of these required? Also, how do you measure Hadoop cluster performance as far as Hive query/load runs are concerned? -- Best Regards, Prafulla V Tekawade
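As a rough illustration of measuring CPU time rather than wall-clock time for a fixed workload, here is a minimal sketch using the JDK's `ThreadMXBean`. The reply above does not say how the CPU numbers are actually collected, so this is an assumption; the class `CpuTimeBenchmark` and the toy workload are hypothetical, and the measurement covers only the calling thread.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Hypothetical helper; not part of Hive.
final class CpuTimeBenchmark {
    // Returns the CPU nanoseconds the current thread spent in task, or -1 if unavailable.
    static long cpuNanos(Runnable task) {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        if (!bean.isCurrentThreadCpuTimeSupported() || !bean.isThreadCpuTimeEnabled()) {
            task.run();
            return -1L;
        }
        long start = bean.getCurrentThreadCpuTime();
        task.run();
        return bean.getCurrentThreadCpuTime() - start;
    }

    public static void main(String[] args) {
        // Toy stand-in for a benchmark query; repeated runs show how stable the number is.
        Runnable workload = () -> {
            long sum = 0;
            for (int i = 0; i < 5_000_000; i++) {
                sum += i % 7;
            }
            if (sum < 0) {
                System.out.println("unreachable");
            }
        };
        for (int run = 1; run <= 3; run++) {
            System.out.println("run " + run + ": " + cpuNanos(workload) + " ns CPU");
        }
    }
}
```

Running the same workload before and after a change, in the same environment, gives the kind of before/after comparison described in the reply.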
Re: [ANNOUNCE] New Committer - Carl Steinbach
Congrats Carl. On Tue, Nov 2, 2010 at 11:27 AM, Edward Capriolo edlinuxg...@gmail.com wrote: On Tue, Nov 2, 2010 at 2:23 PM, Ashish Thusoo athu...@facebook.com wrote: Hi Folks, The Hive PMC has passed the vote to make Carl Steinbach a new committer on the Apache Hive project. Carl has made a lot of contributions to Hive with the latest being him serving as the release manager for 0.6.0 release. Following is a list of some of the contributions that he has made to the project: http://bit.ly/bu5rHq Congratulations Carl!! Please send over your CLA to Apache. Thanks, Ashish Carl, Congrats. Nice to have you aboard. Edward
[jira] Commented: (HIVE-1721) use bloom filters to improve the performance of joins
[ https://issues.apache.org/jira/browse/HIVE-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12927555#action_12927555 ] Joydeep Sen Sarma commented on HIVE-1721: - @Siying - that's a good question. I don't know statistically how common it is, but we have heard requests along these lines. For example, one use case is that a project wants to get some data for a reasonably large subset of the users. One use case we have seen was where 0.2% of users were interesting, but even 0.2% is very large for us. People also use semi-joins, and that pretty much says that people want to filter rows out. use bloom filters to improve the performance of joins - Key: HIVE-1721 URL: https://issues.apache.org/jira/browse/HIVE-1721 Project: Hive Issue Type: New Feature Components: Query Processor Reporter: Namit Jain Assignee: Siying Dong In case of map-joins, it is likely that the big table will not find many matching rows from the small table. Currently, we perform a hash-map lookup for every row in the big table, which can be pretty expensive. It might be useful to try out a bloom-filter containing all the elements in the small table. Each element from the big table is first searched in the bloom filter, and only in case of a positive match, the small table hash table is explored. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1721) use bloom filters to improve the performance of joins
[ https://issues.apache.org/jira/browse/HIVE-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12927567#action_12927567 ] Siying Dong commented on HIVE-1721: --- So the idea is, the filtered rows in the big table fit in memory so that we can sort them and pay sequential I/O to read the small table back? Or we do external sort for the filtered rows from the big table? use bloom filters to improve the performance of joins - Key: HIVE-1721 URL: https://issues.apache.org/jira/browse/HIVE-1721 Project: Hive Issue Type: New Feature Components: Query Processor Reporter: Namit Jain Assignee: Siying Dong In case of map-joins, it is likely that the big table will not find many matching rows from the small table. Currently, we perform a hash-map lookup for every row in the big table, which can be pretty expensive. It might be useful to try out a bloom-filter containing all the elements in the small table. Each element from the big table is first searched in the bloom filter, and only in case of a positive match, the small table hash table is explored. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1721) use bloom filters to improve the performance of joins
[ https://issues.apache.org/jira/browse/HIVE-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12927568#action_12927568 ] Namit Jain commented on HIVE-1721: -- That depends on the size of the filtered big table: To start with, we can do a join of the small table with the filtered big table using the current infrastructure. We may need some special tricks for outer joins, but it should be possible use bloom filters to improve the performance of joins - Key: HIVE-1721 URL: https://issues.apache.org/jira/browse/HIVE-1721 Project: Hive Issue Type: New Feature Components: Query Processor Reporter: Namit Jain Assignee: Siying Dong In case of map-joins, it is likely that the big table will not find many matching rows from the small table. Currently, we perform a hash-map lookup for every row in the big table, which can be pretty expensive. It might be useful to try out a bloom-filter containing all the elements in the small table. Each element from the big table is first searched in the bloom filter, and only in case of a positive match, the small table hash table is explored. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (HIVE-1763) drop table (or view) should issue warning if table doesn't exist
drop table (or view) should issue warning if table doesn't exist Key: HIVE-1763 URL: https://issues.apache.org/jira/browse/HIVE-1763 Project: Hive Issue Type: Improvement Components: Metastore Reporter: dan f Priority: Minor drop table reports OK even if the table doesn't exist. Better to report something like mysql's Unknown table 'foo' so that, e.g., unwanted tables (especially ones with names prone to typos) don't persist. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1332) Archiving partitions
[ https://issues.apache.org/jira/browse/HIVE-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12927627#action_12927627 ] Paul Yang commented on HIVE-1332: - Added archiving sections at: http://wiki.apache.org/hadoop/Hive/LanguageManual/DDL#Alter_Table_.28Un.29Archive http://wiki.apache.org/hadoop/Hive/LanguageManual/Archiving Archiving partitions Key: HIVE-1332 URL: https://issues.apache.org/jira/browse/HIVE-1332 Project: Hive Issue Type: New Feature Components: Metastore Reporter: Paul Yang Assignee: Paul Yang Fix For: 0.6.0 Attachments: HIVE-1332.1.patch, HIVE-1332.2.patch, HIVE-1332.3.patch, HIVE-1332.4.patch, HIVE-1332.5.patch, HIVE-1332.6.patch Partitions and tables in Hive typically consist of many files on HDFS. An issue is that as the number of files increase, there will be higher memory/load requirements on the namenode. Partitions in bucketed tables are a particular problem because they consist of many files, one for each of the buckets. One way to drastically reduce the number of files is to use hadoop archives: http://hadoop.apache.org/common/docs/current/hadoop_archives.html This feature would introduce an ALTER TABLE table_name ARCHIVE PARTITION spec that would automatically put the files for the partition into a HAR file. We would also have an UNARCHIVE option to convert the files in the partition back to the original files. Archived partitions would be slower to access, but they would have the same functionality and decrease the number of files drastically. Typically, only seldom accessed partitions would be archived. Hadoop archives are still somewhat new, so we'll only put in support for the latest released major version (0.20). 
Here are some bug fixes: https://issues.apache.org/jira/browse/HADOOP-6591 (Important - could potentially cause data loss without this fix) https://issues.apache.org/jira/browse/HADOOP-6645 https://issues.apache.org/jira/browse/MAPREDUCE-1585 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1750) Remove Partition Filtering Conditions when Possible
[ https://issues.apache.org/jira/browse/HIVE-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12927697#action_12927697 ] Siying Dong commented on HIVE-1750: --- Namit, sorry I misunderstood. Yes, maybe evalExprWithPart() can share some code with PartitionPruner. Remove Partition Filtering Conditions when Possible --- Key: HIVE-1750 URL: https://issues.apache.org/jira/browse/HIVE-1750 Project: Hive Issue Type: Improvement Reporter: Siying Dong Assignee: Siying Dong Attachments: HIVE-1750.1.patch, HIVE-1750.2.patch For some simple queries, partition filtering constraints take 8% of CPU time (now 16% since we filter twice) even if the result is always true. When possible, we should remove these constraints to save CPU times. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
RE: [ANNOUNCE] New Committer - Carl Steinbach
Congrats, Carl! Great work -Original Message- From: Ashish Thusoo [mailto:athu...@facebook.com] Sent: Tuesday, November 02, 2010 11:23 AM To: dev@hive.apache.org Subject: [ANNOUNCE] New Committer - Carl Steinbach Hi Folks, The Hive PMC has passed the vote to make Carl Steinbach a new committer on the Apache Hive project. Carl has made a lot of contributions to Hive with the latest being him serving as the release manager for 0.6.0 release. Following is a list of some of the contributions that he has made to the project: http://bit.ly/bu5rHq Congratulations Carl!! Please send over your CLA to Apache. Thanks, Ashish
[jira] Updated: (HIVE-1501) when generating reentrant INSERT for index rebuild, quote identifiers using backticks
[ https://issues.apache.org/jira/browse/HIVE-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Skye Berghel updated HIVE-1501: --- Status: Patch Available (was: Open) when generating reentrant INSERT for index rebuild, quote identifiers using backticks - Key: HIVE-1501 URL: https://issues.apache.org/jira/browse/HIVE-1501 Project: Hive Issue Type: Bug Components: Indexing Affects Versions: 0.7.0 Reporter: John Sichi Assignee: Skye Berghel Fix For: 0.7.0 Attachments: 1501.patch, 1501_new_tests.patch, 1501_with_tests.patch, HIVE-1501.4.patch Yongqiang, you mentioned that you weren't able to do this due to SORT BY not accepting them. The SORT BY is gone now as of HIVE-1494 (and SORT BY needs to be fixed anyway). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HIVE-1497) support COMMENT clause on CREATE INDEX, and add new commands for SHOW/DESCRIBE indexes
[ https://issues.apache.org/jira/browse/HIVE-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Russell Melick updated HIVE-1497: - Attachment: HIVE-1497.4.patch support COMMENT clause on CREATE INDEX, and add new commands for SHOW/DESCRIBE indexes -- Key: HIVE-1497 URL: https://issues.apache.org/jira/browse/HIVE-1497 Project: Hive Issue Type: Improvement Components: Indexing Affects Versions: 0.7.0 Reporter: John Sichi Assignee: Russell Melick Fix For: 0.7.0 Attachments: HIVE-1497.4.patch, hive-1497.p1.patch, hive-1497.p2.patch, hive-1497.p3.patch We need to work out the syntax for SHOW/DESCRIBE, taking partitioning into account. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1750) Remove Partition Filtering Conditions when Possible
[ https://issues.apache.org/jira/browse/HIVE-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12927735#action_12927735 ] Namit Jain commented on HIVE-1750: -- The code changes look good to me. Can you add some tests and do an explain plan for all kinds of scenarios: ds 10 and x 5; ds 10 or x 5; ds 10 and x 5 and y 10; (ds 10 and x 5) or (ds 10 and y 5); (ds 10 and x 5) or (ds 5 and y 5) Remove Partition Filtering Conditions when Possible --- Key: HIVE-1750 URL: https://issues.apache.org/jira/browse/HIVE-1750 Project: Hive Issue Type: Improvement Reporter: Siying Dong Assignee: Siying Dong Attachments: HIVE-1750.1.patch, HIVE-1750.2.patch, HIVE-1750.3.patch For some simple queries, partition filtering constraints take 8% of CPU time (now 16% since we filter twice) even if the result is always true. When possible, we should remove these constraints to save CPU times. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1750) Remove Partition Filtering Conditions when Possible
[ https://issues.apache.org/jira/browse/HIVE-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12927737#action_12927737 ] Namit Jain commented on HIVE-1750: -- Amareshwari, can you also confirm the changes ? Remove Partition Filtering Conditions when Possible --- Key: HIVE-1750 URL: https://issues.apache.org/jira/browse/HIVE-1750 Project: Hive Issue Type: Improvement Reporter: Siying Dong Assignee: Siying Dong Attachments: HIVE-1750.1.patch, HIVE-1750.2.patch, HIVE-1750.3.patch For some simple queries, partition filtering constraints take 8% of CPU time (now 16% since we filter twice) even if the result is always true. When possible, we should remove these constraints to save CPU times. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Issue Comment Edited: (HIVE-1750) Remove Partition Filtering Conditions when Possible
[ https://issues.apache.org/jira/browse/HIVE-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12927735#action_12927735 ] Namit Jain edited comment on HIVE-1750 at 11/3/10 12:58 AM: The code changes look good to me. Can you add some tests and do an explain plan for all kinds of scenarios: ds 10 and x 5; ds 10 or x 5; ds 10 and x 5 and y 10; (ds 10 and x 5) or (ds 10 and y 5); (ds 10 and x 5) or (ds 5 and y 5); (ds 10 or x 5) and (ds 5 or y 5) was (Author: namit): The code changes look good to me. Can you add some tests and do an explain plan for all kinds of scenarios: ds 10 and x 5; ds 10 or x 5; ds 10 and x 5 and y 10; (ds 10 and x 5) or (ds 10 and y 5); (ds 10 and x 5) or (ds 5 and y 5) Remove Partition Filtering Conditions when Possible --- Key: HIVE-1750 URL: https://issues.apache.org/jira/browse/HIVE-1750 Project: Hive Issue Type: Improvement Reporter: Siying Dong Assignee: Siying Dong Attachments: HIVE-1750.1.patch, HIVE-1750.2.patch, HIVE-1750.3.patch For some simple queries, partition filtering constraints take 8% of CPU time (now 16% since we filter twice) even if the result is always true. When possible, we should remove these constraints to save CPU times. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
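To make the and/or scenarios above more concrete, here is a small sketch of how a partition predicate that holds for every remaining partition might be dropped from a filter expression. It is a simplified illustration and not the patch attached to the issue: the class `PartitionConditionSketch`, the expression types and the integer `ds` values are hypothetical, and the comparison operators in the example (`ds > 10`, `x < 5`) are illustrative choices, since the archived comment does not show them.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical sketch; none of these types exist in Hive.
final class PartitionConditionSketch {
    interface Expr {}

    // Constant marking a condition that is known to hold for every pruned partition.
    static final Expr TRUE = new Expr() {
        @Override public String toString() { return "true"; }
    };

    static final class Leaf implements Expr {
        final String text;
        final IntPredicate partitionTest; // null when the predicate is not on the partition column
        Leaf(String text, IntPredicate partitionTest) {
            this.text = text;
            this.partitionTest = partitionTest;
        }
        @Override public String toString() { return text; }
    }

    static final class BoolOp implements Expr {
        final boolean isAnd;
        final Expr left;
        final Expr right;
        BoolOp(boolean isAnd, Expr left, Expr right) {
            this.isAnd = isAnd;
            this.left = left;
            this.right = right;
        }
        @Override public String toString() {
            return "(" + left + (isAnd ? " and " : " or ") + right + ")";
        }
    }

    // Replaces partition predicates that hold for all pruned ds values, then folds and/or.
    static Expr simplify(Expr e, List<Integer> prunedDs) {
        if (e instanceof Leaf) {
            Leaf leaf = (Leaf) e;
            boolean alwaysTrue = leaf.partitionTest != null
                && prunedDs.stream().allMatch(v -> leaf.partitionTest.test(v));
            return alwaysTrue ? TRUE : e;
        }
        if (e instanceof BoolOp) {
            BoolOp op = (BoolOp) e;
            Expr l = simplify(op.left, prunedDs);
            Expr r = simplify(op.right, prunedDs);
            if (op.isAnd) {
                if (l == TRUE) return r; // true and r  ==  r
                if (r == TRUE) return l; // l and true  ==  l
            } else if (l == TRUE || r == TRUE) {
                return TRUE;             // true or anything  ==  true
            }
            return new BoolOp(op.isAnd, l, r);
        }
        return e; // already the TRUE constant
    }

    public static void main(String[] args) {
        List<Integer> prunedDs = Arrays.asList(11, 12); // partitions left after pruning
        Expr ds = new Leaf("ds > 10", v -> v > 10);
        Expr x = new Leaf("x < 5", null);
        System.out.println(simplify(new BoolOp(true, ds, x), prunedDs));
        System.out.println(simplify(new BoolOp(false, ds, x), prunedDs));
    }
}
```

A result of `true` means the whole filter could be dropped; anything else is the residual condition that still has to be evaluated per row. A predicate that holds for only some of the remaining partitions is left untouched in this sketch.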