[jira] Commented: (HIVE-1750) Remove Partition Filtering Conditions when Possible

2010-11-02 Thread Namit Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12927312#action_12927312
 ] 

Namit Jain commented on HIVE-1750:
--

OpProcFactory:

  for (Partition p: prunedPartList.getConfirmedPartns()) {
    if (!p.getTable().isPartitioned()) {
      return null;
    }
  }
  for (Partition p: prunedPartList.getUnknownPartns()) {
    if (!p.getTable().isPartitioned()) {
      return null;
    }
  }

Why are the above changes needed ?


The overall approach looks good - still looking in detail.



 Remove Partition Filtering Conditions when Possible
 ---

 Key: HIVE-1750
 URL: https://issues.apache.org/jira/browse/HIVE-1750
 Project: Hive
  Issue Type: Improvement
Reporter: Siying Dong
Assignee: Siying Dong
 Attachments: HIVE-1750.1.patch, HIVE-1750.2.patch


 For some simple queries, partition filtering constraints take 8% of CPU time
 (now 16% since we filter twice) even if the result is always true. When
 possible, we should remove these constraints to save CPU time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Anyway in hive to measure query performance.

2010-11-02 Thread Prafulla Tekawade
Hi,
I was wondering if there is any way in Hive to measure the
performance of the various components/operations of a single
query run.
E.g., a query typically involves various operations such as table scan,
joins, aggregation, order by, etc. Can I find out how much time was
required for each of these?

Also, how do you measure hadoop-cluster performance as far as a Hive
query/load run is concerned?

-- 
Best Regards,
Prafulla V Tekawade


[jira] Commented: (HIVE-1750) Remove Partition Filtering Conditions when Possible

2010-11-02 Thread Amareshwari Sriramadasu (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12927334#action_12927334
 ] 

Amareshwari Sriramadasu commented on HIVE-1750:
---

A couple of minor comments:
* javadoc for isOpNot and isOpOr in FunctionRegistry is wrong. Do you want to 
correct it?
* For the below code change in many optimizers :
{code}
+  prunedParts = pGraphContext.getOpToPartList().get(tso);
+  if (prunedParts == null) {
+    prunedParts = PartitionPruner.prune();
+  }
{code}
I was expecting a pGraphContext.getOpToPartList().put().
Is the PartitionPruner.prune call really needed in all those places, given that
PartitionConditionRemover already does a put if it is not null? Correct me if
I'm wrong.
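
The get-then-put pattern being asked for can be sketched as follows. This is
only an illustration: PrunedPartitionCache, getOrPrune, the String keys and the
fake prune() body are all hypothetical stand-ins, not Hive classes or the code
in the patch. The point is that a miss is computed once and stored, so later
optimizers do not prune again.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a per-table-scan cache of pruned partitions.
// None of these names come from Hive; they only illustrate get-then-put.
class PrunedPartitionCache {
    private final Map<String, List<String>> opToPartList = new HashMap<>();
    private int pruneCalls = 0;

    // Stand-in for an expensive pruning call (assumption: returns a fixed list).
    private List<String> prune(String tableScan) {
        pruneCalls++;
        List<String> parts = new ArrayList<>();
        parts.add(tableScan + "/ds=2008-04-08");
        return parts;
    }

    // Look up the cached result; on a miss, compute it and store it so that
    // later callers reuse it instead of pruning again.
    List<String> getOrPrune(String tableScan) {
        List<String> prunedParts = opToPartList.get(tableScan);
        if (prunedParts == null) {
            prunedParts = prune(tableScan);
            opToPartList.put(tableScan, prunedParts);
        }
        return prunedParts;
    }

    int getPruneCalls() {
        return pruneCalls;
    }

    public static void main(String[] args) {
        PrunedPartitionCache cache = new PrunedPartitionCache();
        cache.getOrPrune("srcpart");
        cache.getOrPrune("srcpart");
        System.out.println("prune calls: " + cache.getPruneCalls()); // prune calls: 1
    }
}
```

Without the put, every caller that sees a null entry would repeat the pruning
work, which is the redundancy the comment above is pointing at.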




[jira] Updated: (HIVE-1761) Support show locks for a particular table

2010-11-02 Thread He Yongqiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

He Yongqiang updated HIVE-1761:
---

Resolution: Fixed
Status: Resolved  (was: Patch Available)

I just committed! Thanks Namit!

 Support show locks for a particular table
 -

 Key: HIVE-1761
 URL: https://issues.apache.org/jira/browse/HIVE-1761
 Project: Hive
  Issue Type: Improvement
  Components: Query Processor
Reporter: Namit Jain
Assignee: Namit Jain
 Attachments: hive.1761.1.patch


 Currently, only show locks is supported - it would be very useful to show 
 locks for a particular table




[jira] Assigned: (HIVE-1721) use bloom filters to improve the performance of joins

2010-11-02 Thread Namit Jain (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Namit Jain reassigned HIVE-1721:


Assignee: Siying Dong  (was: Liyin Tang)

 use bloom filters to improve the performance of joins
 -

 Key: HIVE-1721
 URL: https://issues.apache.org/jira/browse/HIVE-1721
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Reporter: Namit Jain
Assignee: Siying Dong

 In case of map-joins, it is likely that the big table will not find many 
 matching rows from the small table.
 Currently, we perform a hash-map lookup for every row in the big table, which 
 can be pretty expensive.
 It might be useful to try out a bloom-filter containing all the elements in 
 the small table.
 Each element from the big table is first searched in the bloom filter, and 
 only in case of a positive match,
 the small table hash table is explored.
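
 A minimal sketch of the proposed lookup order, assuming string join keys and
 a hand-rolled bloom filter. BloomFilteredLookup, its sizing (1024 bits, 3
 hashes) and the double-hashing scheme are illustrative assumptions, not part
 of Hive or of any patch on this issue.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Illustrative bloom filter over the small table's join keys (not Hive code).
class BloomFilteredLookup {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    BloomFilteredLookup(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // Derive the i-th bit position from two base hashes (double hashing).
    private int position(String key, int i) {
        int h1 = key.hashCode();
        int h2 = (h1 >>> 16) | 1; // force odd so that successive probes differ
        return Math.floorMod(h1 + i * h2, numBits);
    }

    void add(String key) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(key, i));
        }
    }

    // false means "definitely absent"; true means "possibly present".
    boolean mightContain(String key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(key, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> smallTable = new HashMap<>();
        smallTable.put("k1", "v1");
        smallTable.put("k2", "v2");

        BloomFilteredLookup filter = new BloomFilteredLookup(1024, 3);
        for (String key : smallTable.keySet()) {
            filter.add(key);
        }

        String[] bigTableKeys = {"k1", "x7", "k2", "x9"};
        int hashProbes = 0;
        for (String key : bigTableKeys) {
            if (!filter.mightContain(key)) {
                continue; // skip the hash-table probe entirely
            }
            hashProbes++;
            String match = smallTable.get(key);
            if (match != null) {
                System.out.println(key + " joins with " + match);
            }
        }
        System.out.println("hash-table probes: " + hashProbes);
    }
}
```

 The filter can report false positives but never false negatives, so a
 positive answer still has to be confirmed against the hash table; only the
 negative answers save work.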




[ANNOUNCE] New Committer - Carl Steinbach

2010-11-02 Thread Ashish Thusoo
Hi Folks,

The Hive PMC has passed the vote to make Carl Steinbach a new committer on the 
Apache Hive project. Carl has made a lot of contributions to Hive with the 
latest being him serving as the release manager for 0.6.0 release. Following is 
a list of some of the contributions that he has made to the project:

http://bit.ly/bu5rHq

Congratulations Carl!! Please send over your CLA to Apache.

Thanks,
Ashish



Re: [ANNOUNCE] New Committer - Carl Steinbach

2010-11-02 Thread Edward Capriolo
Carl,

Congrats. Nice to have you aboard.

Edward


[jira] Commented: (HIVE-1750) Remove Partition Filtering Conditions when Possible

2010-11-02 Thread Siying Dong (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12927533#action_12927533
 ] 

Siying Dong commented on HIVE-1750:
---

Amareshwari, it's a good catch. I'll make a put there. Will submit a patch 
later.






[jira] Commented: (HIVE-1750) Remove Partition Filtering Conditions when Possible

2010-11-02 Thread Siying Dong (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12927535#action_12927535
 ] 

Siying Dong commented on HIVE-1750:
---

If at least one of the partitions actually belongs to an unpartitioned table,
the result can be unpredictable. I return null for this corner case since I
think it is safer.






Build failed in Hudson: Hive-trunk-h0.20 #410

2010-11-02 Thread Apache Hudson Server
See https://hudson.apache.org/hudson/job/Hive-trunk-h0.20/410/

--
[...truncated 15162 lines...]
[junit] POSTHOOK: Output: defa...@src1
[junit] OK
[junit] Copying data from 
https://hudson.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/data/files/kv1.seq
[junit] Loading data to table src_sequencefile
[junit] POSTHOOK: Output: defa...@src_sequencefile
[junit] OK
[junit] Copying data from 
https://hudson.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/data/files/complex.seq
[junit] Loading data to table src_thrift
[junit] POSTHOOK: Output: defa...@src_thrift
[junit] OK
[junit] Copying data from 
https://hudson.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/data/files/json.txt
[junit] Loading data to table src_json
[junit] POSTHOOK: Output: defa...@src_json
[junit] OK
[junit] diff 
https://hudson.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/build/ql/test/logs/negative/unknown_table1.q.out
 
https://hudson.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/ql/src/test/results/compiler/errors/unknown_table1.q.out
[junit] Done query: unknown_table1.q
[junit] Begin query: unknown_table2.q
[junit] Copying data from 
https://hudson.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/data/files/kv1.txt
[junit] Loading data to table srcpart partition (ds=2008-04-08, hr=11)
[junit] rmr: cannot remove 
phttps://hudson.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/build/ql/test/data/warehouse/srcpart/ds=2008-04-08/hr=11:
 No such file or directory.
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-08/hr=11
[junit] OK
[junit] Copying data from 
https://hudson.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/data/files/kv1.txt
[junit] Loading data to table srcpart partition (ds=2008-04-08, hr=12)
[junit] rmr: cannot remove 
phttps://hudson.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/build/ql/test/data/warehouse/srcpart/ds=2008-04-08/hr=12:
 No such file or directory.
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-08/hr=12
[junit] OK
[junit] Copying data from 
https://hudson.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/data/files/kv1.txt
[junit] Loading data to table srcpart partition (ds=2008-04-09, hr=11)
[junit] rmr: cannot remove 
phttps://hudson.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/build/ql/test/data/warehouse/srcpart/ds=2008-04-09/hr=11:
 No such file or directory.
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-09/hr=11
[junit] OK
[junit] Copying data from 
https://hudson.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/data/files/kv1.txt
[junit] Loading data to table srcpart partition (ds=2008-04-09, hr=12)
[junit] rmr: cannot remove 
phttps://hudson.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/build/ql/test/data/warehouse/srcpart/ds=2008-04-09/hr=12:
 No such file or directory.
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-09/hr=12
[junit] OK
[junit] POSTHOOK: Output: defa...@srcbucket
[junit] OK
[junit] Copying data from 
https://hudson.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/data/files/srcbucket0.txt
[junit] Loading data to table srcbucket
[junit] POSTHOOK: Output: defa...@srcbucket
[junit] OK
[junit] Copying data from 
https://hudson.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/data/files/srcbucket1.txt
[junit] Loading data to table srcbucket
[junit] POSTHOOK: Output: defa...@srcbucket
[junit] OK
[junit] POSTHOOK: Output: defa...@srcbucket2
[junit] OK
[junit] Copying data from 
https://hudson.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/data/files/srcbucket20.txt
[junit] Loading data to table srcbucket2
[junit] POSTHOOK: Output: defa...@srcbucket2
[junit] OK
[junit] Copying data from 
https://hudson.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/data/files/srcbucket21.txt
[junit] Loading data to table srcbucket2
[junit] POSTHOOK: Output: defa...@srcbucket2
[junit] OK
[junit] Copying data from 
https://hudson.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/data/files/srcbucket22.txt
[junit] Loading data to table srcbucket2
[junit] POSTHOOK: Output: defa...@srcbucket2
[junit] OK
[junit] Copying data from 
https://hudson.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/data/files/srcbucket23.txt
[junit] Loading data to table srcbucket2
[junit] POSTHOOK: Output: defa...@srcbucket2
[junit] OK
[junit] Copying data from 
https://hudson.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/data/files/kv1.txt
[junit] Loading data to table src
[junit] POSTHOOK: Output: defa...@src
[junit] OK
[junit] Copying data from 
https://hudson.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/data/files/kv3.txt
[junit] Loading data to table src1
[junit] POSTHOOK: Output: defa...@src1
[junit] OK
[junit] Copying data from 

[jira] Commented: (HIVE-1721) use bloom filters to improve the performance of joins

2010-11-02 Thread Namit Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12927545#action_12927545
 ] 

Namit Jain commented on HIVE-1721:
--

T2 does not fit in memory completely.
We create a bloom filter for T2, which does fit in memory. The assumption here
is that by filtering out a lot of rows from T1, we substantially reduce the
number of rows that go to the reducer, which helps the join performance.




[jira] Commented: (HIVE-1526) Hive should depend on a release version of Thrift

2010-11-02 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12927549#action_12927549
 ] 

Pradeep Kamath commented on HIVE-1526:
--

Hi Carl - just wondering if you have had a chance to look at this - a new patch 
for this issue will help me create a patch for HIVE-1696 (I suspect we will 
need to redo HIVE-842 as well - I can take a stab at that once this patch is 
ready).

 Hive should depend on a release version of Thrift
 -

 Key: HIVE-1526
 URL: https://issues.apache.org/jira/browse/HIVE-1526
 Project: Hive
  Issue Type: Task
  Components: Build Infrastructure, Clients
Reporter: Carl Steinbach
Assignee: Todd Lipcon
 Fix For: 0.7.0

 Attachments: HIVE-1526.2.patch.txt, hive-1526.txt, libfb303.jar, 
 libthrift.jar


 Hive should depend on a release version of Thrift, and ideally it should use 
 Ivy to resolve this dependency.
 The Thrift folks are working on adding Thrift artifacts to a maven repository 
 here: https://issues.apache.org/jira/browse/THRIFT-363




[jira] Commented: (HIVE-1721) use bloom filters to improve the performance of joins

2010-11-02 Thread Siying Dong (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12927551#action_12927551
 ] 

Siying Dong commented on HIVE-1721:
---

Is it a common use case? The small table is so big that it doesn't even fit in
memory, but most rows in the big table don't match any of its keys.






[jira] Commented: (HIVE-1721) use bloom filters to improve the performance of joins

2010-11-02 Thread Namit Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12927553#action_12927553
 ] 

Namit Jain commented on HIVE-1721:
--

Yes, even after all the optimizations, map-join is restricted to tables  ~25M.

There are lots of scenarios when the small table is ~100M




RE: Anyway in hive to measure query performance.

2010-11-02 Thread Siying Dong
We are still building infrastructure to make performance optimization easier,
but for now all the measurements are fairly manual.
At the component/operation level in particular, we don't have a good tool to
report it yet.

What we do now is select some typical benchmark queries that cover some simple
use cases. We have a performance baseline for them (we focus on CPU cycles
since they are relatively stable). We then run a simple Java profiler to see
which components can be optimized, implement the improvement, run it against
the same set of benchmark queries (in the same environment), and verify
whether we see the improvements we expect.

We try to isolate Hive's execution performance from factors due to Hadoop. We
do care about hadoop-cluster performance in the context of Hive queries, and
we optimize it separately.
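
As a rough illustration of comparing CPU time rather than wall-clock time, the
sketch below measures the CPU nanoseconds used by the current thread around a
task, using the standard java.lang.management API. CpuTimeBenchmark,
measureCpuNanos, bestOf and the summing loop are made-up names and a stand-in
workload, not the tooling described above.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Illustrative CPU-time harness (an assumption, not the Hive team's tooling).
class CpuTimeBenchmark {
    // CPU nanoseconds used by the current thread while running the task,
    // or -1 if per-thread CPU time is unsupported or disabled in this JVM.
    static long measureCpuNanos(Runnable task) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        if (!threads.isCurrentThreadCpuTimeSupported() || !threads.isThreadCpuTimeEnabled()) {
            return -1L;
        }
        long before = threads.getCurrentThreadCpuTime();
        task.run();
        return threads.getCurrentThreadCpuTime() - before;
    }

    // Smallest CPU time over several runs, to damp warm-up and scheduling noise.
    static long bestOf(int runs, Runnable task) {
        if (runs <= 0) {
            throw new IllegalArgumentException("runs must be positive");
        }
        long best = Long.MAX_VALUE;
        for (int i = 0; i < runs; i++) {
            long nanos = measureCpuNanos(task);
            if (nanos < 0) {
                return -1L;
            }
            best = Math.min(best, nanos);
        }
        return best;
    }

    public static void main(String[] args) {
        // Stand-in workload; a real comparison would run the same benchmark
        // before and after a change, in the same environment.
        Runnable workload = () -> {
            long sum = 0;
            for (int i = 0; i < 5_000_000; i++) {
                sum += i % 7;
            }
            if (sum < 0) {
                System.out.println("unreachable");
            }
        };
        System.out.println("best CPU ns over 5 runs: " + bestOf(5, workload));
    }
}
```

This only covers the thread that runs the task, so it says nothing about work
done on other threads or in other processes.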



Re: [ANNOUNCE] New Committer - Carl Steinbach

2010-11-02 Thread yongqiang he
Congrats Carl.


[jira] Commented: (HIVE-1721) use bloom filters to improve the performance of joins

2010-11-02 Thread Joydeep Sen Sarma (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12927555#action_12927555
 ] 

Joydeep Sen Sarma commented on HIVE-1721:
-

@Siying - that's a good question. I don't know statistically how common it is,
but we have heard requests along these lines. For example, one project wants
to get some data for a reasonably large subset of the users. In one use case
we have seen, 0.2% of users were interesting, but even 0.2% is very large for
us. People also use semi-joins, and that pretty much says that people want to
filter rows out.




[jira] Commented: (HIVE-1721) use bloom filters to improve the performance of joins

2010-11-02 Thread Siying Dong (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12927567#action_12927567
 ] 

Siying Dong commented on HIVE-1721:
---

So the idea is that the filtered rows from the big table fit in memory, so that
we can sort them and pay sequential I/O to read the small table back? Or do we
do an external sort of the filtered rows from the big table?






[jira] Commented: (HIVE-1721) use bloom filters to improve the performance of joins

2010-11-02 Thread Namit Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12927568#action_12927568
 ] 

Namit Jain commented on HIVE-1721:
--

That depends on the size of the filtered big table:

To start with, we can do a join of the small table with the filtered big table 
using the current infrastructure.
We may need some special tricks for outer joins, but it should be possible




[jira] Created: (HIVE-1763) drop table (or view) should issue warning if table doesn't exist

2010-11-02 Thread dan f (JIRA)
drop table (or view) should issue warning if table doesn't exist


 Key: HIVE-1763
 URL: https://issues.apache.org/jira/browse/HIVE-1763
 Project: Hive
  Issue Type: Improvement
  Components: Metastore
Reporter: dan f
Priority: Minor


drop table reports OK even if the table doesn't exist.  Better to report 
something like mysql's Unknown table 'foo' so that, e.g., unwanted tables 
(especially ones with names prone to typos) don't persist.




[jira] Commented: (HIVE-1332) Archiving partitions

2010-11-02 Thread Paul Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12927627#action_12927627
 ] 

Paul Yang commented on HIVE-1332:
-

Added archiving sections at:

http://wiki.apache.org/hadoop/Hive/LanguageManual/DDL#Alter_Table_.28Un.29Archive
http://wiki.apache.org/hadoop/Hive/LanguageManual/Archiving

 Archiving partitions
 

 Key: HIVE-1332
 URL: https://issues.apache.org/jira/browse/HIVE-1332
 Project: Hive
  Issue Type: New Feature
  Components: Metastore
Reporter: Paul Yang
Assignee: Paul Yang
 Fix For: 0.6.0

 Attachments: HIVE-1332.1.patch, HIVE-1332.2.patch, HIVE-1332.3.patch, 
 HIVE-1332.4.patch, HIVE-1332.5.patch, HIVE-1332.6.patch


 Partitions and tables in Hive typically consist of many files on HDFS. An 
 issue is that as the number of files increases, there will be higher
 memory/load requirements on the namenode. Partitions in bucketed tables are a 
 particular problem because they consist of many files, one for each of the 
 buckets.
 One way to drastically reduce the number of files is to use hadoop archives:
 http://hadoop.apache.org/common/docs/current/hadoop_archives.html
 This feature would introduce an ALTER TABLE table_name ARCHIVE PARTITION 
 spec that would automatically put the files for the partition into a HAR 
 file. We would also have an UNARCHIVE option to convert the files in the 
 partition back to the original files. Archived partitions would be slower to 
 access, but they would have the same functionality and decrease the number of 
 files drastically. Typically, only seldom accessed partitions would be 
 archived.
 Hadoop archives are still somewhat new, so we'll only put in support for the 
 latest released major version (0.20). Here are some bug fixes:
 https://issues.apache.org/jira/browse/HADOOP-6591 (Important - could 
 potentially cause data loss without this fix)
 https://issues.apache.org/jira/browse/HADOOP-6645
 https://issues.apache.org/jira/browse/MAPREDUCE-1585




[jira] Commented: (HIVE-1750) Remove Partition Filtering Conditions when Possible

2010-11-02 Thread Siying Dong (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12927697#action_12927697
 ] 

Siying Dong commented on HIVE-1750:
---

Namit, sorry, I misunderstood. Yes, maybe evalExprWithPart() can share some
code with PartitionPruner.






RE: [ANNOUNCE] New Committer - Carl Steinbach

2010-11-02 Thread Paul Yang
Congrats, Carl! Great work




[jira] Updated: (HIVE-1501) when generating reentrant INSERT for index rebuild, quote identifiers using backticks

2010-11-02 Thread Skye Berghel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Skye Berghel updated HIVE-1501:
---

Status: Patch Available  (was: Open)

 when generating reentrant INSERT for index rebuild, quote identifiers using 
 backticks
 -

 Key: HIVE-1501
 URL: https://issues.apache.org/jira/browse/HIVE-1501
 Project: Hive
  Issue Type: Bug
  Components: Indexing
Affects Versions: 0.7.0
Reporter: John Sichi
Assignee: Skye Berghel
 Fix For: 0.7.0

 Attachments: 1501.patch, 1501_new_tests.patch, 1501_with_tests.patch, 
 HIVE-1501.4.patch


 Yongqiang, you mentioned that you weren't able to do this due to SORT BY not 
 accepting them.  The SORT BY is gone now as of HIVE-1494 (and SORT BY needs 
 to be fixed anyway).




[jira] Updated: (HIVE-1497) support COMMENT clause on CREATE INDEX, and add new commands for SHOW/DESCRIBE indexes

2010-11-02 Thread Russell Melick (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Russell Melick updated HIVE-1497:
-

Attachment: HIVE-1497.4.patch

 support COMMENT clause on CREATE INDEX, and add new commands for 
 SHOW/DESCRIBE indexes
 --

 Key: HIVE-1497
 URL: https://issues.apache.org/jira/browse/HIVE-1497
 Project: Hive
  Issue Type: Improvement
  Components: Indexing
Affects Versions: 0.7.0
Reporter: John Sichi
Assignee: Russell Melick
 Fix For: 0.7.0

 Attachments: HIVE-1497.4.patch, hive-1497.p1.patch, 
 hive-1497.p2.patch, hive-1497.p3.patch


 We need to work out the syntax for SHOW/DESCRIBE, taking partitioning into 
 account.




[jira] Commented: (HIVE-1750) Remove Partition Filtering Conditions when Possible

2010-11-02 Thread Namit Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12927735#action_12927735
 ] 

Namit Jain commented on HIVE-1750:
--

The code changes look good to me.
Can you add some tests and do an explain plan for all kinds of scenarios:

ds  10 and x  5
ds  10 or x  5
ds  10 and x  5 and y  10
(ds  10 and x  5) or (ds  10 and y  5)
(ds  10 and x  5) or (ds  5 and y  5)


 Remove Partition Filtering Conditions when Possible
 ---

 Key: HIVE-1750
 URL: https://issues.apache.org/jira/browse/HIVE-1750
 Project: Hive
  Issue Type: Improvement
Reporter: Siying Dong
Assignee: Siying Dong
 Attachments: HIVE-1750.1.patch, HIVE-1750.2.patch, HIVE-1750.3.patch


 For some simple queries, partition filtering constraints take 8% of CPU time
 (now 16% since we filter twice) even if the result is always true. When
 possible, we should remove these constraints to save CPU time.




[jira] Commented: (HIVE-1750) Remove Partition Filtering Conditions when Possible

2010-11-02 Thread Namit Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12927737#action_12927737
 ] 

Namit Jain commented on HIVE-1750:
--

Amareshwari, can you also confirm the changes ?





[jira] Issue Comment Edited: (HIVE-1750) Remove Partition Filtering Conditions when Possible

2010-11-02 Thread Namit Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12927735#action_12927735
 ] 

Namit Jain edited comment on HIVE-1750 at 11/3/10 12:58 AM:


The code changes look good to me.
Can you add some tests and do an explain plan for all kinds of scenarios:

ds  10 and x  5
ds  10 or x  5
ds  10 and x  5 and y  10
(ds  10 and x  5) or (ds  10 and y  5)
(ds  10 and x  5) or (ds  5 and y  5)
(ds  10 or x  5) and (ds  5 or y  5)
