[jira] Updated: (HIVE-1096) Hive Variables

2010-11-24 Thread Edward Capriolo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Edward Capriolo updated HIVE-1096:
--

Attachment: hive-1096-15.patch.txt

Adds xdocs.

 Hive Variables
 --

 Key: HIVE-1096
 URL: https://issues.apache.org/jira/browse/HIVE-1096
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Reporter: Edward Capriolo
Assignee: Edward Capriolo
 Fix For: 0.7.0

 Attachments: 1096-9.diff, hive-1096-10-patch.txt, 
 hive-1096-11-patch.txt, hive-1096-12.patch.txt, hive-1096-15.patch.txt, 
 hive-1096-15.patch.txt, hive-1096-2.diff, hive-1096-7.diff, hive-1096-8.diff, 
 hive-1096.diff


 From mailing list:
 --Amazon Elastic MapReduce version of Hive seems to have a nice feature 
 called Variables. Basically you can define a variable via command-line 
 while invoking hive with -d DT=2009-12-09 and then refer to the variable via 
 ${DT} within the hive queries. This could be extremely useful. I can't seem 
 to find this feature even on trunk. Is this feature currently anywhere in the 
 roadmap?--
 This could be implemented in many places.
 A simple place to put this is in Driver.compile or Driver.run: we can do 
 string substitutions at that level, and nothing further downstream need be 
 affected. 
 There could be some benefits to doing this further downstream (parser, plan), 
 but based on the simple needs we may not need to overthink this.
 I will get started on implementing it in compile unless someone wants to 
 discuss this more.
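
 The string-substitution idea described above can be sketched roughly as 
 follows. This is only an illustration in Python, not the code in the attached 
 patches: the function name substitute_variables, the variable-name pattern, 
 and the choice to leave unknown variables untouched are all assumptions.

```python
import re

# Hypothetical helper, not Hive's actual implementation.
# Matches ${NAME} where NAME is a simple identifier (an assumed rule).
_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def substitute_variables(query, variables):
    """Replace each ${NAME} in query with variables[NAME].

    Names that are not defined are left as-is, so the statement reaches
    the later stages unchanged.
    """
    def replace(match):
        name = match.group(1)
        return variables.get(name, match.group(0))

    return _VAR.sub(replace, query)


# As if hive had been started with: -d DT=2009-12-09
variables = {"DT": "2009-12-09"}
query = "SELECT * FROM logs WHERE dt = '${DT}'"
print(substitute_variables(query, variables))
# prints: SELECT * FROM logs WHERE dt = '2009-12-09'
```

 Doing the replacement on the raw query string before compilation keeps the 
 parser and plan untouched, which is the trade-off the description argues for.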

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HIVE-1096) Hive Variables

2010-11-24 Thread Edward Capriolo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Edward Capriolo updated HIVE-1096:
--

Attachment: hive-1096-20.patch.txt

There were two 15 patches. I bumped the number to 20 to clear up any confusion.

 Hive Variables
 --

 Key: HIVE-1096
 URL: https://issues.apache.org/jira/browse/HIVE-1096
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Reporter: Edward Capriolo
Assignee: Edward Capriolo
 Fix For: 0.7.0

 Attachments: 1096-9.diff, hive-1096-10-patch.txt, 
 hive-1096-11-patch.txt, hive-1096-12.patch.txt, hive-1096-15.patch.txt, 
 hive-1096-15.patch.txt, hive-1096-2.diff, hive-1096-20.patch.txt, 
 hive-1096-7.diff, hive-1096-8.diff, hive-1096.diff






[jira] Commented: (HIVE-1792) track the joins which are being converted to map-join automatically

2010-11-24 Thread Namit Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12935399#action_12935399
 ] 

Namit Jain commented on HIVE-1792:
--

Why don't we do the same in plan/ConditionalResolverCommonJoin? There we know 
what is going on.

Also, can we remove the unrelated changes in this patch, e.g. using a 
different DistributedCache API?

 track the joins which are being converted to map-join automatically
 ---

 Key: HIVE-1792
 URL: https://issues.apache.org/jira/browse/HIVE-1792
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.7.0
Reporter: Liyin Tang
Assignee: Liyin Tang
 Fix For: 0.7.0

 Attachments: hive-1792-1.patch, hive-1792-2.patch, hive-1792-3.patch


 We should be able to track how many queries (join) got converted to
 map-join




[jira] Commented: (HIVE-1792) track the joins which are being converted to map-join automatically

2010-11-24 Thread Namit Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12935450#action_12935450
 ] 

Namit Jain commented on HIVE-1792:
--

+1

running tests

 track the joins which are being converted to map-join automatically
 ---

 Key: HIVE-1792
 URL: https://issues.apache.org/jira/browse/HIVE-1792
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Affects Versions: 0.7.0
Reporter: Liyin Tang
Assignee: Liyin Tang
 Fix For: 0.7.0

 Attachments: hive-1792-1.patch, hive-1792-2.patch, hive-1792-3.patch, 
 hive-1792-4.patch






[jira] Resolved: (HIVE-650) [UDAF] implement GROUP_CONCAT(expr)

2010-11-24 Thread Jeff Hammerbacher (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Hammerbacher resolved HIVE-650.


Resolution: Duplicate

Resolving as duplicate of HIVE-707 and to concentrate conversation on that 
ticket (since most of the discussion has happened there).

 [UDAF]  implement  GROUP_CONCAT(expr)
 -

 Key: HIVE-650
 URL: https://issues.apache.org/jira/browse/HIVE-650
 Project: Hive
  Issue Type: New Feature
Reporter: Min Zhou

 It's a very useful udaf for us. 
 http://dev.mysql.com/doc/refman/5.0/en/group-by-functions.html#function_group-concat
 GROUP_CONCAT(expr)
 This function returns a string result with the concatenated non-NULL values 
 from a group. It returns NULL if there are no non-NULL values. The full 
 syntax is as follows: 
 GROUP_CONCAT([DISTINCT] expr [,expr ...]
  [ORDER BY {unsigned_integer | col_name | expr}
  [ASC | DESC] [,col_name ...]]
  [SEPARATOR str_val])
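
 The semantics quoted above (concatenate the non-NULL values of a group, NULL 
 when there are none) can be illustrated with a small Python sketch. It is not 
 a Hive UDAF; the function name and parameters are made up for illustration, 
 None stands in for NULL, and the ORDER BY clause is deliberately left out.

```python
def group_concat(values, separator=",", distinct=False):
    """Concatenate the non-None values of one group.

    Returns None when the group has no non-None values, mirroring the
    NULL result described for GROUP_CONCAT. ORDER BY is not modeled.
    """
    kept = [v for v in values if v is not None]
    if distinct:
        # dict.fromkeys drops repeats while keeping first-seen order
        kept = list(dict.fromkeys(kept))
    if not kept:
        return None
    return separator.join(str(v) for v in kept)


print(group_concat(["a", None, "b", "a"]))                 # prints: a,b,a
print(group_concat(["a", None, "b", "a"], distinct=True))  # prints: a,b
print(group_concat([None, None]))                          # prints: None
```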




[jira] Commented: (HIVE-707) add group_concat

2010-11-24 Thread Jeff Hammerbacher (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12935463#action_12935463
 ] 

Jeff Hammerbacher commented on HIVE-707:


Hey,

Given that this JIRA has been opened three separate times, and that I have 
received a recent request for it in IRC, I think it would be worth bumping to 
near the top of the queue.

Thanks,
Jeff

 add group_concat
 

 Key: HIVE-707
 URL: https://issues.apache.org/jira/browse/HIVE-707
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Reporter: Namit Jain
Assignee: Min Zhou

 Moving the discussion to a new jira:
 I've implemented group_cat() in a rush, and found some things difficult to 
 solve:
 1. Function group_cat() has an internal ORDER BY clause; currently we can't 
 implement such an aggregation in Hive.
 2. When the strings to be group-concatenated are too large, in other words 
 when data skew appears, there is often not enough memory to store such a big 
 result.




[jira] Updated: (HIVE-1797) Compressed the hashtable dump file before put into distributed cache

2010-11-24 Thread He Yongqiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

He Yongqiang updated HIVE-1797:
---

Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed! Thanks Liyin!

 Compressed the hashtable dump file before put into distributed cache
 

 Key: HIVE-1797
 URL: https://issues.apache.org/jira/browse/HIVE-1797
 Project: Hive
  Issue Type: Improvement
  Components: Query Processor
Affects Versions: 0.7.0
Reporter: Liyin Tang
Assignee: Liyin Tang
 Attachments: hive-1797.patch, hive-1797_3.patch


 Clearly, the size of the small table is the performance bottleneck for map 
 join, because the size of the small table affects both the memory usage and 
 the dumped hashtable file.
 That means there are 2 boundaries on map join performance:
 1) The memory usage for the local task and the mapred task
 2) The dumped hashtable file size for the distributed cache
 The reason the test case in the last email spends most of its execution time 
 on initializing is that it hits the second boundary.
 Since we have already bounded the memory usage, one thing we can do is make 
 sure the performance never hits the second boundary before it hits the first 
 boundary.
 Assuming the heap size is 1.6 G and the small table file size is 15M 
 compressed (75M uncompressed),
 the local task can hold roughly 1.5M unique rows in memory. 
 The dumped file size will be roughly 150M, which is too large to put into the 
 distributed cache.
  
 From experiments, we can basically conclude that when the dumped file size is 
 smaller than 30M, 
 the distributed cache works well and all the mappers will be initialized in 
 a short time (less than 30 secs).
 One easy implementation is to compress the hashtable file. 
 I use gzip to compress the hashtable file, and the file size is reduced 
 from 100M to 13M.
 After several tests, all the mappers are initialized in less than 23 secs.
 But this solution adds some decompression overhead to each mapper.
 Mappers on the same machine will do duplicated decompression work.
 Maybe in the future we can let the distributed cache support this.
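
 The compress-before-distribute step described above can be sketched as below. 
 This is a Python illustration only, not the committed patch: the helper names 
 are hypothetical, pickle stands in for whatever serialization the hashtable 
 dump really uses, and the sample table is synthetic, so the size ratio it 
 prints says nothing about the 100M to 13M figure reported here.

```python
import gzip
import os
import pickle
import tempfile


def dump_compressed(table, path):
    """Serialize the table and gzip it in one pass (hypothetical helper)."""
    with gzip.open(path, "wb") as out:
        pickle.dump(table, out)


def load_compressed(path):
    """Each reader pays the decompression cost when loading the dump."""
    with gzip.open(path, "rb") as src:
        return pickle.load(src)


# Synthetic small table: repetitive values compress well.
table = {key: "value-%d" % (key % 10) for key in range(10000)}

with tempfile.TemporaryDirectory() as tmp:
    plain = os.path.join(tmp, "hashtable.bin")
    packed = os.path.join(tmp, "hashtable.bin.gz")
    with open(plain, "wb") as out:
        pickle.dump(table, out)
    dump_compressed(table, packed)
    print("uncompressed bytes:", os.path.getsize(plain))
    print("compressed bytes:  ", os.path.getsize(packed))
    assert load_compressed(packed) == table
```

 As noted in the description, every reader decompresses its own copy, so the 
 smaller transfer is traded against repeated decompression work.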




[jira] Commented: (HIVE-1804) Mapjoin will fail if there are no files associating with the join tables

2010-11-24 Thread He Yongqiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12935528#action_12935528
 ] 

He Yongqiang commented on HIVE-1804:


This patch can not be applied cleanly. Can you regenerate a new diff?

 Mapjoin will fail if there are no files associating with the join tables
 

 Key: HIVE-1804
 URL: https://issues.apache.org/jira/browse/HIVE-1804
 Project: Hive
  Issue Type: Bug
  Components: Query Processor
Affects Versions: 0.7.0
Reporter: Liyin Tang
Assignee: Liyin Tang
 Fix For: 0.7.0

 Attachments: hive-1804-1.patch, hive-1804-2.patch


 If there are some empty tables without any file associated, the map join will 
 fail.




[jira] Updated: (HIVE-1804) Mapjoin will fail if there are no files associating with the join tables

2010-11-24 Thread He Yongqiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

He Yongqiang updated HIVE-1804:
---

Status: Open  (was: Patch Available)

 Mapjoin will fail if there are no files associating with the join tables
 

 Key: HIVE-1804
 URL: https://issues.apache.org/jira/browse/HIVE-1804
 Project: Hive
  Issue Type: Bug
  Components: Query Processor
Affects Versions: 0.7.0
Reporter: Liyin Tang
Assignee: Liyin Tang
 Fix For: 0.7.0

 Attachments: hive-1804-1.patch, hive-1804-2.patch






[jira] Created: (HIVE-1811) Show the time the local task takes

2010-11-24 Thread Liyin Tang (JIRA)
Show the time the local task takes
--

 Key: HIVE-1811
 URL: https://issues.apache.org/jira/browse/HIVE-1811
 Project: Hive
  Issue Type: Improvement
  Components: Query Processor
Affects Versions: 0.7.0
Reporter: Liyin Tang
Assignee: Liyin Tang
 Fix For: 0.7.0


After the local task finishes, show how much time it took.




[jira] Updated: (HIVE-1804) Mapjoin will fail if there are no files associating with the join tables

2010-11-24 Thread Liyin Tang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liyin Tang updated HIVE-1804:
-

Attachment: hive-1804-3.patch

Since some other patches were committed recently, I regenerated the patch 
after svn update.
Please review.

 Mapjoin will fail if there are no files associating with the join tables
 

 Key: HIVE-1804
 URL: https://issues.apache.org/jira/browse/HIVE-1804
 Project: Hive
  Issue Type: Bug
  Components: Query Processor
Affects Versions: 0.7.0
Reporter: Liyin Tang
Assignee: Liyin Tang
 Fix For: 0.7.0

 Attachments: hive-1804-1.patch, hive-1804-2.patch, hive-1804-3.patch






[jira] Commented: (HIVE-1648) Automatically gathering stats when reading a table/partition

2010-11-24 Thread Paul Butler (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12935551#action_12935551
 ] 

Paul Butler commented on HIVE-1648:
---

Namit, it looks like show table extended like `table_name`; doesn't print the 
number of rows. Unless there's a way to make it do that, I'll have to stick 
with desc extended.

I sent you an email for clarification on the ConditionalTasks also.

 Automatically gathering stats when reading a table/partition
 

 Key: HIVE-1648
 URL: https://issues.apache.org/jira/browse/HIVE-1648
 Project: Hive
  Issue Type: Sub-task
Reporter: Ning Zhang
Assignee: Paul Butler
 Attachments: HIVE-1648.2.patch, HIVE-1648.3.patch, HIVE-1648.patch


 HIVE-1361 introduces a new command 'ANALYZE TABLE T COMPUTE STATISTICS' to 
 gather stats. This requires an additional scan of the data. Stats gathering 
 can be piggy-backed on TableScanOperator whenever a table/partition is 
 scanned (given no LIMIT operator). 
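
 The piggy-backing idea can be illustrated with a small Python sketch. It is 
 not TableScanOperator code; the class CountingScan and its attributes are 
 hypothetical. It counts rows as a side effect of a scan and marks the count 
 as usable only when the scan ran to the end, which is why a LIMIT (an early 
 stop) rules it out.

```python
class CountingScan:
    """Wrap a row source and count rows while they are being scanned."""

    def __init__(self, rows):
        self._rows = rows
        self.row_count = 0
        self.complete = False  # True only after every row has been read

    def __iter__(self):
        for row in self._rows:
            self.row_count += 1
            yield row
        self.complete = True


# Full scan: the count covers the whole table and can be recorded.
full = CountingScan(range(5))
for _ in full:
    pass
print(full.row_count, full.complete)        # prints: 5 True

# Early stop, as a LIMIT 2 would cause: the count is only partial.
limited = CountingScan(range(5))
reader = iter(limited)
next(reader)
next(reader)
print(limited.row_count, limited.complete)  # prints: 2 False
```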




[jira] Updated: (HIVE-1811) Show the time the local task takes

2010-11-24 Thread Liyin Tang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liyin Tang updated HIVE-1811:
-

Attachment: hive-1811-1.patch

The original showTime code has a potential bug if the local task takes more 
than 60 sec.
This patch fixes this bug.
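
The showTime code itself is not quoted in this thread, so the exact bug is not 
known here. Purely as an illustration, and not as the patch's approach, a 
formatter that works from the total elapsed seconds stays correct past the 
60 sec mark. The function name and output format below are assumptions.

```python
def format_elapsed(start_seconds, end_seconds):
    """Format a duration from total elapsed seconds (hypothetical helper)."""
    total = int(end_seconds - start_seconds)
    minutes, seconds = divmod(total, 60)
    if minutes:
        return "%d min %d sec" % (minutes, seconds)
    return "%d sec" % seconds


print(format_elapsed(0, 45))    # prints: 45 sec
print(format_elapsed(0, 125))   # prints: 2 min 5 sec
```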

 Show the time the local task takes
 --

 Key: HIVE-1811
 URL: https://issues.apache.org/jira/browse/HIVE-1811
 Project: Hive
  Issue Type: Improvement
  Components: Query Processor
Affects Versions: 0.7.0
Reporter: Liyin Tang
Assignee: Liyin Tang
 Fix For: 0.7.0

 Attachments: hive-1811-1.patch


