[jira] [Commented] (HIVE-6613) Control when specific Inputs / Outputs are started

2014-03-19 Thread Gunther Hagleitner (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13941001#comment-13941001
 ] 

Gunther Hagleitner commented on HIVE-6613:
--

Ran tests locally. No errors.

> Control when specific Inputs / Outputs are started
> -
>
> Key: HIVE-6613
> URL: https://issues.apache.org/jira/browse/HIVE-6613
> Project: Hive
> Issue Type: Improvement
> Reporter: Siddharth Seth
> Assignee: Siddharth Seth
> Attachments: HIVE-6613.2.txt, HIVE-6613.3.patch, TEZ-6613.1.txt
>
>
> When running with Tez, a couple of enhancements are possible:
> 1) Avoid re-fetching data in case of MapJoins, since the data is likely to 
> be cached after the first run (container re-use for the same query).
> 2) Start Outputs only after required Inputs are ready. This is specifically 
> useful in case of Reduce, where shuffle requires a large amount of memory, and 
> the Output (if it's a sorted output) also requires a fair amount of memory.
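
As an illustration of item 2 in the description, here is a minimal sketch of starting outputs only after the required inputs have been started. The class and method names (StartOrderSketch, Endpoint, startInOrder) are hypothetical and are not taken from the patch or from the Tez API. The sketch only shows the ordering idea, not how Hive or Tez implement it.

```java
import java.util.ArrayList;
import java.util.List;

class StartOrderSketch {
    /** Hypothetical stand-in for a task Input or Output; not a Tez interface. */
    static class Endpoint {
        final String name;
        boolean started = false;

        Endpoint(String name) {
            this.name = name;
        }

        void start() {
            started = true;
        }
    }

    /**
     * Starts every required input before any output and returns the names
     * in the order they were started.
     */
    static List<String> startInOrder(List<Endpoint> requiredInputs,
                                     List<Endpoint> outputs) {
        List<String> order = new ArrayList<>();
        for (Endpoint in : requiredInputs) {
            in.start();
            order.add(in.name);
        }
        // Assumption: a real task would wait here until the inputs report
        // ready, so that the shuffle and a sorted output do not both hold
        // large buffers at the same time.
        for (Endpoint out : outputs) {
            out.start();
            order.add(out.name);
        }
        return order;
    }
}
```

With one shuffle input and one sorted output, the returned order lists the input first and the output second.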



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-6613) Control when specific Inputs / Outputs are started

2014-03-18 Thread Gunther Hagleitner (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13939919#comment-13939919
 ] 

Gunther Hagleitner commented on HIVE-6613:
--

+1 LGTM



[jira] [Commented] (HIVE-6613) Control when specific Inputs / Outputs are started

2014-03-17 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938521#comment-13938521
 ] 

Siddharth Seth commented on HIVE-6613:
--

Review board - https://reviews.apache.org/r/19327/



[jira] [Commented] (HIVE-6613) Control when specific Inputs / Outputs are started

2014-03-17 Thread Gunther Hagleitner (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938510#comment-13938510
 ] 

Gunther Hagleitner commented on HIVE-6613:
--

[~sseth] The second patch seems to be missing the definition of TezCacheAccess. 
Also, if it's not too much trouble, a review board link would be nice.

> Control when specific Inputs / Outputs are started
> -
>
> Key: HIVE-6613
> URL: https://issues.apache.org/jira/browse/HIVE-6613
> Project: Hive
> Issue Type: Improvement
> Reporter: Siddharth Seth
> Assignee: Siddharth Seth
> Attachments: HIVE-6613.2.txt, TEZ-6613.1.txt
>
>
> When running with Tez, a couple of enhancements are possible:
> 1) Avoid re-fetching data in case of MapJoins, since the data is likely to 
> be cached after the first run (container re-use for the same query).
> 2) Start Outputs only after required Inputs are ready. This is specifically 
> useful in case of Reduce, where shuffle requires a large amount of memory, and 
> the Output (if it's a sorted output) also requires a fair amount of memory.





[jira] [Commented] (HIVE-6613) Control when specific Inputs / Outputs are started

2014-03-12 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13932870#comment-13932870
 ] 

Siddharth Seth commented on HIVE-6613:
--

Thanks for taking a look.

bq. Can you avoid creating a conf in TezCacheAccess? Maybe just pass it in get().
I was doing this once in the static block to avoid needing a Configuration 
instance to access this class. TezCacheAccess is only supposed to be used with 
Tez. I could skip the factory altogether and instantiate the Tez cache 
directly? (The Configuration creation in this case should be very cheap since 
it isn't accessing external files.)

bq. Have you considered adding the input to the cache key instead of using a Set?
The set just records which inputs are cached together. I can use individual 
keys if you think that's better. That would get rid of the lock, since its 
primary purpose is to control the set creation.

bq. You can drop the getLocalWork check in the tez hashtable loader. Tez doesn't have local work.
bq. The javadoc of the init function needs to be updated with your changes.
Will fix.
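
The per-input key idea discussed above could look roughly like the sketch below. It is only an illustration: InputCacheSketch, key, markCached and isCached are hypothetical names, and a concurrent key set stands in for whatever thread-safe registry the task actually uses. It is not the TezCacheAccess class from the patch. The point is that when each input gets its own key in a thread-safe store, no separate lock is needed to create a shared Set.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch; not Hive's TezCacheAccess or Tez's ObjectRegistry. */
class InputCacheSketch {
    // Stand-in for a thread-safe registry shared by tasks in a reused container.
    private final Set<String> cachedKeys = ConcurrentHashMap.newKeySet();

    // One key per (query, input) pair instead of one Set per query.
    static String key(String queryId, String inputName) {
        return queryId + "/" + inputName;
    }

    /** Records that an input is cached; returns true only for the first call. */
    boolean markCached(String queryId, String inputName) {
        return cachedKeys.add(key(queryId, inputName));
    }

    /** Reports whether the given input was already marked as cached. */
    boolean isCached(String queryId, String inputName) {
        return cachedKeys.contains(key(queryId, inputName));
    }
}
```

A task would call isCached before fetching a MapJoin input and markCached after loading it. The query id and input name used to build the key are assumptions made for this example.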

> Control when specific Inputs / Outputs are started
> -
>
> Key: HIVE-6613
> URL: https://issues.apache.org/jira/browse/HIVE-6613
> Project: Hive
> Issue Type: Improvement
> Reporter: Siddharth Seth
> Assignee: Siddharth Seth
> Attachments: TEZ-6613.1.txt
>
>
> When running with Tez, a couple of enhancements are possible:
> 1) Avoid re-fetching data in case of MapJoins, since the data is likely to 
> be cached after the first run (container re-use for the same query).
> 2) Start Outputs only after required Inputs are ready. This is specifically 
> useful in case of Reduce, where shuffle requires a large amount of memory, and 
> the Output (if it's a sorted output) also requires a fair amount of memory.





[jira] [Commented] (HIVE-6613) Control when specific Inputs / Outputs are started

2014-03-12 Thread Gunther Hagleitner (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13932593#comment-13932593
 ] 

Gunther Hagleitner commented on HIVE-6613:
--

Looks good. Cool to see this - should really help with memory in reducers.

A couple of comments:

- Can you avoid creating a conf in TezCacheAccess? Maybe just pass it in get().
- Have you considered adding the input to the cache key instead of using a Set? 
That way you can also remove the lock (I'm assuming the ObjectRegistry handles 
that).
- You can drop the getLocalWork check in the Tez hashtable loader. Tez doesn't 
have local work.
- The javadoc of the init function needs to be updated with your changes.
