[ 
https://issues.apache.org/jira/browse/TEZ-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14512245#comment-14512245
 ] 

Siddharth Seth commented on TEZ-2368:
-------------------------------------

The hashcode will not be sufficient, since it can be, and likely will be, 
different across two different JVMs. 
We currently end up generating data in ${appId}/constant/${uniqueId} - which 
makes cleanup very difficult.

The main limitation in changing this path is the way we are tied to the MR 
ShuffleHandler which only knows how to process this path.
Creating dag specific dirs is an option, but only after the ShuffleHandler 
changes.

When an external ShuffleHandler is used, this API provides the relevant 
information to create dag specific dirs instead of using the app dir directly.
The API isn't exposing the dagId directly. What it does expose is a small 
unique identifier for each dag running in an application - which can be useful. 
Caching would be an alternate use for something like this. It's similar to a 
vertexIndex API which exists on the context impls - which is present for 
exactly the same reason - to generate names.
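
As a rough sketch of how such a per-dag identifier could be used to generate 
names: the class, the method names, the "dag_" prefix and the sample 
application id below are all hypothetical and not Tez APIs, and the int 
argument simply stands in for whatever value the context would return.

```java
// Hypothetical sketch, not Tez code: derive a dag specific directory name
// from the application id plus a small per-dag identifier.
final class DagDirSketch {

    // Directory shared by every dag in the application (the layout that
    // makes per-dag cleanup hard).
    static String appDir(String appId) {
        return appId;
    }

    // Directory for a single dag. The identifier is assumed to be a small
    // non-negative number that is unique within the application.
    static String dagDir(String appId, int dagIdentifier) {
        if (dagIdentifier < 0) {
            throw new IllegalArgumentException("dagIdentifier must be >= 0");
        }
        // "dag_" is an arbitrary prefix chosen for this example.
        return appDir(appId) + "/dag_" + dagIdentifier;
    }

    public static void main(String[] args) {
        String appId = "application_1429000000000_0001"; // made-up id
        // Two dags in the same application get distinct directories, so
        // either one can be removed without touching the other.
        System.out.println(dagDir(appId, 1));
        System.out.println(dagDir(appId, 2));
    }
}
```

Because the identifier is a plain number, it is safe to embed in a directory 
name and is the same in every JVM, unlike a hashcode.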

bq. External services are not meant to be using these context classes in any 
case. Or am I missing something?
External services can use the components in the RuntimeLibrary, all of which 
depend on the Context classes. What that does mean is construction/usage of 
these classes will eventually need to be exposed as a limited public API - 
likely tied to specific Tez versions as it evolves.

> Make the dag number available in Context classes
> ------------------------------------------------
>
>                 Key: TEZ-2368
>                 URL: https://issues.apache.org/jira/browse/TEZ-2368
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Siddharth Seth
>            Assignee: Siddharth Seth
>         Attachments: TEZ-2368.1.txt
>
>
> Provide the dag number, which is a unique number, for each dag running within 
> an application in the TezInputContext, TezOutputContext, TezProcessorContext.
> When containers are re-used, or for external services, this can be used to 
> generate intermediate data to a dag specific directory instead of an 
> application specific directory, where it becomes difficult to differentiate 
> between different dags.
> The DAG name does provide this - but is not suitable for use in a directory 
> name. Hashing the name is an option, but can lead to collisions.
> Generating data into a dag specific directory will eventually only be usable 
> when we move away from the default MR handler, or enhance it to support an 
> additional parameter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)