[jira] [Commented] (TEZ-1080) Configuration for non MR based Inputs/Outputs

Rohini Palaniswamy (JIRA) Wed, 18 Jun 2014 13:48:56 -0700

    [ 
https://issues.apache.org/jira/browse/TEZ-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14036347#comment-14036347
 ]


Rohini Palaniswamy commented on TEZ-1080:
-----------------------------------------

bq. What will not though, is having tez properties in a configuration file like 
hive-site or pig-site. Similarly properties specified via command line will 
need to be handled. Properties in those files will need to be handled by 
Hive/Pig. 
   Pig can take care of reading Tez properties from pig-site and from the 
command line and passing them to Tez. That should not be an issue.

bq. What this is trying to do is to get rid of unnecessary settings which 
are otherwise sent over the wire to configure intermediate data edges.
   I understand this and the need to keep the payload simple. But this is going 
to add a lot of checks to Pig and Hive code. Also, if Tez introduces a new 
setting, we will have to make code changes to set it and roll out a new 
release, instead of the new setting just taking effect when passed via 
tez-site.xml, pig-site.xml or the command line, as it does now. And that new 
release will not work with older versions of Tez. That will make it really 
difficult. Just a thought: instead of the builder, can we have an API which 
looks at the settings, takes only the ones it understands, and returns a 
trimmed-down version? For the combiner and partitioner you can still have 
separate setter APIs, though, as we should definitely know what needs to be 
passed.
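
A rough sketch of the kind of trimming API I mean, in plain Java. Everything 
here is hypothetical: the class name EdgeConfigTrimmer, the trim() method and 
the "tez.runtime." prefix are assumptions for illustration, not existing Tez 
APIs or a statement of which keys Tez actually reads. The idea is only that 
the filter is driven by what Tez says it understands, so Pig and Hive do not 
need a code change when a new setting appears.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper, not part of Tez: keeps only the settings that an
// intermediate edge is assumed to understand and drops everything else.
final class EdgeConfigTrimmer {

    // Assumed key prefix for this sketch; the real Tez key names may differ.
    static final String ASSUMED_PREFIX = "tez.runtime.";

    private EdgeConfigTrimmer() {}

    // Returns a new map holding only the entries whose key starts with the
    // assumed prefix or is listed explicitly in extraKeys. The input map is
    // not modified.
    static Map<String, String> trim(Map<String, String> all, Set<String> extraKeys) {
        Map<String, String> trimmed = new TreeMap<>();
        for (Map.Entry<String, String> e : all.entrySet()) {
            String key = e.getKey();
            if (key.startsWith(ASSUMED_PREFIX) || extraKeys.contains(key)) {
                trimmed.put(key, e.getValue());
            }
        }
        return trimmed;
    }
}
```

Pig would call trim() on the merged settings from tez-site.xml, pig-site.xml 
and the command line, and put only the returned map into the edge payload. 
The combiner and partitioner would still go through their own setters.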

bq. Eventually, I'd imagine Pig would want to configure things like the sort 
buffer size based on container and data sizes, rather than letting users 
overwrite it.
   We really wish we could do it. We even wish that we could change container 
sizes based on data sizes. But that is not easily possible, as we don't know 
beforehand the amount of data that is going to flow through. We only know the 
size of the data at the root vertex. We had the same problem with determining 
parallelism and only ended up doing guesstimates (if that does not work, the 
user can specify a parallel clause). One approach that we did discuss early on 
was to allow the user to set settings in different parts of the script, which 
would take effect for the lines below; that is more manual. To do it 
automatically, one approach would be to have an external system store a lot of 
stats on each vertex execution (similar to what Twitter does with hRaven and 
Pig) and use those to determine the configuration in advance. So at the moment 
we are far from Pig determining the optimal configuration beforehand.

> Configuration for non MR based Inputs/Outputs
> ---------------------------------------------
>
>                 Key: TEZ-1080
>                 URL: https://issues.apache.org/jira/browse/TEZ-1080
>             Project: Apache Tez
>          Issue Type: Sub-task
>            Reporter: Siddharth Seth
>            Assignee: Siddharth Seth
>         Attachments: TEZ-1080.wip.1.txt, TEZ-1080.wip.2.txt
>
>
> De-link configuration from MRHelpers (except for the YARNRunner case), and 
> allow for these to be configured easily - exposing necessary setters / 
> getters without having to rely on config keys.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
