[
https://issues.apache.org/jira/browse/TEZ-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14036347#comment-14036347
]
Rohini Palaniswamy commented on TEZ-1080:
-----------------------------------------
bq. What will not though, is having tez properties in a configuration file like
hive-site or pig-site. Similarly properties specified via command line will
need to be handled. Properties in those files will need to be handled by
Hive/Pig.
Pig can pick up Tez properties from pig-site and from the command line and pass
them on to Tez. That should not be an issue.
bq. What this is trying to do is get rid of unnecessary settings which
are otherwise sent over the wire to configure intermediate data edges.
I understand this and the need to keep the payload simple. But this is going
to add a lot of checks to the Pig and Hive code. Also, if Tez introduces a new
setting, we will have to make code changes to set it and roll out a new
release, instead of the new setting just taking effect when passed via
tez-site.xml, pig-site.xml or the command line, as it does now. And that new
release will not work with older versions of Tez. That will make it really
difficult.
Just a thought: instead of the builder, can we have some API which looks at the
settings, takes only the ones it understands, and returns a trimmed-down
version? For the combiner and partitioner you can still have separate setter
APIs, though, as we should definitely know what needs to be passed.
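To make the suggestion concrete, a minimal sketch of such a filtering API might
look like the following. The class name EdgeConfigFilter, the method trim, and
the idea of passing the understood keys as a Set are all assumptions made for
illustration; none of this is an existing Tez API.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Tez: keeps only the settings that the
// edge configuration understands and drops everything else.
final class EdgeConfigFilter {
    private EdgeConfigFilter() {}

    /**
     * Returns a read-only copy of 'all' that contains only the entries whose
     * key appears in 'understoodKeys'. The input map is not modified.
     */
    static Map<String, String> trim(Map<String, String> all,
                                    Set<String> understoodKeys) {
        Map<String, String> trimmed = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : all.entrySet()) {
            if (understoodKeys.contains(entry.getKey())) {
                trimmed.put(entry.getKey(), entry.getValue());
            }
        }
        return Collections.unmodifiableMap(trimmed);
    }
}
```

With something of this shape, Pig or Hive would hand over everything collected
from tez-site.xml, pig-site.xml and the command line, and the set of understood
keys would live on the Tez side, so a new Tez setting would not require a new
Pig release.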
bq. Eventually, I'd imagine Pig would want to configure things like the sort
buffer size based on container and data sizes, rather than letting users
overwrite it.
We really wish we could do it. We even wish that we could change container
sizes based on data sizes. But that is not easily possible, as we don't know
beforehand how much data is going to flow through. We only know the size of
the data at the root vertex. We had the same problem with determining
parallelism, and we ended up only doing guesstimates (if that does not work,
the user can specify the parallel clause). One approach that we did discuss
early on was to allow the user to set settings in different parts of the
script, which would take effect for the lines below; that is more manual. To
do it automatically, one approach would be to have an external system store a
lot of stats on each vertex execution (similar to what Twitter does with
hRaven and Pig) and use that to determine the configuration in advance. So at
the moment we are far from Pig determining the optimal configuration
beforehand.
> Configuration for non MR based Inputs/Outputs
> ---------------------------------------------
>
> Key: TEZ-1080
> URL: https://issues.apache.org/jira/browse/TEZ-1080
> Project: Apache Tez
> Issue Type: Sub-task
> Reporter: Siddharth Seth
> Assignee: Siddharth Seth
> Attachments: TEZ-1080.wip.1.txt, TEZ-1080.wip.2.txt
>
>
> De-link configuration from MRHelpers (except for the YARNRunner case), and
> allow for these to be configured easily - exposing necessary setters /
> getters without having to rely on config keys.