Hey Gopal, I went ahead and created https://issues.apache.org/jira/browse/TEZ-1360 for the task count thing!
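
For reference, here is roughly what that boils down to on our side - a minimal
sketch only, assuming the patch ends up exposing a getVertexParallelism()
accessor next to the existing getTaskIndex() on the processor context (the
accessor name and the ProcessorContext type are my assumptions about the API,
not something confirmed on this thread):

    import org.apache.tez.runtime.api.ProcessorContext;

    public final class SamplingSupport {

      // Total number of samples we want across the whole vertex.
      private static final int DESIRED_SAMPLE_COUNT = 10000;

      // Use case 1: each task samples its output only up to a quota of
      // desired-sample-count / task-count.
      public static int perTaskQuota(ProcessorContext context) {
        int taskCount = context.getVertexParallelism(); // assumed accessor
        return Math.max(1, DESIRED_SAMPLE_COUNT / taskCount);
      }

      // Use case 2: let the last task of the vertex run some extra logic,
      // by comparing the task-index against the task count.
      public static boolean isLastTask(ProcessorContext context) {
        return context.getTaskIndex() == context.getVertexParallelism() - 1;
      }
    }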
Johannes

On 25 Jul 2014, at 19:09, Gopal V <[email protected]> wrote:

> On 7/25/14, 3:20 PM, Johannes Zillmann wrote:
>
>> OK, will try this, thanks!
>> Can you tell me the JIRA number so I can track progress on that? Do you
>> know for which version of Tez this is planned?
>>
>> And no, no more use cases currently!
>
> That JIRA is a fairly complex scale problem, so it will take a while to
> commit, because it needs extensive testing.
>
> But I could possibly split the vertex parallelism feature out into its
> own JIRA, to satisfy the common requirements.
>
> Just to clarify that again: this WIP patch provides the vertex parallelism
> of the current vertex to each task.
>
> So I'm paraphrasing your use-case as
>
> "Map 1" outputs sample-count/parallelism("Map 1")
>
> and
>
> "Reducer 2" uses parallelism("Reducer 2") to decide details of runtime.
>
> My patch waits until the runtime parallelism is set for "Map 1" and
> "Reducer 2" - input tasks can use "-1" at DAG build time, to have the
> vertex manager set it according to cluster capacity/split sizes.
> Aggregation tasks can set it between a min/max range, according to data
> output sizes from the preceding vertex.
>
> That is what Bikas confirmed.
>
> It seems to me that "scan until you find at least X records" for a sparse
> sampling case is something that can benefit a lot from a custom vertex
> manager plugin.
>
> So I would like to hear more about your use case and see whether the work
> that goes into scan short-circuiting/avoidance in "select * from table
> where x < 10 limit 10;" in Hive would help you as well.
>
> Cheers,
> Gopal
>
>> On 24 Jul 2014, at 20:38, Bikas Saha <[email protected]> wrote:
>>
>>> The patch should work for all types of vertices because it gives each
>>> task the total number of tasks for its vertex. Do you have any other
>>> use case?
>>>
>>> Bikas
>>>
>>> -----Original Message-----
>>> From: Johannes Zillmann [mailto:[email protected]]
>>> Sent: Thursday, July 24, 2014 6:58 AM
>>> To: Gopal V
>>> Cc: [email protected]
>>> Subject: Re: Task count
>>>
>>> Hey Gopal,
>>>
>>> we use the task count basically for two things (in MR for both the map
>>> stage and the reduce stage):
>>> - each task samples its output data up to a certain number; this number
>>> is the desired sample count divided by the number of tasks.
>>> - we also use the task count in some scenarios to let the last task (of
>>> a stage or a vertex) do some extra logic; that works in combination
>>> with the task-index.
>>>
>>> Looking at your patch, it looks like it will do the job for the map-like
>>> vertex but not for the aggregation vertex, right?
>>> Also, which JIRA issue is that?
>>>
>>> best
>>> Johannes
>>>
>>> On 24 Jul 2014, at 07:40, Gopal V <[email protected]> wrote:
>>>
>>>> On 7/23/14, 6:07 PM, Johannes Zillmann wrote:
>>>>> Hey Tez team,
>>>>>
>>>>> is there some way to get the task count within a vertex from within
>>>>> a task? Some equivalent to mapred.map.tasks and mapred.reduce.tasks
>>>>> for map-reduce?
>>>>
>>>> Could you explain the use-case for this particular requirement?
>>>>
>>>> I intend to add the vertex parallelism to the task context as part of
>>>> one of my WIP branches.
>>>>
>>>> I uploaded my base patch-set as is (including the TODO markers).
>>>>
>>>> https://issues.apache.org/jira/secure/attachment/12657536/TEZ-broadcast-shuffle%2Bvertex-parallelism.patch
>>>>
>>>> If you can explain what you are actually looking to do with this
>>>> information, perhaps I can roll the two feature reqs together.
>>>>
>>>> Cheers,
>>>> Gopal
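
PS: for anyone finding this thread later, the "-1 at DAG build time" part
Gopal describes maps onto the DAG API roughly like this - again just a sketch
under my assumptions (the Vertex.create(name, processor, parallelism) factory
form, and the com.example processor class names, are placeholders, not
confirmed against the WIP branch):

    import org.apache.tez.dag.api.DAG;
    import org.apache.tez.dag.api.ProcessorDescriptor;
    import org.apache.tez.dag.api.Vertex;

    public final class DeferredParallelismDag {

      public static DAG build() {
        // -1 defers the task count: the vertex manager fills it in at
        // runtime from cluster capacity / split sizes.
        Vertex map1 = Vertex.create("Map 1",
            ProcessorDescriptor.create("com.example.SamplingProcessor"),
            -1);
        // The aggregation vertex can likewise leave its parallelism open;
        // per Gopal it gets picked between a min/max range based on the
        // output sizes of the preceding vertex.
        Vertex reducer2 = Vertex.create("Reducer 2",
            ProcessorDescriptor.create("com.example.AggregationProcessor"),
            -1);
        // Edge between Map 1 and Reducer 2 omitted for brevity.
        return DAG.create("sampling-dag")
            .addVertex(map1)
            .addVertex(reducer2);
      }
    }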
