Re: Task count

Gopal V Fri, 25 Jul 2014 10:10:32 -0700

On 7/25/14, 3:20 PM, Johannes Zillmann wrote:

Ok, will try this, thanks!
Can you tell me the JIRA number so I can track progress on that? Do you know
which version of Tez this is planned for?


And no, no more use cases currently!

That JIRA is a fairly complex scale problem, so I will take a while to commit it, because it needs extensive testing.

But I could possibly split the vertex parallelism feature out into its own JIRA, to satisfy the common requirements.

Just to clarify that again, this WIP patch provides the vertex parallelism for the current vertex to each task.


So I'm paraphrasing your use-case as

"Map 1" outputs sample-count/parallelism("Map 1")

And

"Reducer 2" uses parallelism("Reducer 2") to decide details of runtime.

My patch waits until the runtime parallelism is set for "Map 1" and "Reducer 2" - input tasks can use "-1" at DAG build time, to have the vertex manager set it up according to cluster capacity/split sizes. Aggregation tasks can set it up between a min/max range, according to data output sizes from the preceding vertex.
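
As a rough illustration of the min/max idea for aggregation vertices, here is a minimal plain-Java sketch. It is not the Tez vertex manager code: the class ParallelismEstimate, the method chooseParallelism and the desiredBytesPerTask parameter are all hypothetical names, and the rule (ceiling of output size over a target size per task, clamped to the range) is only an assumed policy.

```java
// Hypothetical sketch, not part of Tez: pick a task count for an aggregation
// vertex from the output size of the preceding vertex, clamped to [min, max].
final class ParallelismEstimate {

    private ParallelismEstimate() {}

    static int chooseParallelism(long outputBytes, long desiredBytesPerTask,
                                 int minTasks, int maxTasks) {
        if (outputBytes < 0 || desiredBytesPerTask <= 0) {
            throw new IllegalArgumentException("sizes must be non-negative / positive");
        }
        if (minTasks < 1 || maxTasks < minTasks) {
            throw new IllegalArgumentException("need 1 <= minTasks <= maxTasks");
        }
        // Ceiling division without risking overflow on very large inputs.
        long needed = outputBytes / desiredBytesPerTask
                + (outputBytes % desiredBytesPerTask == 0 ? 0 : 1);
        // Clamp into the configured range; the result fits in an int because
        // it can never exceed maxTasks.
        return (int) Math.max(minTasks, Math.min(maxTasks, needed));
    }

    public static void main(String[] args) {
        // 1000 bytes of upstream output, 100 bytes per task, range 2..50.
        System.out.println(chooseParallelism(1000L, 100L, 2, 50));
    }
}
```

Whatever rule is actually used, the point is that the final number is only known at runtime, which is why a task has to read it from its context rather than from the DAG definition.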


That is what Bikas confirmed.

It seems to me that "scan until you find at least X records" for a sparse sampling case is something that can benefit a lot from a custom vertex manager plugin.

So I would like to hear more about your use case and see if the work that goes into scan short-circuiting/avoidance in "select * from table where x < 10 limit 10;" in Hive would help you as well.


Cheers,
Gopal

On 24 Jul 2014, at 20:38, Bikas Saha <[email protected]> wrote:

The patch should work for all types of vertices because it gives each task
the total number of tasks for its vertex. Do you have any other use case?

Bikas

-----Original Message-----
From: Johannes Zillmann [mailto:[email protected]]
Sent: Thursday, July 24, 2014 6:58 AM
To: Gopal V
Cc: [email protected]
Subject: Re: Task count

Hey Gopal,

We use the task count basically for two things (in MR, for both the map stage
and the reduce stage):
- each task samples its output data up to a certain number. This number is
the desired sample count divided by the number of tasks
- we also use the task count in some scenarios to let the last task (of a
stage or a vertex) do some extra logic. That works in combination with the
task index.
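
The two uses above can be sketched in plain Java, assuming the task is given its own index and the total task count of its vertex. The class TaskSampling and its methods are hypothetical names, not a Tez or MapReduce API, and handing the division remainder to the lowest-indexed tasks is an added assumption so that the quotas sum to the desired sample count.

```java
// Hypothetical helper: neither the class nor its methods are part of Tez.
final class TaskSampling {

    private TaskSampling() {}

    /** Records this task should sample so all tasks together reach desiredSamples. */
    static int sampleQuota(int desiredSamples, int numTasks, int taskIndex) {
        checkTask(numTasks, taskIndex);
        if (desiredSamples < 0) {
            throw new IllegalArgumentException("desiredSamples must be >= 0");
        }
        int base = desiredSamples / numTasks;
        int remainder = desiredSamples % numTasks;
        // Assumption: the first `remainder` tasks each take one extra record.
        return taskIndex < remainder ? base + 1 : base;
    }

    /** True for the task with the highest index in its stage or vertex. */
    static boolean isLastTask(int numTasks, int taskIndex) {
        checkTask(numTasks, taskIndex);
        return taskIndex == numTasks - 1;
    }

    private static void checkTask(int numTasks, int taskIndex) {
        if (numTasks < 1 || taskIndex < 0 || taskIndex >= numTasks) {
            throw new IllegalArgumentException("need 0 <= taskIndex < numTasks");
        }
    }

    public static void main(String[] args) {
        int numTasks = 4;        // would come from the vertex parallelism
        int desiredSamples = 10; // overall sample target
        for (int i = 0; i < numTasks; i++) {
            System.out.println("task " + i
                    + ": quota=" + sampleQuota(desiredSamples, numTasks, i)
                    + ", last=" + isLastTask(numTasks, i));
        }
    }
}
```

With 10 samples over 4 tasks this gives quotas of 3, 3, 2 and 2, and only task 3 is treated as the last task.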

Looking at your patch, it looks like it will do the job for map-like
vertices but not for the aggregation vertex, right?
Also, which JIRA issue is that?

best
Johannes

On 24 Jul 2014, at 07:40, Gopal V <[email protected]> wrote:

On 7/23/14, 6:07 PM, Johannes Zillmann wrote:

Hey Tez team,

is there some way to get the task count within a vertex from within a task?
Some equivalent to mapred.map.tasks and mapred.reduce.tasks for map-reduce?


Could you explain the use-case for this particular requirement?

I intend to add the vertex parallelism to the task context as part of one of my WIP branches.


I uploaded my base patch-set as is (including the TODO markers).

https://issues.apache.org/jira/secure/attachment/12657536/TEZ-broadcast-shuffle%2Bvertex-parallelism.patch


If you can explain what you are actually looking to do with this information, perhaps I can roll the two feature reqs together.


Cheers,
Gopal


