On 7/25/14, 3:20 PM, Johannes Zillmann wrote:
Ok, will try this, thanks!
Can you tell me the JIRA number so I can track progress on that? Do you
know which version of Tez this is planned for?
And no, no more use cases currently!
That JIRA covers a fairly complex scale problem, so it will take me a
while to commit it, because it needs extensive testing.
But I could possibly split the vertex parallelism feature out into
its own JIRA, to satisfy the common requirements.
Just to clarify that again, this WIP patch provides the vertex
parallelism for the current vertex to each task.
So I'm paraphrasing your use-case as
"Map 1" outputs sample-count/parallelism("Map 1")
And
"Reducer 2" uses parallelism("Reducer 2") to decide details of runtime.
My patch waits until the runtime parallelism is set for "Map 1" and
"Reducer 2". Input tasks can use "-1" at DAG build time, to have the
vertex manager set it up according to cluster capacity/split sizes.
Aggregation tasks can have it set within a min/max range, according to
the data output sizes from the preceding vertex.
That is what Bikas confirmed.
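
To illustrate the min/max idea for an aggregation vertex, here is a minimal sketch in Python. It is not the Tez vertex manager logic; the function name and the `target_bytes_per_task` parameter are assumptions made up for this example.

```python
def pick_parallelism(upstream_output_bytes: int,
                     target_bytes_per_task: int,
                     min_tasks: int,
                     max_tasks: int) -> int:
    """Pick a task count for an aggregation vertex (hypothetical helper).

    The count grows with the output size of the preceding vertex and is
    clamped to the [min_tasks, max_tasks] range.
    """
    if target_bytes_per_task <= 0:
        raise ValueError("target_bytes_per_task must be positive")
    if min_tasks < 1 or max_tasks < min_tasks:
        raise ValueError("need 1 <= min_tasks <= max_tasks")
    # Integer ceiling division: tasks needed at the assumed per-task target.
    wanted = -(-upstream_output_bytes // target_bytes_per_task)
    # Clamp into the configured range.
    return max(min_tasks, min(max_tasks, wanted))
```

Tasks in that vertex would then see whatever value was finally chosen, which is why the parallelism has to be read at runtime rather than at DAG build time.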
It seems to me that "scan until you find at least X records" for a
sparse sampling case is something that can benefit a lot from a custom
vertex manager plugin.
So I would like to hear more about your use case and see if the work
that goes into scan short-circuiting/avoidance in "select * from table
where x < 10 limit 10;" in Hive would help you as well.
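
As a rough picture of what "scan until you find at least X records" means inside a single task, here is a small Python sketch. It only shows the early exit within one scan; it says nothing about how a vertex manager plugin would coordinate tasks, and `scan_until` is a hypothetical name.

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def scan_until(records: Iterable[T],
               predicate: Callable[[T], bool],
               limit: int) -> List[T]:
    """Return up to `limit` matching records, stopping the scan early."""
    matches: List[T] = []
    if limit <= 0:
        return matches
    for record in records:
        if predicate(record):
            matches.append(record)
            if len(matches) >= limit:
                # Enough records found: stop reading the input.
                break
    return matches
```

For example, `scan_until(range(100), lambda x: x < 10, 10)` stops after reading the tenth record instead of scanning all 100.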
Cheers,
Gopal
On 24 Jul 2014, at 20:38, Bikas Saha <[email protected]> wrote:
The patch should work for all types of vertices because it gives each task
the total number of tasks for its vertex. Do you have any other use case?
Bikas
-----Original Message-----
From: Johannes Zillmann [mailto:[email protected]]
Sent: Thursday, July 24, 2014 6:58 AM
To: Gopal V
Cc: [email protected]
Subject: Re: Task count
Hey Gopal,
We use the task count for basically two things (in MR, for both the map
stage and the reduce stage):
- Each task samples its output data up to a certain number of records.
That number is the desired sample count divided by the number of tasks.
- We also use the task count in some scenarios to let the last task (of a
stage or a vertex) do some extra logic. That works in combination with
the task index.
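
The two uses above can be sketched in Python as follows. The function and parameter names are hypothetical, and handing the division remainder to the first tasks is an assumption added so that the quotas add up to the desired count; the description above only says "divided by the number of tasks".

```python
def sample_quota(desired_sample_count: int,
                 task_count: int,
                 task_index: int) -> int:
    """Number of records the task with this index should sample."""
    if task_count <= 0:
        raise ValueError("task_count must be positive")
    if not 0 <= task_index < task_count:
        raise ValueError("task_index out of range")
    base, remainder = divmod(desired_sample_count, task_count)
    # Assumption: the first `remainder` tasks each take one extra record.
    return base + (1 if task_index < remainder else 0)


def is_last_task(task_count: int, task_index: int) -> bool:
    """True for the task that should run the extra end-of-vertex logic."""
    return task_index == task_count - 1
```

Both helpers need only the task index and the total task count of the vertex, which is the information asked for in this thread.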
Looking at your patch, it looks like it will do the job for the map-like
kind of vertex but not for the aggregation vertex, right?
Also, which JIRA issue is that?
best
Johannes
On 24 Jul 2014, at 07:40, Gopal V <[email protected]> wrote:
On 7/23/14, 6:07 PM, Johannes Zillmann wrote:
Hey Tez team,
Is there some way to get the task count within a vertex from within a
task?
Some equivalent to mapred.map.tasks and mapred.reduce.tasks for
map-reduce?
Could you explain the use-case for this particular requirement?
I intend to add the vertex parallelism to the task context as part of
one of my WIP branches.
I uploaded my base patch-set as is (including the TODO markers).
https://issues.apache.org/jira/secure/attachment/12657536/TEZ-broadcast-shuffle%2Bvertex-parallelism.patch
If you can explain what you are actually looking to do with this
information, perhaps I can roll the two feature reqs together.
Cheers,
Gopal