Hey Gopal, I went ahead and created https://issues.apache.org/jira/browse/TEZ-1360 for the task count thing!
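
For reference, here is roughly what that boils down to on our side - a minimal
sketch only, assuming the patch ends up exposing a getVertexParallelism()
accessor next to the existing getTaskIndex() on the processor context (the
accessor name and the ProcessorContext type are my assumptions about the API,
not something confirmed on this thread):

    import org.apache.tez.runtime.api.ProcessorContext;

    public final class SamplingSupport {

      // Total number of samples we want across the whole vertex.
      private static final int DESIRED_SAMPLE_COUNT = 10000;

      // Use case 1: each task samples its output only up to a quota of
      // desired-sample-count / task-count.
      public static int perTaskQuota(ProcessorContext context) {
        int taskCount = context.getVertexParallelism(); // assumed accessor
        return Math.max(1, DESIRED_SAMPLE_COUNT / taskCount);
      }

      // Use case 2: let the last task of the vertex run some extra logic,
      // by comparing the task-index against the task count.
      public static boolean isLastTask(ProcessorContext context) {
        return context.getTaskIndex() == context.getVertexParallelism() - 1;
      }
    }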
Johannes

On 25 Jul 2014, at 19:09, Gopal V <[email protected]> wrote:

> On 7/25/14, 3:20 PM, Johannes Zillmann wrote:
>
>> OK, will try this, thanks!
>> Can you tell me the JIRA number so I can track progress on that? Do you
>> know for which version of Tez this is planned?
>>
>> And no, no more use cases currently!
>
> That JIRA is a fairly complex scale problem, so it will take a while to
> commit, because it needs extensive testing.
>
> But I could possibly split the vertex parallelism feature out into its
> own JIRA, to satisfy the common requirements.
>
> Just to clarify that again: this WIP patch provides the vertex parallelism
> of the current vertex to each task.
>
> So I'm paraphrasing your use-case as
>
> "Map 1" outputs sample-count/parallelism("Map 1")
>
> and
>
> "Reducer 2" uses parallelism("Reducer 2") to decide details of runtime.
>
> My patch waits until the runtime parallelism is set for "Map 1" and
> "Reducer 2" - input tasks can use "-1" at DAG build time, to have the
> vertex manager set it according to cluster capacity/split sizes.
> Aggregation tasks can set it between a min/max range, according to data
> output sizes from the preceding vertex.
>
> That is what Bikas confirmed.
>
> It seems to me that "scan until you find at least X records" for a sparse
> sampling case is something that can benefit a lot from a custom vertex
> manager plugin.
>
> So I would like to hear more about your use case and see whether the work
> that goes into scan short-circuiting/avoidance in "select * from table
> where x < 10 limit 10;" in Hive would help you as well.
>
> Cheers,
> Gopal
>
>> On 24 Jul 2014, at 20:38, Bikas Saha <[email protected]> wrote:
>>
>>> The patch should work for all types of vertices because it gives each
>>> task the total number of tasks for its vertex. Do you have any other
>>> use case?
>>>
>>> Bikas
>>>
>>> -----Original Message-----
>>> From: Johannes Zillmann [mailto:[email protected]]
>>> Sent: Thursday, July 24, 2014 6:58 AM
>>> To: Gopal V
>>> Cc: [email protected]
>>> Subject: Re: Task count
>>>
>>> Hey Gopal,
>>>
>>> we use the task count basically for two things (in MR for both the map
>>> stage and the reduce stage):
>>> - each task samples its output data up to a certain number; this number
>>> is the desired sample count divided by the number of tasks.
>>> - we also use the task count in some scenarios to let the last task (of
>>> a stage or a vertex) do some extra logic; that works in combination
>>> with the task-index.
>>>
>>> Looking at your patch, it looks like it will do the job for the map-like
>>> vertex but not for the aggregation vertex, right?
>>> Also, which JIRA issue is that?
>>>
>>> best
>>> Johannes
>>>
>>> On 24 Jul 2014, at 07:40, Gopal V <[email protected]> wrote:
>>>
>>>> On 7/23/14, 6:07 PM, Johannes Zillmann wrote:
>>>>> Hey Tez team,
>>>>>
>>>>> is there some way to get the task count within a vertex from within
>>>>> a task? Some equivalent to mapred.map.tasks and mapred.reduce.tasks
>>>>> for map-reduce?
>>>>
>>>> Could you explain the use-case for this particular requirement?
>>>>
>>>> I intend to add the vertex parallelism to the task context as part of
>>>> one of my WIP branches.
>>>>
>>>> I uploaded my base patch-set as is (including the TODO markers).
>>>>
>>>> https://issues.apache.org/jira/secure/attachment/12657536/TEZ-broadcast-shuffle%2Bvertex-parallelism.patch
>>>>
>>>> If you can explain what you are actually looking to do with this
>>>> information, perhaps I can roll the two feature reqs together.
>>>>
>>>> Cheers,
>>>> Gopal
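
PS: for anyone finding this thread later, the "-1 at DAG build time" part
Gopal describes maps onto the DAG API roughly like this - again just a sketch
under my assumptions (the Vertex.create(name, processor, parallelism) factory
form, and the com.example processor class names, are placeholders, not
confirmed against the WIP branch):

    import org.apache.tez.dag.api.DAG;
    import org.apache.tez.dag.api.ProcessorDescriptor;
    import org.apache.tez.dag.api.Vertex;

    public final class DeferredParallelismDag {

      public static DAG build() {
        // -1 defers the task count: the vertex manager fills it in at
        // runtime from cluster capacity / split sizes.
        Vertex map1 = Vertex.create("Map 1",
            ProcessorDescriptor.create("com.example.SamplingProcessor"),
            -1);
        // The aggregation vertex can likewise leave its parallelism open;
        // per Gopal it gets picked between a min/max range based on the
        // output sizes of the preceding vertex.
        Vertex reducer2 = Vertex.create("Reducer 2",
            ProcessorDescriptor.create("com.example.AggregationProcessor"),
            -1);
        // Edge between Map 1 and Reducer 2 omitted for brevity.
        return DAG.create("sampling-dag")
            .addVertex(map1)
            .addVertex(reducer2);
      }
    }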
