[jira] [Commented] (SPARK-10620) Look into whether accumulator mechanism can replace TaskMetrics
[ https://issues.apache.org/jira/browse/SPARK-10620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186379#comment-15186379 ]

Liwei Lin commented on SPARK-10620:
-----------------------------------

Hi [~andrewor14], in the "\[3\] A Simpler Accumulator API" section of the design doc:

{quote}
Since the design of this is mostly orthogonal to the rest of this document, here we only outline the desire for a new, simpler API, and does not discuss the solution. The actual design will be in a separate design doc.
{quote}

Is there anywhere to find that separate "Simpler Accumulator API" design doc, please? Thanks!


> Look into whether accumulator mechanism can replace TaskMetrics
> ---------------------------------------------------------------
>
>                 Key: SPARK-10620
>                 URL: https://issues.apache.org/jira/browse/SPARK-10620
>             Project: Spark
>          Issue Type: Task
>          Components: Spark Core
>            Reporter: Patrick Wendell
>            Assignee: Andrew Or
>             Fix For: 2.0.0
>
>         Attachments: accums-and-task-metrics.pdf
>
>
> This task is simply to explore whether the internal representation used by
> TaskMetrics could be performed by using accumulators rather than having two
> separate mechanisms. Note that we need to continue to preserve the existing
> "Task Metric" data structures that are exposed to users through event logs
> etc. The question is can we use a single internal codepath and perhaps make
> this easier to extend in the future.
> I think a full exploration would answer the following questions:
> - How do the semantics of accumulators on stage retries differ from aggregate
> TaskMetrics for a stage? Could we implement clearer retry semantics for
> internal accumulators to allow them to be the same - for instance, zeroing
> accumulator values if a stage is retried (see discussion here: SPARK-10042).
> - Are there metrics that do not fit well into the accumulator model, or would
> be difficult to update as an accumulator?
> - If we expose metrics through accumulators in the future rather than
> continuing to add fields to TaskMetrics, what is the best way to coerce
> compatibility?
> - Are there any other considerations?
> - Is it worth it to do this, or is the consolidation too complicated to
> justify?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
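[Editor's note] The first design question in the issue description above — zeroing accumulator values when a stage is retried (the SPARK-10042 discussion) — can be sketched roughly as follows. This is a hypothetical illustration only, not Spark's actual accumulator API; the class and method names are made up for the sketch:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch (not Spark's real API): an internal long accumulator
// that tracks updates per stage attempt, so a retried stage starts from zero
// instead of double-counting partial updates from a failed attempt.
class RetryAwareAccumulator {
    private final Map<Integer, Long> valuePerAttempt = new HashMap<>();

    // Called when the scheduler (re)submits the stage: zero out this attempt.
    void onStageAttemptStart(int attemptId) {
        valuePerAttempt.put(attemptId, 0L);
    }

    // Called when a task from the given attempt reports an update.
    void add(int attemptId, long delta) {
        valuePerAttempt.merge(attemptId, delta, Long::sum);
    }

    // One possible retry semantic: only count the most recent attempt,
    // discarding partial values accumulated by earlier, failed attempts.
    long latestAttemptValue() {
        return valuePerAttempt.entrySet().stream()
                .max(Map.Entry.comparingByKey())
                .map(Map.Entry::getValue)
                .orElse(0L);
    }
}
```

As the later comments in this thread point out, "only count the latest attempt" is just one of several defensible aggregation semantics.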
[jira] [Commented] (SPARK-10620) Look into whether accumulator mechanism can replace TaskMetrics
[ https://issues.apache.org/jira/browse/SPARK-10620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15120311#comment-15120311 ]

Apache Spark commented on SPARK-10620:
--------------------------------------

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/10958
[jira] [Commented] (SPARK-10620) Look into whether accumulator mechanism can replace TaskMetrics
[ https://issues.apache.org/jira/browse/SPARK-10620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15105619#comment-15105619 ]

Apache Spark commented on SPARK-10620:
--------------------------------------

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/10810
[jira] [Commented] (SPARK-10620) Look into whether accumulator mechanism can replace TaskMetrics
[ https://issues.apache.org/jira/browse/SPARK-10620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15105643#comment-15105643 ]

Apache Spark commented on SPARK-10620:
--------------------------------------

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/10811
[jira] [Commented] (SPARK-10620) Look into whether accumulator mechanism can replace TaskMetrics
[ https://issues.apache.org/jira/browse/SPARK-10620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15093209#comment-15093209 ]

Apache Spark commented on SPARK-10620:
--------------------------------------

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/10717
[jira] [Commented] (SPARK-10620) Look into whether accumulator mechanism can replace TaskMetrics
[ https://issues.apache.org/jira/browse/SPARK-10620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14803184#comment-14803184 ]

Imran Rashid commented on SPARK-10620:
--------------------------------------

I think you've done a good job of summarizing the key issues to consider. Can I ask that we back up one step further, and start by asking what properties we want from our metrics system? I'm not at all in love with the current TaskMetrics, I just don't see accumulators as a good replacement. Because accumulators are a public API, we are kind of stuck with the current semantics. We get a bit of wiggle room w/ internal accumulators, but not a lot.

What are the things we dislike about TaskMetrics? I think it's:
(a) there is a ton of boilerplate that you need to write for every new metric, making adding each one a huge pain
(b) it's a nuisance to filter the metrics for the common use cases -- e.g., it's easy to accidentally overcount when there is task failure or speculation, etc.

Some other key differences from accumulators -- I think these are an advantage of TaskMetrics, but maybe others see them as a disadvantage?
(c) the metrics are strongly typed. Both b/c the name of the metric is a member, so typos like "metrics.shuffleRaedMetrics" are compile errors, and also the value is strongly typed, not just the toString of something.
(d) metrics can be aggregated in a variety of ways. E.g., you can get the sum of the metric across tasks, the distribution, a timeline of partial sums, etc. You could do this with the individual values of accumulators too, but it's worth pointing out that if this is what you use them for, they aren't really "accumulating", they're just a per-task holder.

I feel like there are other designs we could consider that get around the current limitations.
For example, if each metric was keyed by an enum, and they were stored in an EnumMap, then you'd get easy iteration so you could eliminate lots of boilerplate (a), it'd be easier to write utility functions for common filters (b), and you'd still get type safety (c) and flexibility in aggregation (d). I've been told I have an obsession with EnumMaps, so maybe others won't be as keen on them -- but my main point is simply that I don't think we have only two alternatives here, and I'd prefer we take the time to consider this more completely. (Just the conversion back and forth to strings is enough to make me feel like accumulators are a kludge.)

I also want to point out that it's really hard to get things right for failures, not because it's hard to implement, but because it's hard to decide what the right *semantics* should be. For instance:
* If there is a stage failure, and a stage is resubmitted but with only a small subset of tasks, what should the aggregated value be in the UI? The value of just that stage attempt? Or should it aggregate over all attempts? Or aggregate in such a way that each *partition* is only counted once, favoring the most recent successful attempt for each partition? There is a case to be made for all three.
* Suppose a user is comparing how much data is read from hdfs in two different runs of a job -- one with speculative execution & intermittent task failure, and the other without either (a "normal" run). The average user would likely want to see the same amount of data read from hdfs in both jobs. OTOH, they are actually reading different amounts of data. While this difference may not get exposed in the standard web UI, do we want to let advanced users have any access to this difference, or is it an unsupported use case?

This isn't directly related to TaskMetrics vs. Accumulators, but goes to my overall point about considering the design. Thanks for bringing this up, I think this is a great thing for us to be thinking about and working to improve.
I hope I'm not derailing the conversation too much.
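[Editor's note] The EnumMap idea described above can be sketched as follows. This is a hypothetical illustration of the proposed design, not actual Spark code; the metric names and class are made up, and real TaskMetrics fields differ:

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical sketch of the EnumMap-keyed metrics design. Keying metrics by
// an enum gives compile-time-checked names (a typo is a compile error), easy
// iteration with no per-field boilerplate, and one-liner aggregations.
class EnumKeyedTaskMetrics {
    // Made-up metric names, for illustration only.
    enum Metric { SHUFFLE_READ_BYTES, SHUFFLE_WRITE_BYTES, RECORDS_READ }

    private final EnumMap<Metric, Long> values = new EnumMap<>(Metric.class);

    void add(Metric m, long delta) {
        values.merge(m, delta, Long::sum);
    }

    long get(Metric m) {
        return values.getOrDefault(m, 0L);
    }

    // A common aggregation becomes a small loop over the map instead of
    // hand-written code for every field.
    static EnumKeyedTaskMetrics sum(Iterable<EnumKeyedTaskMetrics> perTask) {
        EnumKeyedTaskMetrics total = new EnumKeyedTaskMetrics();
        for (EnumKeyedTaskMetrics t : perTask) {
            for (Map.Entry<Metric, Long> e : t.values.entrySet()) {
                total.add(e.getKey(), e.getValue());
            }
        }
        return total;
    }
}
```

The same iteration works for other aggregations (distributions, partial-sum timelines), which is the flexibility point (d) above; the retry-semantics questions are orthogonal and apply to this design too.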
[jira] [Commented] (SPARK-10620) Look into whether accumulator mechanism can replace TaskMetrics
[ https://issues.apache.org/jira/browse/SPARK-10620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14745690#comment-14745690 ]

Patrick Wendell commented on SPARK-10620:
-----------------------------------------

/cc [~imranr] and [~srowen] for any comments. In my mind the goal here is just to produce some design thoughts and not to actually do it (at this point).