[jira] [Commented] (TEZ-2119) Counter for launched containers
[ https://issues.apache.org/jira/browse/TEZ-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053860#comment-17053860 ] Bikas Saha commented on TEZ-2119: - Been a while. The intent of total_used might have been to maintain the total containers used irrespective of losses, returns and re-acquisitions. Initial_held is number held even when nothing is running (hot start). > Counter for launched containers > --- > > Key: TEZ-2119 > URL: https://issues.apache.org/jira/browse/TEZ-2119 > Project: Apache Tez > Issue Type: Improvement >Reporter: Rohini Palaniswamy >Assignee: László Bodor >Priority: Major > Attachments: TEZ-2119.01.patch > > > org.apache.tez.common.counters.DAGCounter > NUM_SUCCEEDED_TASKS=32976 > TOTAL_LAUNCHED_TASKS=32976 > OTHER_LOCAL_TASKS=2 > DATA_LOCAL_TASKS=9147 > RACK_LOCAL_TASKS=23761 > It would be very nice to have TOTAL_LAUNCHED_CONTAINERS counter added to > this. The difference between TOTAL_LAUNCHED_CONTAINERS and > TOTAL_LAUNCHED_TASKS should make it easy to see how much container reuse is > happening. It is very hard to find out now. -- This message was sent by Atlassian Jira (v8.3.4#803005)
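The counter arithmetic the description asks for is easy to sketch. The snippet below is illustrative only (not Tez code); the 4,000-container figure is an assumed value, since the TOTAL_LAUNCHED_CONTAINERS counter is exactly what this issue proposes to add:

```python
# Illustrative sketch: estimating container reuse from DAG counters.
# TOTAL_LAUNCHED_TASKS comes from the example above; the container count
# is hypothetical, since the proposed counter does not exist yet.
def container_reuse_factor(total_launched_tasks, total_launched_containers):
    """Average number of task attempts run per container.

    A factor of 1.0 means no reuse; higher means more reuse.
    """
    if total_launched_containers == 0:
        return 0.0
    return total_launched_tasks / total_launched_containers

# With the counters from the issue description and an assumed 4,000 containers:
factor = container_reuse_factor(32976, 4000)
```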
[jira] [Commented] (TEZ-1786) Support for speculation of slow tasks
[ https://issues.apache.org/jira/browse/TEZ-1786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16156018#comment-16156018 ] Bikas Saha commented on TEZ-1786: - That's correct. > Support for speculation of slow tasks > - > > Key: TEZ-1786 > URL: https://issues.apache.org/jira/browse/TEZ-1786 > Project: Apache Tez > Issue Type: New Feature >Reporter: Bikas Saha >Assignee: Bikas Saha > > Umbrella jira to track speculation of attempts to mitigate stragglers -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (TEZ-3770) DAG-aware YARN task scheduler
[ https://issues.apache.org/jira/browse/TEZ-3770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065756#comment-16065756 ] Bikas Saha commented on TEZ-3770: - Just clarifying that the original scheduler was deliberately not made DAG-aware. That was an attempt to prevent leaky abstractions, where a change would have to span both the scheduler and the DAG state machine, as happened in the MR code where such logic was spread all over. The DAG core logic and the VertexManager user logic could determine the dependencies and priorities of tasks, and the scheduler would allocate resources based on priority. So other schedulers could be written easily, since they don't need to understand complex relationships. However, not all of those design assumptions have been validated, since we don't have many schedulers written :P > DAG-aware YARN task scheduler > - > > Key: TEZ-3770 > URL: https://issues.apache.org/jira/browse/TEZ-3770 > Project: Apache Tez > Issue Type: New Feature >Reporter: Jason Lowe >Assignee: Jason Lowe > Attachments: TEZ-3770.001.patch > > > There are cases where priority alone does not convey the relationship between > tasks, and this can cause problems when scheduling or preempting tasks. If > the YARN task scheduler was aware of the relationship between tasks then it > could make smarter decisions when trying to assign tasks to containers or > preempt running tasks to schedule pending tasks. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (TEZ-394) Better scheduling for uneven DAGs
[ https://issues.apache.org/jira/browse/TEZ-394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16030365#comment-16030365 ] Bikas Saha commented on TEZ-394: Not sure I understood this correctly. bq.V1->V3->V4->V5 bq.V2->V5 bq.V6->V7 V2 being lower priority seems similar to the issue in the original description of this jira. V6 -> V7 being disconnected from the other vertices makes sense. For that case, the current approach of either distance from root or distance from leaf would give V6 high priority. Is the intent to make V6 low priority? > Better scheduling for uneven DAGs > - > > Key: TEZ-394 > URL: https://issues.apache.org/jira/browse/TEZ-394 > Project: Apache Tez > Issue Type: Sub-task >Reporter: Rohini Palaniswamy >Assignee: Jason Lowe > Attachments: TEZ-394.001.patch, TEZ-394.002.patch, TEZ-394.003.patch > > > Consider a series of joins or group by on dataset A with few datasets that > takes 10 hours followed by a final join with a dataset X. The vertex that > loads dataset X will be one of the top vertexes and initialized early even > though its output is not consumed till the end after 10 hours. > 1) Could either use delayed start logic for better resource allocation > 2) Else if they are started upfront, need to handle failure/recovery cases > where the nodes which executed the MapTask might have gone down when the > final join happens. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
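The two priority heuristics mentioned in the comment (distance from root vs. distance from leaf) can be computed for the example DAG. This is a minimal sketch with vertex names taken from the comment; it is not the Tez scheduler's actual code:

```python
# Sketch: compute distance-from-root and distance-from-leaf for the example
# DAG from the comment: V1->V3->V4->V5, V2->V5, V6->V7.
from collections import defaultdict

edges = [("V1", "V3"), ("V3", "V4"), ("V4", "V5"), ("V2", "V5"), ("V6", "V7")]
vertices = {"V1", "V2", "V3", "V4", "V5", "V6", "V7"}
succ, pred = defaultdict(list), defaultdict(list)
for s, d in edges:
    succ[s].append(d)
    pred[d].append(s)

def longest_distance(vs, neighbors):
    """Longest path length from each vertex to a vertex with no neighbors."""
    memo = {}
    def dist(v):
        if v not in memo:
            memo[v] = 0 if not neighbors[v] else 1 + max(dist(n) for n in neighbors[v])
        return memo[v]
    return {v: dist(v) for v in vs}

dist_from_root = longest_distance(vertices, pred)  # 0 for roots V1, V2, V6
dist_from_leaf = longest_distance(vertices, succ)  # 0 for leaves V5, V7
```

Under distance-from-root the disconnected V6 lands in the same tier as the real roots, which illustrates why the comment asks whether V6 should instead be made low priority.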
[jira] [Commented] (TEZ-3696) Jobs can hang when both concurrency and speculation are enabled
[ https://issues.apache.org/jira/browse/TEZ-3696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15997706#comment-15997706 ] Bikas Saha commented on TEZ-3696: - Thanks [~ebadger]! I missed that part of the code. Makes sense. > Jobs can hang when both concurrency and speculation are enabled > --- > > Key: TEZ-3696 > URL: https://issues.apache.org/jira/browse/TEZ-3696 > Project: Apache Tez > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger > Fix For: 0.9.0, 0.8.6 > > Attachments: TEZ-3696.001.patch, TEZ-3696.002.patch, > TEZ-3696.003.patch, TEZ-3696.004.patch > > > We can reproduce the hung job by doing the following: > 1. Run a sleep job with a concurrency of 1, speculation enabled, and 3 tasks > {noformat} > HADOOP_CLASSPATH="$TEZ_HOME/*:$TEZ_HOME/lib/*:$TEZ_CONF_DIR" yarn jar > $TEZ_HOME/tez-tests-*.jar mrrsleep -Dtez.am.vertex.max-task-concurrency=1 > -Dtez.am.speculation.enabled=true -Dtez.task.timeout-ms=6 -m 3 -mt 6 > -ir 0 -irt 0 -r 0 -rt 0 > {noformat} > 2. Let the 1st task run to completion and then stop the 2nd task so that a > speculative attempt is scheduled. Once the speculative attempt is scheduled > for the 2nd task, continue the original attempt and let it complete. > {noformat} > kill -STOP > // wait a few seconds for a speculative attempt to kick off > kill -CONT > {noformat} > 3. Kill the 3rd task, which will create a 2nd attempt > {noformat} > kill -9 > {noformat} > 4. The next thing to be drawn off of the queue will be the speculative > attempt of the 2nd task. However, it is already completed, so it will just > sit in the final state and the job will hang. > Basically, for the failure to happen, the number of speculative tasks that > are scheduled, but not yet ran has to be >= the concurrency of the job and > there has to be at least 1 task failure. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (TEZ-3696) Jobs can hang when both concurrency and speculation are enabled
[ https://issues.apache.org/jira/browse/TEZ-3696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15993419#comment-15993419 ] Bikas Saha commented on TEZ-3696: - Thanks for the ping. Looking at the code again, I am not sure why I had the check for succeeded attempts for sending the completed event. I renamed the event type from succeeded to completed in the same patch and hence I may have intended to stop differentiating them. But the sending code for the completed event was under the succeeded check. That seems inconsistent. If the above is correct, then perhaps the issue would happen even without speculation and on every attempt failure. Because a failed attempt would not decrease the running count and so its retry would not get scheduled, leading to an off-by-N situation in the concurrency count in the dag scheduler. If this is correct, then perhaps the fix is only to send the completed event all the time and not make any other changes in the dag scheduler itself. > Jobs can hang when both concurrency and speculation are enabled > --- > > Key: TEZ-3696 > URL: https://issues.apache.org/jira/browse/TEZ-3696 > Project: Apache Tez > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger > Attachments: TEZ-3696.001.patch, TEZ-3696.002.patch, > TEZ-3696.003.patch > > > We can reproduce the hung job by doing the following: > 1. Run a sleep job with a concurrency of 1, speculation enabled, and 3 tasks > {noformat} > HADOOP_CLASSPATH="$TEZ_HOME/*:$TEZ_HOME/lib/*:$TEZ_CONF_DIR" yarn jar > $TEZ_HOME/tez-tests-*.jar mrrsleep -Dtez.am.vertex.max-task-concurrency=1 > -Dtez.am.speculation.enabled=true -Dtez.task.timeout-ms=6 -m 3 -mt 6 > -ir 0 -irt 0 -r 0 -rt 0 > {noformat} > 2. Let the 1st task run to completion and then stop the 2nd task so that a > speculative attempt is scheduled. Once the speculative attempt is scheduled > for the 2nd task, continue the original attempt and let it complete. 
> {noformat} > kill -STOP > // wait a few seconds for a speculative attempt to kick off > kill -CONT > {noformat} > 3. Kill the 3rd task, which will create a 2nd attempt > {noformat} > kill -9 > {noformat} > 4. The next thing to be drawn off of the queue will be the speculative > attempt of the 2nd task. However, it is already completed, so it will just > sit in the final state and the job will hang. > Basically, for the failure to happen, the number of speculative tasks that > are scheduled, but not yet ran has to be >= the concurrency of the job and > there has to be at least 1 task failure. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
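The off-by-N concurrency accounting discussed in the comments can be modeled in a few lines. This is an illustrative sketch with made-up names, not the actual Tez DAG scheduler: the key point is that the scheduler must return the concurrency slot for every terminal attempt (succeeded, failed, or killed), not just for successes, otherwise a failure leaks a slot and pending work hangs once the leaks reach the concurrency limit.

```python
# Sketch of the concurrency-accounting fix: attempt_completed() is invoked
# for EVERY terminal attempt state, so the slot is always returned and the
# next pending attempt (e.g. a retry) can be launched.
from collections import deque

class ConcurrencyLimitedScheduler:
    def __init__(self, max_concurrency):
        self.max_concurrency = max_concurrency
        self.running = 0
        self.pending = deque()
        self.launched = []

    def submit(self, attempt):
        self.pending.append(attempt)
        self._maybe_launch()

    def attempt_completed(self, attempt):
        # Called on success, failure, or kill -- never skipped.
        self.running -= 1
        self._maybe_launch()

    def _maybe_launch(self):
        while self.pending and self.running < self.max_concurrency:
            self.running += 1
            self.launched.append(self.pending.popleft())

sched = ConcurrencyLimitedScheduler(max_concurrency=1)
sched.submit("t1_attempt0")
sched.submit("t2_attempt0")             # waits: concurrency limit reached
sched.attempt_completed("t1_attempt0")  # even if t1 failed, the slot is freed
```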
[jira] [Comment Edited] (TEZ-394) Better scheduling for uneven DAGs
[ https://issues.apache.org/jira/browse/TEZ-394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15865631#comment-15865631 ] Bikas Saha edited comment on TEZ-394 at 2/14/17 12:30 PM: -- Thanks for doing this! I regret not having done this right from the start. Mostly looks good to me. The name of the assigned variable is now misleading because it's not topologically sorted anymore. {code}
+topologicalVertexStack = reorderForCriticalPath(topologicalVertexStack,
+    vertexMap, inboundVertexMap, outboundVertexMap);
{code} [~rohini] IIRC, this will only change the vertex priority wrt other vertices. Vertices would still be scheduled based on their managers and typically based on completion of their inputs. So Root1, Root2 would both be ready and start running. Int3 would currently be blocked behind both but after this would be preferred to Root2 after Int3 is deemed capable of running. [~gopalv] Would this break any assumptions in Hive? was (Author: bikassaha): Thanks for doing this! I regret not having done this right from the start. Mostly looks good to me. The name of the assigned variable is now misleading because it's not topologically sorted anymore. {code}
+topologicalVertexStack = reorderForCriticalPath(topologicalVertexStack,
+    vertexMap, inboundVertexMap, outboundVertexMap);
{code} [~gopalv] Would this break any assumptions in Hive? > Better scheduling for uneven DAGs > - > > Key: TEZ-394 > URL: https://issues.apache.org/jira/browse/TEZ-394 > Project: Apache Tez > Issue Type: Sub-task >Reporter: Rohini Palaniswamy >Assignee: Jason Lowe > Attachments: TEZ-394.001.patch > > > Consider a series of joins or group by on dataset A with few datasets that > takes 10 hours followed by a final join with a dataset X. The vertex that > loads dataset X will be one of the top vertexes and initialized early even > though its output is not consumed till the end after 10 hours. 
> 1) Could either use delayed start logic for better resource allocation > 2) Else if they are started upfront, need to handle failure/recovery cases > where the nodes which executed the MapTask might have gone down when the > final join happens. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (TEZ-394) Better scheduling for uneven DAGs
[ https://issues.apache.org/jira/browse/TEZ-394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15865631#comment-15865631 ] Bikas Saha commented on TEZ-394: Thanks for doing this! I regret not having done this right from the start. Mostly looks good to me. The name of the assigned variable is now misleading because it's not topologically sorted anymore. {code}
+topologicalVertexStack = reorderForCriticalPath(topologicalVertexStack,
+    vertexMap, inboundVertexMap, outboundVertexMap);
{code} [~gopalv] Would this break any assumptions in Hive? > Better scheduling for uneven DAGs > - > > Key: TEZ-394 > URL: https://issues.apache.org/jira/browse/TEZ-394 > Project: Apache Tez > Issue Type: Sub-task >Reporter: Rohini Palaniswamy >Assignee: Jason Lowe > Attachments: TEZ-394.001.patch > > > Consider a series of joins or group by on dataset A with few datasets that > takes 10 hours followed by a final join with a dataset X. The vertex that > loads dataset X will be one of the top vertexes and initialized early even > though its output is not consumed till the end after 10 hours. > 1) Could either use delayed start logic for better resource allocation > 2) Else if they are started upfront, need to handle failure/recovery cases > where the nodes which executed the MapTask might have gone down when the > final join happens. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (TEZ-3512) Update EdgePlan proto for named edge
[ https://issues.apache.org/jira/browse/TEZ-3512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15768737#comment-15768737 ] Bikas Saha commented on TEZ-3512: - How can we be sure that SrcDest or DestSrc set by the AM will not conflict with an edge name set by the user? If we can be sure of that in the AM, why can we not be sure of that in the client? What am I missing here? Clearly you have something in mind that I am missing. > Update EdgePlan proto for named edge > > > Key: TEZ-3512 > URL: https://issues.apache.org/jira/browse/TEZ-3512 > Project: Apache Tez > Issue Type: Sub-task >Reporter: Zhiyuan Yang >Assignee: Zhiyuan Yang > Attachments: TEZ-3512.1.patch, TEZ-3512.2.patch > > > EdgePlan (protobuf) should have one more field for edge name. Related DAG > plan creation and parsing should be modified accordingly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-3512) Update EdgePlan proto for named edge
[ https://issues.apache.org/jira/browse/TEZ-3512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15768614#comment-15768614 ] Bikas Saha commented on TEZ-3512: - I can see that in the patch :) But what will the value be for these null names? I ask because you made a valid point that any system-generated names may collide with user-defined names. In that case, it is better to fail fast (on the client) than late (in the AM). This was not a problem earlier because there were no edge names. Hence we need to be clear about that now. > Update EdgePlan proto for named edge > > > Key: TEZ-3512 > URL: https://issues.apache.org/jira/browse/TEZ-3512 > Project: Apache Tez > Issue Type: Sub-task >Reporter: Zhiyuan Yang >Assignee: Zhiyuan Yang > Attachments: TEZ-3512.1.patch, TEZ-3512.2.patch > > > EdgePlan (protobuf) should have one more field for edge name. Related DAG > plan creation and parsing should be modified accordingly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-3512) Update EdgePlan proto for named edge
[ https://issues.apache.org/jira/browse/TEZ-3512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15765960#comment-15765960 ] Bikas Saha commented on TEZ-3512: - bq. Default value is inappropriate because any default value may also be used by user What is the solution to the problem then? Even in the AM we cannot safely pick a default value, since the user may have specified that value as the edge name. Is that correct? If so, isn't it better to check for that on the client side and fail fast (instead of waiting for the job to run and then fail)? Rest looks good to me. > Update EdgePlan proto for named edge > > > Key: TEZ-3512 > URL: https://issues.apache.org/jira/browse/TEZ-3512 > Project: Apache Tez > Issue Type: Sub-task >Reporter: Zhiyuan Yang >Assignee: Zhiyuan Yang > Attachments: TEZ-3512.1.patch, TEZ-3512.2.patch > > > EdgePlan (protobuf) should have one more field for edge name. Related DAG > plan creation and parsing should be modified accordingly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-3512) Update EdgePlan proto for named edge
[ https://issues.apache.org/jira/browse/TEZ-3512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15738699#comment-15738699 ] Bikas Saha commented on TEZ-3512: - When the DAG is being compiled on the client side, a default value could be provided to an edge between v1 and v2 if the edge name is null. In the tests, it would be good to have a string s="edge2" and refer to that instead of hard-coding "edge2" everywhere. > Update EdgePlan proto for named edge > > > Key: TEZ-3512 > URL: https://issues.apache.org/jira/browse/TEZ-3512 > Project: Apache Tez > Issue Type: Sub-task >Reporter: Zhiyuan Yang >Assignee: Zhiyuan Yang > Attachments: TEZ-3512.1.patch > > > EdgePlan (protobuf) should have one more field for edge name. Related DAG > plan creation and parsing should be modified accordingly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-3222) Reduce messaging overhead for auto-reduce parallelism case
[ https://issues.apache.org/jira/browse/TEZ-3222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15709837#comment-15709837 ] Bikas Saha commented on TEZ-3222: - bq. routeInputSourceTaskFailedEventToDestination I think this could be deferred because it's not the common case; it applies mainly to failed-task event handling. > Reduce messaging overhead for auto-reduce parallelism case > -- > > Key: TEZ-3222 > URL: https://issues.apache.org/jira/browse/TEZ-3222 > Project: Apache Tez > Issue Type: Bug >Reporter: Jonathan Eagles >Assignee: Jonathan Eagles > Attachments: TEZ-3222.1.patch, TEZ-3222.2.patch, TEZ-3222.3.patch, > TEZ-3222.4.patch, TEZ-3222.5.patch, TEZ-3222.6.patch, TEZ-3222.7.patch > > > A dag with 15k x 1000k vertex may auto-reduce to 15k x 1. And while the data > size is appropriate for 1 task attempt, this results in an increase in task > attempt message processing of 1000x. > This jira aims to reduce the message processing in the auto-reduced task > while keeping the amount of message processing in the AM the same or less. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-3222) Reduce messaging overhead for auto-reduce parallelism case
[ https://issues.apache.org/jira/browse/TEZ-3222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15671995#comment-15671995 ] Bikas Saha commented on TEZ-3222: - Sounds good! Thanks! > Reduce messaging overhead for auto-reduce parallelism case > -- > > Key: TEZ-3222 > URL: https://issues.apache.org/jira/browse/TEZ-3222 > Project: Apache Tez > Issue Type: Bug >Reporter: Jonathan Eagles >Assignee: Jonathan Eagles > Attachments: TEZ-3222.1.patch, TEZ-3222.2.patch, TEZ-3222.3.patch, > TEZ-3222.4.patch, TEZ-3222.5.patch, TEZ-3222.6.patch > > > A dag with 15k x 1000k vertex may auto-reduce to 15k x 1. And while the data > size is appropriate for 1 task attempt, this results in an increase in task > attempt message processing of 1000x. > This jira aims to reduce the message processing in the auto-reduced task > while keeping the amount of message processing in the AM the same or less. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1190) Allow multiple edges between two vertices
[ https://issues.apache.org/jira/browse/TEZ-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15671986#comment-15671986 ] Bikas Saha commented on TEZ-1190: - Still don't understand why making named/unnamed exclusive is going to help. An example would help. Having the backend completely named would make the new feature (enabled/disabled/exclusive/hybrid) a purely client-side API thing, which it naturally seems to be. I have not looked at the code for a while :) With that caveat, it's not clear to me why having the AM handle default names makes things easier vs doing it at the client and keeping the AM agnostic. You are right that since some user code also runs in edge/VM plugins, the plugin wrapper layer also needs the default-case handling. That would be similar to the handling on the client side. Either way works. I guess the reason I am persisting on this is that I think the separation of concerns would be better if this were handled in the API layer. After all, this is more of an API thing (which until now has leaked into the server side). > Allow multiple edges between two vertices > - > > Key: TEZ-1190 > URL: https://issues.apache.org/jira/browse/TEZ-1190 > Project: Apache Tez > Issue Type: Bug >Reporter: Daniel Dai >Assignee: Zhiyuan Yang > Attachments: NamedEdgeDesign.pdf, TEZ-1190.prototype.patch > > > This will be helpful in some scenario. In particular example, we can merge > two small pipelines together in one pair of vertex. Note it is possible the > edge type between the two vertexes are different. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1190) Allow multiple edges between two vertices
[ https://issues.apache.org/jira/browse/TEZ-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15652811#comment-15652811 ] Bikas Saha commented on TEZ-1190: - How is the restriction of either all named or unnamed helpful? How about an implementation approach where all implicit behavior is removed in the core layer, and edges are always named. In the DAGClient layer, the new API will provide the name, or for backwards compatibility the DAGClient layer will auto-generate unique names (e.g. SourceDestinationCounter). Thus the existing implicit behavior is limited to the DAGClient layer. Similarly for plugins/contexts, we could add a new API with the edge name semantics instead of overloading the semantics, because both parameters (sourceName or edgeName) are strings. And we could deprecate the existing semantic API that uses vertex names. A translation layer could handle the implicit conversion of vertexName to auto-generated names produced by the DAGClient. The reason I suggest changing the internal core layer to always use edge names and keeping the compatibility handling in the API layers is that it might be a cleaner cut of the code, and reduce the number of bugs left behind due to missed cases of implicit use. By continuing to support implicit names internally we may increase the surface area of such leaks. Rest looks good to me for now. Nice job with capturing the cases! Of course the devil is in the details :) BTW, the doc implicitly assumes that the dummy vertex approach is being dropped in favor of the named edge approach? > Allow multiple edges between two vertices > - > > Key: TEZ-1190 > URL: https://issues.apache.org/jira/browse/TEZ-1190 > Project: Apache Tez > Issue Type: Bug >Reporter: Daniel Dai >Assignee: Zhiyuan Yang > Attachments: NamedEdgeDesign.pdf, TEZ-1190.prototype.patch > > > This will be helpful in some scenario. In particular example, we can merge > two small pipelines together in one pair of vertex. 
Note it is possible the > edge type between the two vertexes are different. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
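The auto-generated naming scheme suggested in the comments (always-named edges internally, names like SourceDestinationCounter generated client-side, with a fail-fast collision check) could look roughly like the sketch below. The function name, tuple shape, and name format are all invented for illustration; they are not the Tez API:

```python
# Sketch of client-side default edge naming: when the user supplies no name,
# generate <source><destination><counter>; reject any duplicate name up front
# so the job fails fast on the client instead of later in the AM.
def assign_edge_names(edges):
    """edges: list of (source, destination, name-or-None) tuples.

    Returns {(source, destination, index): unique edge name}.
    Raises ValueError on any name collision.
    """
    used = set()
    counters = {}
    named = {}
    for i, (src, dst, name) in enumerate(edges):
        if name is None:
            n = counters.get((src, dst), 0)
            counters[(src, dst)] = n + 1
            name = f"{src}{dst}{n}"  # hypothetical format, e.g. "v1v20"
        if name in used:
            raise ValueError(f"duplicate edge name: {name}")
        used.add(name)
        named[(src, dst, i)] = name
    return named

names = assign_edge_names([("v1", "v2", None), ("v1", "v2", None), ("v1", "v3", "join")])
```

Keeping this entirely in the client/API layer is what the comment argues for: the core layer then only ever sees explicit, unique edge names.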
[jira] [Comment Edited] (TEZ-1190) Allow multiple edges between two vertices
[ https://issues.apache.org/jira/browse/TEZ-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15623227#comment-15623227 ] Bikas Saha edited comment on TEZ-1190 at 10/31/16 8:15 PM: --- +1 for design doc. A while back we had discussed about this and thought that an edge name could be made optional in the edge definition. When the name is specified, it's used. If not specified, it defaults to the source/destination as it does today. So existing edges would continue to work with the implicit default. That seemed like a natural extension API-wise. The internal impl would change to always use edge names. The names would be set from the API or implicitly. Would be good to know if this design is being used or a new design is being proposed. Thanks! was (Author: bikassaha): +1 for design doc. A while back we had discussed about this and thought that an edge name could be made optional in the edge definition. When the name is specified, it's used. If not specified, it defaults to the source/destination as it does today. So existing edges would continue to work with the implicit default. That seemed like a natural extension API-wise. Would be good to know if this design is being used or a new design is being proposed. Thanks! > Allow multiple edges between two vertices > - > > Key: TEZ-1190 > URL: https://issues.apache.org/jira/browse/TEZ-1190 > Project: Apache Tez > Issue Type: Bug >Reporter: Daniel Dai >Assignee: Zhiyuan Yang > > This will be helpful in some scenario. In particular example, we can merge > two small pipelines together in one pair of vertex. Note it is possible the > edge type between the two vertexes are different. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1190) Allow multiple edges between two vertices
[ https://issues.apache.org/jira/browse/TEZ-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15623227#comment-15623227 ] Bikas Saha commented on TEZ-1190: - +1 for design doc. A while back we had discussed about this and thought that an edge name could be made optional in the edge definition. When the name is specified, it's used. If not specified, it defaults to the source/destination as it does today. So existing edges would continue to work with the implicit default. That seemed like a natural extension API-wise. Would be good to know if this design is being used or a new design is being proposed. Thanks! > Allow multiple edges between two vertices > - > > Key: TEZ-1190 > URL: https://issues.apache.org/jira/browse/TEZ-1190 > Project: Apache Tez > Issue Type: Bug >Reporter: Daniel Dai >Assignee: Zhiyuan Yang > > This will be helpful in some scenario. In particular example, we can merge > two small pipelines together in one pair of vertex. Note it is possible the > edge type between the two vertexes are different. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-3222) Reduce messaging overhead for auto-reduce parallelism case
[ https://issues.apache.org/jira/browse/TEZ-3222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15590836#comment-15590836 ] Bikas Saha commented on TEZ-3222: - {code}
-return commonRouteMeta[sourceTaskIndex];
+return CompositeEventRouteMetadata.create(1, sourceTaskIndex, 0);
{code} The removed code is looking up an array indexed by sourceTaskIndex while the new code is directly using the sourceTaskIndex. Is there a difference? Also, reusing the caching (as done earlier) may improve critical path CPU for object creation. Though for broadcast edge I am not sure if CDME is used as of now. {code}
+message CompositeRoutedDataMovementEventProto {
+  optional int32 source_index = 1;
+  optional int32 target_index = 2;
+  optional int32 count = 3;
+  optional bytes user_payload = 4;
+  optional int32 version = 5;
+}
{code} Can we create a message for CompositeRouteMeta and use it, vs expanding its contents? That way CompositeRouteMeta could evolve independently. {code}
 if (event instanceof DataMovementEvent) {
   numDmeEvents.incrementAndGet();
-  processDataMovementEvent((DataMovementEvent)event);
+  DataMovementEvent dmEvent = (DataMovementEvent)event;
+  DataMovementEventPayloadProto shufflePayload;
+  try {
+    shufflePayload = DataMovementEventPayloadProto.parseFrom(ByteString.copyFrom(dmEvent.getUserPayload()));
+  } catch (InvalidProtocolBufferException e) {
+    throw new TezUncheckedException("Unable to parse DataMovementEvent payload", e);
+  }
+  BitSet emptyPartitionsBitSet = null;
+  if (shufflePayload.hasEmptyPartitions()) {
+    try {
+      byte[] emptyPartitions = TezCommonUtils.decompressByteStringToByteArray(shufflePayload.getEmptyPartitions(), inflater);
{code} I don't think DMEs have an empty-partition bitset, since they don't carry multi-partition data. Right? [~rajesh.balamohan] Rest looks good to me. +1 Thanks! 
> Reduce messaging overhead for auto-reduce parallelism case > -- > > Key: TEZ-3222 > URL: https://issues.apache.org/jira/browse/TEZ-3222 > Project: Apache Tez > Issue Type: Bug >Reporter: Jonathan Eagles >Assignee: Jonathan Eagles > Attachments: TEZ-3222.1.patch, TEZ-3222.2.patch, TEZ-3222.3.patch, > TEZ-3222.4.patch, TEZ-3222.5.patch, TEZ-3222.6.patch > > > A dag with 15k x 1000k vertex may auto-reduce to 15k x 1. And while the data > size is appropriate for 1 task attempt, this results in an increase in task > attempt message processing of 1000x. > This jira aims to reduce the message processing in the auto-reduced task > while keeping the amount of message processing in the AM the same or less. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
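The CompositeRoutedDataMovementEventProto fields quoted in the review (source_index, target_index, count) suggest that one composite event stands in for a run of per-target events. The sketch below is one interpretation of those fields for illustration only, not the actual Tez routing semantics:

```python
# Illustrative sketch: expanding a composite routed data-movement event into
# individual (source, target) routes. ASSUMPTION: `count` covers consecutive
# source and target indices; this is an interpretation of the quoted proto
# fields, not a statement of the real Tez behavior.
def expand_composite_route(source_index, target_index, count):
    """Return the individual routes a composite event stands in for."""
    return [(source_index + i, target_index + i) for i in range(count)]

routes = expand_composite_route(source_index=0, target_index=5, count=3)
```

Carrying one such event instead of `count` separate ones is the kind of saving this jira targets for the auto-reduce case, where a single task can become the target of thousands of events.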
[jira] [Commented] (TEZ-3163) Reuse and tune Inflaters and Deflaters to speed DME processing
[ https://issues.apache.org/jira/browse/TEZ-3163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15504112#comment-15504112 ] Bikas Saha commented on TEZ-3163: - /cc [~hitesh] [~aplusplus] > Reuse and tune Inflaters and Deflaters to speed DME processing > -- > > Key: TEZ-3163 > URL: https://issues.apache.org/jira/browse/TEZ-3163 > Project: Apache Tez > Issue Type: Bug >Reporter: Jonathan Eagles >Assignee: Jonathan Eagles > Attachments: TEZ-3163.1-branch-0.7.patch, TEZ-3163.1.patch, > TEZ-3163.2.patch, TEZ-3163.PERF.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-3388) Provide error information in shuffle response header
[ https://issues.apache.org/jira/browse/TEZ-3388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-3388: Description: In MR shuffle, if any partition has an error then the reader gets an exception while reading the response stream and loses all the data for all partitions. Instead if the shuffle response header had more metadata then errors could be handled more efficiently. See YARN-1773 for history. (was: In MR shuffle, if any partition has an error then the reader gets an exception while reading the response stream and loses all the data for all partitions. Instead if the shuffle response header had more metadata then errors could be handled more efficiently.) > Provide error information in shuffle response header > > > Key: TEZ-3388 > URL: https://issues.apache.org/jira/browse/TEZ-3388 > Project: Apache Tez > Issue Type: Sub-task >Reporter: Bikas Saha > > In MR shuffle, if any partition has an error then the reader gets an > exception while reading the response stream and loses all the data for all > partitions. Instead if the shuffle response header had more metadata then > errors could be handled more efficiently. See YARN-1773 for history. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-3388) Provide error information in shuffle response header
Bikas Saha created TEZ-3388: --- Summary: Provide error information in shuffle response header Key: TEZ-3388 URL: https://issues.apache.org/jira/browse/TEZ-3388 Project: Apache Tez Issue Type: Sub-task Reporter: Bikas Saha In MR shuffle, if any partition has an error then the reader gets an exception while reading the response stream and loses all the data for all partitions. Instead if the shuffle response header had more metadata then errors could be handled more efficiently. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
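The idea in the description (a richer shuffle response header, so one bad partition does not poison the whole stream) can be sketched as a per-partition header record. Every field name and the wire layout below are invented for illustration; this is not the MR or Tez shuffle protocol:

```python
# Sketch: a shuffle response framed with a per-partition header carrying an
# error flag, so the reader can skip a failed partition and keep consuming
# the rest instead of aborting on a mid-stream exception.
# All names and the header layout are hypothetical.
import struct

HEADER = struct.Struct("!IQB")  # partition id, payload length, error flag

def frame_partitions(partitions):
    """partitions: list of (partition_id, payload_bytes_or_None)."""
    out = b""
    for pid, payload in partitions:
        if payload is None:  # partition hit an error on the server side
            out += HEADER.pack(pid, 0, 1)
        else:
            out += HEADER.pack(pid, len(payload), 0) + payload
    return out

def read_partitions(stream):
    """Return [(partition_id, payload-or-None)]; None marks a failed partition."""
    offset, results = 0, []
    while offset < len(stream):
        pid, length, err = HEADER.unpack_from(stream, offset)
        offset += HEADER.size
        payload = None if err else stream[offset:offset + length]
        offset += length
        results.append((pid, payload))
    return results

framed = frame_partitions([(0, b"data0"), (1, None), (2, b"data2")])
```

With a framing like this, an error in partition 1 costs only that partition; in the stream-exception model described above, all three would be lost.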
[jira] [Commented] (TEZ-3317) Speculative execution starts too early due to 0 progress
[ https://issues.apache.org/jira/browse/TEZ-3317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15383436#comment-15383436 ] Bikas Saha commented on TEZ-3317: - Sorry, I did not understand what the issue here is from the above comment. Is task progress = 0 always or occasionally? What is the flow in the buggy situation? Is it the case where the processor makes no progress because the input is slow, and because input progress is not available to the processor, it reports 0 progress overall for a long time? > Speculative execution starts too early due to 0 progress > > > Key: TEZ-3317 > URL: https://issues.apache.org/jira/browse/TEZ-3317 > Project: Apache Tez > Issue Type: Improvement >Reporter: Jonathan Eagles > > Don't know at this point if this is a tez or a PigProcessor issue. There is > some setProgress chain that is keeping task progress from being correctly > reported. Task status is always zero, so as soon as the first task finishes, > tasks up to the speculation limit are always launched. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-3334) Tez Custom Shuffle Handler
[ https://issues.apache.org/jira/browse/TEZ-3334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15374071#comment-15374071 ] Bikas Saha commented on TEZ-3334: - Also reporting errors properly in the response such that 1 error does not corrupt the entire data stream. YARN-1773. > Tez Custom Shuffle Handler > -- > > Key: TEZ-3334 > URL: https://issues.apache.org/jira/browse/TEZ-3334 > Project: Apache Tez > Issue Type: New Feature >Reporter: Jonathan Eagles > > For conditions where auto-parallelism is reduced (e.g. TEZ-3222), a custom > shuffle handler could help reduce the number of fetches and could more > efficiently fetch data. In particular if a reducer is fetching 100 pieces > serially from the same mapper it could do this in one fetch call. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-3334) Tez Custom Shuffle Handler
[ https://issues.apache.org/jira/browse/TEZ-3334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15374065#comment-15374065 ] Bikas Saha commented on TEZ-3334: - YARN-4577 covers classpath isolation of aux services. Perhaps the first thing could be the POC: take the existing MR shuffle and change its packaging to org.apache.tez, then add it as tez_shuffle in YARN alongside mapreduce_shuffle, and verify that Tez jobs use the Tez shuffle and MR jobs use the MR shuffle (both shuffle services effectively running the same code). After that we can create follow-up jiras for new features and improvements to the Tez shuffle. Sounds like a plan? > Tez Custom Shuffle Handler > -- > > Key: TEZ-3334 > URL: https://issues.apache.org/jira/browse/TEZ-3334 > Project: Apache Tez > Issue Type: New Feature >Reporter: Jonathan Eagles > > For conditions where auto-parallelism is reduced (e.g. TEZ-3222), a custom > shuffle handler could help reduce the number of fetches and could more > efficiently fetch data. In particular if a reducer is fetching 100 pieces > serially from the same mapper it could do this in one fetch call. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
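For reference, running the repackaged service next to the MR one would amount to a NodeManager config along these lines. The tez_shuffle service name matches the comment above, but the Tez handler class name shown is an assumption for the POC, not a settled name:

```xml
<!-- yarn-site.xml (sketch): run tez_shuffle alongside mapreduce_shuffle.
     The org.apache.tez handler class below is hypothetical for the POC. -->
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle,tez_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.tez_shuffle.class</name>
  <value>org.apache.tez.auxservices.ShuffleHandler</value>
</property>
```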
[jira] [Comment Edited] (TEZ-1248) Reduce slow-start should special case 1 reducer runs
[ https://issues.apache.org/jira/browse/TEZ-1248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371643#comment-15371643 ] Bikas Saha edited comment on TEZ-1248 at 7/11/16 9:16 PM: -- lgtm. seems like a simple code change whose side-effect produces the results for this jira :P +1. Thanks for the fix! was (Author: bikassaha): lgtm. seems like a simple code change whose side-effect produces the results for this jira :P > Reduce slow-start should special case 1 reducer runs > > > Key: TEZ-1248 > URL: https://issues.apache.org/jira/browse/TEZ-1248 > Project: Apache Tez > Issue Type: Improvement >Affects Versions: 0.5.0 > Environment: 20 node cluster running tez >Reporter: Gopal V >Assignee: Zhiyuan Yang >Priority: Critical > Attachments: TEZ-1248.1.patch > > > Reducer slow-start has a performance problem for the small cases where there > is just 1 reducer for a case with a single wave. > Tez knows the split count and wave count, being able to determine if the > cluster has enough spare capacity to run the reducer earlier for lower > latency in a N-mapper -> 1 reducer case. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1248) Reduce slow-start should special case 1 reducer runs
[ https://issues.apache.org/jira/browse/TEZ-1248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371643#comment-15371643 ] Bikas Saha commented on TEZ-1248: - lgtm. seems like a simple code change whose side-effect produces the results for this jira :P > Reduce slow-start should special case 1 reducer runs > > > Key: TEZ-1248 > URL: https://issues.apache.org/jira/browse/TEZ-1248 > Project: Apache Tez > Issue Type: Improvement >Affects Versions: 0.5.0 > Environment: 20 node cluster running tez >Reporter: Gopal V >Assignee: Zhiyuan Yang >Priority: Critical > Attachments: TEZ-1248.1.patch > > > Reducer slow-start has a performance problem for the small cases where there > is just 1 reducer for a case with a single wave. > Tez knows the split count and wave count, being able to determine if the > cluster has enough spare capacity to run the reducer earlier for lower > latency in a N-mapper -> 1 reducer case. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-3334) Tez Custom Shuffle Handler
[ https://issues.apache.org/jira/browse/TEZ-3334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371277#comment-15371277 ] Bikas Saha commented on TEZ-3334: - +1. The new YARN aux service isolation work should make this easier to deploy alongside the existing MR shuffle while we iron things out. > Tez Custom Shuffle Handler > -- > > Key: TEZ-3334 > URL: https://issues.apache.org/jira/browse/TEZ-3334 > Project: Apache Tez > Issue Type: Bug >Reporter: Jonathan Eagles > > For conditions where auto-parallelism is reduced (e.g. TEZ-3222), a custom > shuffle handler could help reduce the number of fetches and could more > efficiently fetch data. In particular if a reducer is fetching 100 pieces > serially from the same mapper it could do this in one fetch call. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-3287) Have UnorderedPartitionedKVWriter honor tez.runtime.empty.partitions.info-via-events.enabled
[ https://issues.apache.org/jira/browse/TEZ-3287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15351682#comment-15351682 ] Bikas Saha commented on TEZ-3287: - [~rajesh.balamohan] [~sseth] please help review > Have UnorderedPartitionedKVWriter honor > tez.runtime.empty.partitions.info-via-events.enabled > > > Key: TEZ-3287 > URL: https://issues.apache.org/jira/browse/TEZ-3287 > Project: Apache Tez > Issue Type: Improvement >Reporter: Ming Ma >Assignee: Tsuyoshi Ozawa > Attachments: TEZ-3287.001.patch > > > The ordered partitioned output allows applications to specify if empty > partition stats should be included as part of DataMovementEvent via a > configuration. It seems unordered partitioned output should honor that > configuration as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-3291) Optimize splits grouping when locality information is not available
[ https://issues.apache.org/jira/browse/TEZ-3291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15342583#comment-15342583 ] Bikas Saha commented on TEZ-3291: - Sure. Let's create a follow-up jira. > Optimize splits grouping when locality information is not available > --- > > Key: TEZ-3291 > URL: https://issues.apache.org/jira/browse/TEZ-3291 > Project: Apache Tez > Issue Type: Improvement >Reporter: Rajesh Balamohan >Priority: Minor > Attachments: TEZ-3291.2.patch, TEZ-3291.3.patch, TEZ-3291.4.patch, > TEZ-3291.5.patch, TEZ-3291.WIP.patch > > > There are scenarios where splits might not contain the location details. S3 > is an example, where all splits would have "localhost" for the location > details. In such cases, current split computation does not go through the > rack local and allow-small groups optimizations and ends up creating a small > number of splits. Depending on clusters this can end up creating long-running > map jobs. > Example with hive: > == > 1. Inventory table in tpc-ds dataset is partitioned and is a relatively small > table. > 2. With query-22, hive requests with the original splits count as 52 and > overall length of splits themselves is around 12061817 bytes. > {{tez.grouping.min-size}} was set to 16 MB. > 3. In tez splits grouping, this ends up creating a single split with 52+ > files to be processed in the split. In clusters with split locations, this > would have ended up with multiple splits since {{allowSmallGroups}} would > have kicked in. > But in S3, since everything would have "localhost" all splits get added to a > single group. This makes things a lot worse. > 4. Depending on the dataset and the format, this can be problematic. For > instance, file open calls and random seeks can be expensive in S3. > 5. In this case, 52 files have to be opened and processed by a single task in > sequential fashion. Had it been processed by multiple tasks, response time > would have drastically reduced. 
> E.g log details > {noformat} > 2016-06-01 13:48:08,353 [INFO] [InputInitializer {Map 2} #0] > |split.TezMapredSplitsGrouper|: Grouping splits in Tez > 2016-06-01 13:48:08,353 [INFO] [InputInitializer {Map 2} #0] > |split.TezMapredSplitsGrouper|: Desired splits: 110 too large. Desired > splitLength: 109652 Min splitLength: 16777216 New desired splits: 1 Total > length: 12061817 Original splits: 52 > 2016-06-01 13:48:08,354 [INFO] [InputInitializer {Map 2} #0] > |split.TezMapredSplitsGrouper|: Desired numSplits: 1 lengthPerGroup: 12061817 > numLocations: 1 numSplitsPerLocation: 52 numSplitsInGroup: 52 totalLength: > 12061817 numOriginalSplits: 52 . Grouping by length: true count: false > 2016-06-01 13:48:08,354 [INFO] [InputInitializer {Map 2} #0] > |split.TezMapredSplitsGrouper|: Number of splits desired: 1 created: 1 > splitsProcessed: 52 > {noformat} > Alternate options: > == > 1. Force Hadoop to provide bogus locations for S3. But not sure, if that > would be accepted anytime soon. Ref: HADOOP-12878 > 2. Set {{tez.grouping.min-size}} to very very low value. But should the end > user always be doing this on query to query basis? > 3. When {{(lengthPerGroup < "tez.grouping.min-size")}}, recompute > desiredNumSplits only when number of distinct locations in the splits is > 1. > This would force more number of splits to be generated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-3296) Tez job can hang if two vertices at the same root distance have different task requirements
[ https://issues.apache.org/jira/browse/TEZ-3296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15335081#comment-15335081 ] Bikas Saha commented on TEZ-3296: - Thanks! It's clear now. > Tez job can hang if two vertices at the same root distance have different > task requirements > --- > > Key: TEZ-3296 > URL: https://issues.apache.org/jira/browse/TEZ-3296 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.7.1 >Reporter: Jason Lowe >Assignee: Jason Lowe >Priority: Critical > Fix For: 0.7.2, 0.9.0, 0.8.4 > > Attachments: TEZ-3296.001.patch, taskschedulerlog > > > When two vertices have the same distance from the root Tez will schedule > containers with the same priority. However those vertices could have > different task requirements and therefore different capabilities. As > documented in YARN-314, YARN currently doesn't support requests for multiple > sizes at the same priority. In practice this leads to one vertex allocation > requests clobbering the other, and that can result in a situation where the > Tez AM is waiting on containers it will never receive from the RM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-3296) Tez job can hang if two vertices at the same root distance have different task requirements
[ https://issues.apache.org/jira/browse/TEZ-3296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15334623#comment-15334623 ] Bikas Saha commented on TEZ-3296: - Ah. Looks like a result of using priority as a key for unique requests vs using it as just a priority. > Tez job can hang if two vertices at the same root distance have different > task requirements > --- > > Key: TEZ-3296 > URL: https://issues.apache.org/jira/browse/TEZ-3296 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.7.1 >Reporter: Jason Lowe >Assignee: Jason Lowe >Priority: Critical > Fix For: 0.7.2, 0.9.0, 0.8.4 > > Attachments: TEZ-3296.001.patch > > > When two vertices have the same distance from the root Tez will schedule > containers with the same priority. However those vertices could have > different task requirements and therefore different capabilities. As > documented in YARN-314, YARN currently doesn't support requests for multiple > sizes at the same priority. In practice this leads to one vertex allocation > requests clobbering the other, and that can result in a situation where the > Tez AM is waiting on containers it will never receive from the RM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TEZ-3296) Tez job can hang if two vertices at the same root distance have different task requirements
[ https://issues.apache.org/jira/browse/TEZ-3296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15334623#comment-15334623 ] Bikas Saha edited comment on TEZ-3296 at 6/16/16 8:29 PM: -- Ah. Looks like a result of using priority as a key for unique requests vs using it as just a priority. It's one thing to not support multiple resource sizes at the same priority and another to lose such requests altogether. Sigh! /cc [~vinodkv] [~wangda] was (Author: bikassaha): Ah. Looks like a result of using priority as a key for unique requests vs using it as just a priority. Sigh! /cc [~vinodkv] [~wangda] > Tez job can hang if two vertices at the same root distance have different > task requirements > --- > > Key: TEZ-3296 > URL: https://issues.apache.org/jira/browse/TEZ-3296 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.7.1 >Reporter: Jason Lowe >Assignee: Jason Lowe >Priority: Critical > Fix For: 0.7.2, 0.9.0, 0.8.4 > > Attachments: TEZ-3296.001.patch > > > When two vertices have the same distance from the root Tez will schedule > containers with the same priority. However those vertices could have > different task requirements and therefore different capabilities. As > documented in YARN-314, YARN currently doesn't support requests for multiple > sizes at the same priority. In practice this leads to one vertex allocation > requests clobbering the other, and that can result in a situation where the > Tez AM is waiting on containers it will never receive from the RM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TEZ-3296) Tez job can hang if two vertices at the same root distance have different task requirements
[ https://issues.apache.org/jira/browse/TEZ-3296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15334623#comment-15334623 ] Bikas Saha edited comment on TEZ-3296 at 6/16/16 8:29 PM: -- Ah. Looks like a result of using priority as a key for unique requests vs using it as just a priority. Sigh! /cc [~vinodkv] [~wangda] was (Author: bikassaha): Ah. Looks like a result of using priority as a key for unique requests vs using it as just a priority. > Tez job can hang if two vertices at the same root distance have different > task requirements > --- > > Key: TEZ-3296 > URL: https://issues.apache.org/jira/browse/TEZ-3296 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.7.1 >Reporter: Jason Lowe >Assignee: Jason Lowe >Priority: Critical > Fix For: 0.7.2, 0.9.0, 0.8.4 > > Attachments: TEZ-3296.001.patch > > > When two vertices have the same distance from the root Tez will schedule > containers with the same priority. However those vertices could have > different task requirements and therefore different capabilities. As > documented in YARN-314, YARN currently doesn't support requests for multiple > sizes at the same priority. In practice this leads to one vertex allocation > requests clobbering the other, and that can result in a situation where the > Tez AM is waiting on containers it will never receive from the RM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TEZ-3296) Tez job can hang if two vertices at the same root distance have different task requirements
[ https://issues.apache.org/jira/browse/TEZ-3296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15334492#comment-15334492 ] Bikas Saha edited comment on TEZ-3296 at 6/16/16 7:20 PM: -- Sure. Let's commit this patch. Could you please attach the task scheduler logs for the hung job and mention the conflicting vertices? I follow what you described above and I'd expect the RM to return x+y containers at 2G where x is at 1.5G and y at 2G. The AM should accept y containers at 2G for vertex2G and x containers at 2G for vertex1.5G because 2G > 1.5G and the matching heuristic in AMRMClient considers fits-in vs exact match, since the RM is always guaranteed to return a container that's larger than requested due to rounding. E.g. if the min container size is 1G then asking for 1.5G will return 2G containers and the situation would still be the same for the vertex1.5G in the AM. One reason why I think it may hang is if the RM returns x+y containers at 1.5G because then y containers for vertex2G would never get a match. Or the RM returns fewer than x+y containers at 2G. The second case would be a bad RM bug that should be fixed in YARN urgently. The AM logs would shed some light on this. was (Author: bikassaha): Sure. Let's commit this patch. Could you please attach the task scheduler logs for the hung job and mention the conflicting vertices? I follow what you described above and I'd expect the RM to return x+y containers at 2G where x is at 1.5G and y at 2G. The AM should accept y containers at 2G for vertex2G and x containers at 2G for vertex1.5G because 2G > 1.5G and the matching heuristic in AMRMClient considers fits-in vs exact match, since the RM is always guaranteed to return a container that's larger than requested due to rounding. E.g. if the min container size is 1G then asking for 1.5G will return 2G containers and the situation would still be the same for the vertex1.5G in the AM. 
One reason why I think it may hang is if the RM returns x+y containers at 1.5G because then y containers for vertex2G would never get a match. Or the RM returns fewer than x+y containers at 2G. The second case would be a bad RM bug that should be fixed in YARN urgently. The AM logs would shed some light on this. > Tez job can hang if two vertices at the same root distance have different > task requirements > --- > > Key: TEZ-3296 > URL: https://issues.apache.org/jira/browse/TEZ-3296 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.7.1 >Reporter: Jason Lowe >Assignee: Jason Lowe >Priority: Critical > Attachments: TEZ-3296.001.patch > > > When two vertices have the same distance from the root Tez will schedule > containers with the same priority. However those vertices could have > different task requirements and therefore different capabilities. As > documented in YARN-314, YARN currently doesn't support requests for multiple > sizes at the same priority. In practice this leads to one vertex allocation > requests clobbering the other, and that can result in a situation where the > Tez AM is waiting on containers it will never receive from the RM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TEZ-3296) Tez job can hang if two vertices at the same root distance have different task requirements
[ https://issues.apache.org/jira/browse/TEZ-3296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15334492#comment-15334492 ] Bikas Saha edited comment on TEZ-3296 at 6/16/16 7:20 PM: -- Sure. Let's commit this patch. Could you please attach the task scheduler logs for the hung job and mention the conflicting vertices? I follow what you described above and I'd expect the RM to return x+y containers at 2G where x is at 1.5G and y at 2G. The AM should accept y containers at 2G for vertex2G and x containers at 2G for vertex1.5G because 2G > 1.5G and the matching heuristic in AMRMClient considers fits-in vs exact match, since the RM is always guaranteed to return a container that's larger than requested due to rounding. E.g. if the min container size is 1G then asking for 1.5G will return 2G containers and the situation would still be the same for the vertex1.5G in the AM. One reason why I think it may hang is if the RM returns x+y containers at 1.5G because then y containers for vertex2G would never get a match. Or the RM returns fewer than x+y containers at 2G. The second case would be a bad RM bug that should be fixed in YARN urgently. The AM logs would shed some light on this. was (Author: bikassaha): Sure. Let's commit this patch. Could you please attach the task scheduler logs for the hung job and mention the conflicting vertices? I follow what you described above and I'd expect the RM to return x+y containers at 2G where x is at 1.5G and y at 2G. The AM should accept y containers at 2G for vertex2G and x containers at 2G for vertex1.5G because 2G > 1.5G and the matching heuristic in AMRMClient considers fitsIn vs exact match, since the RM is always guaranteed to return a container that's larger than requested due to rounding. E.g. if the min container size is 1G then asking for 1.5G will return 2G containers and the situation would still be the same for the vertex1.5G in the AM. 
One reason why I think it may hang is if the RM returns x+y containers at 1.5G because then y containers for vertex2G would never get a match. Or the RM returns fewer than x+y containers at 2G. The second case would be a bad RM bug that should be fixed in YARN urgently. The AM logs would shed some light on this. > Tez job can hang if two vertices at the same root distance have different > task requirements > --- > > Key: TEZ-3296 > URL: https://issues.apache.org/jira/browse/TEZ-3296 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.7.1 >Reporter: Jason Lowe >Assignee: Jason Lowe >Priority: Critical > Attachments: TEZ-3296.001.patch > > > When two vertices have the same distance from the root Tez will schedule > containers with the same priority. However those vertices could have > different task requirements and therefore different capabilities. As > documented in YARN-314, YARN currently doesn't support requests for multiple > sizes at the same priority. In practice this leads to one vertex allocation > requests clobbering the other, and that can result in a situation where the > Tez AM is waiting on containers it will never receive from the RM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-3296) Tez job can hang if two vertices at the same root distance have different task requirements
[ https://issues.apache.org/jira/browse/TEZ-3296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15334492#comment-15334492 ] Bikas Saha commented on TEZ-3296: - Sure. Let's commit this patch. Could you please attach the task scheduler logs for the hung job and mention the conflicting vertices? I follow what you described above and I'd expect the RM to return x+y containers at 2G where x is at 1.5G and y at 2G. The AM should accept y containers at 2G for vertex2G and x containers at 2G for vertex1.5G because 2G > 1.5G and the matching heuristic in AMRMClient considers fitsIn vs exact match, since the RM is always guaranteed to return a container that's larger than requested due to rounding. E.g. if the min container size is 1G then asking for 1.5G will return 2G containers and the situation would still be the same for the vertex1.5G in the AM. One reason why I think it may hang is if the RM returns x+y containers at 1.5G because then y containers for vertex2G would never get a match. Or the RM returns fewer than x+y containers at 2G. The second case would be a bad RM bug that should be fixed in YARN urgently. The AM logs would shed some light on this. > Tez job can hang if two vertices at the same root distance have different > task requirements > --- > > Key: TEZ-3296 > URL: https://issues.apache.org/jira/browse/TEZ-3296 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.7.1 >Reporter: Jason Lowe >Assignee: Jason Lowe >Priority: Critical > Attachments: TEZ-3296.001.patch > > > When two vertices have the same distance from the root Tez will schedule > containers with the same priority. However those vertices could have > different task requirements and therefore different capabilities. As > documented in YARN-314, YARN currently doesn't support requests for multiple > sizes at the same priority. In practice this leads to one vertex allocation > requests clobbering the other, and that can result in a situation where the > Tez AM is waiting on containers it will never receive from the RM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
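The rounding and fits-in reasoning in the comment above can be made concrete with a small sketch. This is plain Java, not AMRMClient code, and the 1024 MB minimum allocation is an assumed cluster setting:

```java
/** Sketch (not Tez/YARN code): why a 1.5G request can be served by a 2G container. */
public class RoundingSketch {
    // The RM rounds each request up to a multiple of the minimum allocation.
    static long roundUp(long requestedMb, long minAllocMb) {
        return ((requestedMb + minAllocMb - 1) / minAllocMb) * minAllocMb;
    }

    // "Fits-in" matching: an allocated container can serve any request
    // whose resource ask is no larger than the container.
    static boolean fitsIn(long requestedMb, long allocatedMb) {
        return requestedMb <= allocatedMb;
    }
}
```

So with a 1G minimum, the 1.5G vertex's request is rounded to 2G, and a 2G container should match it under fits-in semantics; the hang only makes sense if the RM returns containers at 1.5G or returns too few at 2G.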
[jira] [Commented] (TEZ-3296) Tez job can hang if two vertices at the same root distance have different task requirements
[ https://issues.apache.org/jira/browse/TEZ-3296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328921#comment-15328921 ] Bikas Saha commented on TEZ-3296: - Sorry. My bad. I even used a calculator for that :P If this is urgent I think we can go with the current proposal. Would be good to open a follow-up item to use a BFS or topo-sort-based method that uses the priority space more conservatively. > Tez job can hang if two vertices at the same root distance have different > task requirements > --- > > Key: TEZ-3296 > URL: https://issues.apache.org/jira/browse/TEZ-3296 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.7.1 >Reporter: Jason Lowe >Assignee: Jason Lowe >Priority: Critical > Attachments: TEZ-3296.001.patch > > > When two vertices have the same distance from the root Tez will schedule > containers with the same priority. However those vertices could have > different task requirements and therefore different capabilities. As > documented in YARN-314, YARN currently doesn't support requests for multiple > sizes at the same priority. In practice this leads to one vertex allocation > requests clobbering the other, and that can result in a situation where the > Tez AM is waiting on containers it will never receive from the RM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
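A follow-up along the lines suggested above could hand every vertex its own priority in BFS order from the roots, so two vertices at the same root distance never share one. A minimal sketch (not Tez code; vertex names are illustrative, and a production version would track in-degrees, i.e. Kahn's algorithm, rather than plain BFS):

```java
import java.util.*;

/** Sketch: assign each vertex a distinct, monotonically increasing priority
 *  in BFS order from the DAG roots. */
public class PrioritySketch {
    // edges: vertex -> downstream vertices; roots have no incoming edges
    static Map<String, Integer> assignPriorities(Map<String, List<String>> edges,
                                                 List<String> roots) {
        Map<String, Integer> prio = new LinkedHashMap<>();
        Deque<String> queue = new ArrayDeque<>(roots);
        int next = 1;
        while (!queue.isEmpty()) {
            String v = queue.poll();
            if (prio.containsKey(v)) continue;   // already assigned
            prio.put(v, next++);                 // unique per vertex
            for (String w : edges.getOrDefault(v, List.of())) queue.add(w);
        }
        return prio;
    }
}
```

Because priorities are unique per vertex, a YARN request for one vertex's resource size can no longer clobber another vertex's requests at the same priority.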
[jira] [Commented] (TEZ-3291) Optimize splits grouping when locality information is not available
[ https://issues.apache.org/jira/browse/TEZ-3291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328326#comment-15328326 ] Bikas Saha commented on TEZ-3291: - I am with Gopal on the fragility of this workaround. A single-machine setup is affected: we assume localhost will not be used as a real location, but it could be. [~gopalv] [~rajesh.balamohan] can we please evaluate an extension of fileSizeEstimator or something similar to handle this. My gut feeling is that this is not the first S3-related issue we will hit, and having an abstraction in place might make handling future issues better. > Optimize splits grouping when locality information is not available > --- > > Key: TEZ-3291 > URL: https://issues.apache.org/jira/browse/TEZ-3291 > Project: Apache Tez > Issue Type: Improvement >Reporter: Rajesh Balamohan >Priority: Minor > Attachments: TEZ-3291.2.patch, TEZ-3291.3.patch, TEZ-3291.4.patch, > TEZ-3291.WIP.patch > > > There are scenarios where splits might not contain the location details. S3 > is an example, where all splits would have "localhost" for the location > details. In such cases, current split computation does not go through the > rack local and allow-small groups optimizations and ends up creating a small > number of splits. Depending on clusters this can end up creating long-running > map jobs. > Example with hive: > == > 1. Inventory table in tpc-ds dataset is partitioned and is a relatively small > table. > 2. With query-22, hive requests with the original splits count as 52 and > overall length of splits themselves is around 12061817 bytes. > {{tez.grouping.min-size}} was set to 16 MB. > 3. In tez splits grouping, this ends up creating a single split with 52+ > files to be processed in the split. In clusters with split locations, this > would have ended up with multiple splits since {{allowSmallGroups}} would > have kicked in. > But in S3, since everything would have "localhost" all splits get added to a > single group. 
This makes things a lot worse. > 4. Depending on the dataset and the format, this can be problematic. For > instance, file open calls and random seeks can be expensive in S3. > 5. In this case, 52 files have to be opened and processed by single task in > sequential fashion. Had it been processed by multiple tasks, response time > would have drastically reduced. > E.g log details > {noformat} > 2016-06-01 13:48:08,353 [INFO] [InputInitializer {Map 2} #0] > |split.TezMapredSplitsGrouper|: Grouping splits in Tez > 2016-06-01 13:48:08,353 [INFO] [InputInitializer {Map 2} #0] > |split.TezMapredSplitsGrouper|: Desired splits: 110 too large. Desired > splitLength: 109652 Min splitLength: 16777216 New desired splits: 1 Total > length: 12061817 Original splits: 52 > 2016-06-01 13:48:08,354 [INFO] [InputInitializer {Map 2} #0] > |split.TezMapredSplitsGrouper|: Desired numSplits: 1 lengthPerGroup: 12061817 > numLocations: 1 numSplitsPerLocation: 52 numSplitsInGroup: 52 totalLength: > 12061817 numOriginalSplits: 52 . Grouping by length: true count: false > 2016-06-01 13:48:08,354 [INFO] [InputInitializer {Map 2} #0] > |split.TezMapredSplitsGrouper|: Number of splits desired: 1 created: 1 > splitsProcessed: 52 > {noformat} > Alternate options: > == > 1. Force Hadoop to provide bogus locations for S3. But not sure, if that > would be accepted anytime soon. Ref: HADOOP-12878 > 2. Set {{tez.grouping.min-size}} to very very low value. But should the end > user always be doing this on query to query basis? > 3. When {{(lengthPerGroup < "tez.grouping.min-size")}}, recompute > desiredNumSplits only when number of distinct locations in the splits is > 1. > This would force more number of splits to be generated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-3291) Optimize splits grouping when locality information is not available
[ https://issues.apache.org/jira/browse/TEZ-3291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15326689#comment-15326689 ] Bikas Saha commented on TEZ-3291: - The comment could be more explicit, like "this is a workaround for systems like S3 that pass the same fake hostname for all splits". The log statement could include newDesiredSplits and also the final value of desired splits so that we get all the info in one log line. > Optimize splits grouping when locality information is not available > --- > > Key: TEZ-3291 > URL: https://issues.apache.org/jira/browse/TEZ-3291 > Project: Apache Tez > Issue Type: Improvement >Reporter: Rajesh Balamohan >Priority: Minor > Attachments: TEZ-3291.2.patch, TEZ-3291.WIP.patch > > > There are scenarios where splits might not contain the location details. S3 > is an example, where all splits would have "localhost" for the location > details. In such cases, current split computation does not go through the > rack local and allow-small groups optimizations and ends up creating a small > number of splits. Depending on clusters this can end up creating long-running > map jobs. > Example with hive: > == > 1. Inventory table in tpc-ds dataset is partitioned and is a relatively small > table. > 2. With query-22, hive requests with the original splits count as 52 and > overall length of splits themselves is around 12061817 bytes. > {{tez.grouping.min-size}} was set to 16 MB. > 3. In tez splits grouping, this ends up creating a single split with 52+ > files to be processed in the split. In clusters with split locations, this > would have ended up with multiple splits since {{allowSmallGroups}} would > have kicked in. > But in S3, since everything would have "localhost" all splits get added to a > single group. This makes things a lot worse. > 4. Depending on the dataset and the format, this can be problematic. For > instance, file open calls and random seeks can be expensive in S3. > 5. 
In this case, 52 files have to be opened and processed by single task in > sequential fashion. Had it been processed by multiple tasks, response time > would have drastically reduced. > E.g log details > {noformat} > 2016-06-01 13:48:08,353 [INFO] [InputInitializer {Map 2} #0] > |split.TezMapredSplitsGrouper|: Grouping splits in Tez > 2016-06-01 13:48:08,353 [INFO] [InputInitializer {Map 2} #0] > |split.TezMapredSplitsGrouper|: Desired splits: 110 too large. Desired > splitLength: 109652 Min splitLength: 16777216 New desired splits: 1 Total > length: 12061817 Original splits: 52 > 2016-06-01 13:48:08,354 [INFO] [InputInitializer {Map 2} #0] > |split.TezMapredSplitsGrouper|: Desired numSplits: 1 lengthPerGroup: 12061817 > numLocations: 1 numSplitsPerLocation: 52 numSplitsInGroup: 52 totalLength: > 12061817 numOriginalSplits: 52 . Grouping by length: true count: false > 2016-06-01 13:48:08,354 [INFO] [InputInitializer {Map 2} #0] > |split.TezMapredSplitsGrouper|: Number of splits desired: 1 created: 1 > splitsProcessed: 52 > {noformat} > Alternate options: > == > 1. Force Hadoop to provide bogus locations for S3. But not sure, if that > would be accepted anytime soon. Ref: HADOOP-12878 > 2. Set {{tez.grouping.min-size}} to very very low value. But should the end > user always be doing this on query to query basis? > 3. When {{(lengthPerGroup < "tez.grouping.min-size")}}, recompute > desiredNumSplits only when number of distinct locations in the splits is > 1. > This would force more number of splits to be generated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
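Option 3 above can be sketched as follows. This is a hypothetical, simplified model of the grouping heuristic, not the actual TezSplitGrouper code; the function name and parameters are illustrative only, and the numbers in the checks come from the log in the report.

```python
def desired_num_splits(original_desired, total_length, min_size, distinct_locations):
    """Illustrative stand-in for the split-count heuristic.

    length_per_group is how large each grouped split would be if the
    caller's desired split count were honored as-is.
    """
    length_per_group = total_length / original_desired
    if length_per_group < min_size:
        if distinct_locations > 1:
            # Normal case: clamp the count so groups are at least min-size.
            return max(1, total_length // min_size)
        # All splits claim the same (likely fake) location, e.g. S3
        # reporting "localhost" for everything: keep the caller's count
        # so the work is still spread across many tasks.
        return original_desired
    return original_desired

# From the log: 110 desired splits over 12061817 bytes, min-size 16 MB.
assert desired_num_splits(110, 12061817, 16 * 1024 * 1024, 5) == 1    # clamped today
assert desired_num_splits(110, 12061817, 16 * 1024 * 1024, 1) == 110  # with option 3
```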
[jira] [Commented] (TEZ-3291) Optimize splits grouping when locality information is not available
[ https://issues.apache.org/jira/browse/TEZ-3291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15326684#comment-15326684 ] Bikas Saha commented on TEZ-3291: - Would the split not have the URLs with S3 in them? Wondering how the ORC split size estimator works. If it casts the split into ORCSplit and inspects internal members, then perhaps the S3 split could also be cast into the correct object to look at the URLs?
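The cast-and-inspect idea could look roughly like this sketch. All class and function names here are hypothetical; a real version would be a Java interface analogous to the existing SplitSizeEstimator, and would cast to the concrete split class to read its path.

```python
class SplitLike:
    """Stand-in for an input split that knows its underlying path.
    Purely illustrative; not a real Tez or ORC class."""
    def __init__(self, path, hosts):
        self.path = path
        self.hosts = hosts

def locations_potentially_fake(split):
    # Look at the split's underlying URL scheme rather than its
    # (possibly bogus) reported hosts.
    scheme = split.path.split("://", 1)[0]
    return scheme in ("s3", "s3a", "s3n")

assert locations_potentially_fake(SplitLike("s3a://bucket/tbl/part-0", ["localhost"]))
assert not locations_potentially_fake(SplitLike("hdfs://nn/tbl/part-0", ["host1"]))
```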
[jira] [Commented] (TEZ-3296) Tez job can hang if two vertices at the same root distance have different task requirements
[ https://issues.apache.org/jira/browse/TEZ-3296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15326673#comment-15326673 ] Bikas Saha commented on TEZ-3296: - bq. Today each vertex uses a set of three priority values, the low, the high, and the mean of those two. (Oddly containers for high are never requested in practice, just the low and mean.) The middle priority is the default. The lower value (higher priority) is for failed task reruns. The higher value (lower priority) was intended for speculative tasks but may have been missed being used for that. Wondering why the app hung. IIRC YARN keeps the higher resource request when there are multiple at the same priority because that's the safer thing to do. So when 2 vertices have the same priority but different resources, we would expect to get containers for both, but with the higher resource value across the board. If the above is correct, then perhaps there is a bug in the task scheduler code that needs to get fixed, which we might miss if we change the vertex priorities to be unique as a workaround. The vertex priority change is good in its own right. But it would be good to make sure we don't have some pending bug in the task scheduler that may have other side effects. Could you please attach the task scheduler log for the job that hung, in case it has some clues. On the patch itself, the formula looks like (Height*Total*3) + V*3. Now - (1*24*3) + 20*3 = 150 = (2*24*3) + 2*3 So we could still have collisions depending on the manner in which vertexIds get assigned, right? Unless currently we are getting lucky in the vId assignment such that vertices close to the root also happen to get low ids.
> Tez job can hang if two vertices at the same root distance have different > task requirements > --- > > Key: TEZ-3296 > URL: https://issues.apache.org/jira/browse/TEZ-3296 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.7.1 >Reporter: Jason Lowe >Assignee: Jason Lowe >Priority: Critical > Attachments: TEZ-3296.001.patch > > > When two vertices have the same distance from the root Tez will schedule > containers with the same priority. However those vertices could have > different task requirements and therefore different capabilities. As > documented in YARN-314, YARN currently doesn't support requests for multiple > sizes at the same priority. In practice this leads to one vertex allocation > requests clobbering the other, and that can result in a situation where the > Tez AM is waiting on containers it will never receive from the RM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
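The collision question in the comment above can be checked numerically. A quick sketch using the quoted formula and the comment's own values (note the two quoted examples actually evaluate to 132 and 150, and as long as each vertexId is below the vertex total, the per-height ranges cannot overlap):

```python
def priority(height, vertex_id, num_vertices):
    # Formula quoted in the comment: (Height * Total * 3) + V * 3
    return height * num_vertices * 3 + vertex_id * 3

# The comment's example values, with Total = 24 vertices:
assert priority(1, 20, 24) == 132
assert priority(2, 2, 24) == 150
# For heights h and h+1 with 0 <= vertex_id < num_vertices, the ranges
# [3*h*N, 3*h*N + 3*(N-1)] and [3*(h+1)*N, ...] are disjoint, so two
# vertices at different heights cannot share a priority under this formula.
```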
[jira] [Commented] (TEZ-3297) Deadlock scenario in AM during ShuffleVertexManager auto reduce
[ https://issues.apache.org/jira/browse/TEZ-3297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15326660#comment-15326660 ] Bikas Saha commented on TEZ-3297: - Looking at the code further, it looks like the crucial change is not holding our own vertex lock while trying to take the src/dest vertex lock. That makes sense; the old pattern seems like a lock-ordering issue waiting to happen. Perhaps a quick scan for such nested locking is in order, in case it is not already done. The removal of the overall lock is fine since each internal method invocation, like getTotalTasks(), already handles its own locking. lgtm. Moving VM-invoked sync calls onto the dispatcher is a good idea but would need the addition of new callbacks into the VM to notify them of completion of the requested vertex state change operation. Since most current VMs don't do much after changing parallelism, the change might be simpler to implement now. Not sure about Hive custom VMs. > Deadlock scenario in AM during ShuffleVertexManager auto reduce > --- > > Key: TEZ-3297 > URL: https://issues.apache.org/jira/browse/TEZ-3297 > Project: Apache Tez > Issue Type: Bug >Reporter: Zhiyuan Yang >Priority: Critical > Attachments: TEZ-3297.1.patch, TEZ-3297.2.patch, am_log, thread_dump > > > Here is what's happening in the attached thread dump. > App Pool thread #9 does the auto reduce on V2 and initializes the new edge > manager; it holds the V2 write lock and wants the read lock of source vertex V1. > At the same time, another App Pool thread #2 schedules a task of V1 and gets > the output spec, so it holds the V1 read lock and wants the V2 read lock. > Also, the dispatcher thread wants the V1 write lock to begin the state machine > transition. Since the dispatcher thread is at the head of the V1 ReadWriteLock queue, > thread #9 cannot get the V1 read lock even though thread #2 is holding the V1 read lock. > This is a circular lock scenario. #2 blocks dispatcher, dispatcher blocks #9, > and #9 blocks #2.
> There is no problem with ReadWriteLock behavior in this case. Please see this > java bug report, http://bugs.java.com/bugdatabase/view_bug.do?bug_id=6816565. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
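The circular wait described above is the classic lock-ordering problem. One standard discipline (a sketch only, not the actual patch, which instead avoids holding the vertex's own lock while reading other vertices) is to acquire every needed lock up front in a fixed global order:

```python
import threading

class Vertex:
    _next_rank = 0
    def __init__(self, name):
        self.name = name
        self.lock = threading.RLock()
        # Fixed global rank used purely to order lock acquisition.
        self.rank = Vertex._next_rank
        Vertex._next_rank += 1

def with_vertex_locks(vertices, fn):
    """Acquire all needed vertex locks in rank order before doing any
    work, so two threads can never hold locks in opposite orders."""
    ordered = sorted(vertices, key=lambda v: v.rank)
    for v in ordered:
        v.lock.acquire()
    try:
        return fn()
    finally:
        for v in reversed(ordered):
            v.lock.release()

v1, v2 = Vertex("V1"), Vertex("V2")
# Call sites may list the locks in any order; acquisition order is identical.
assert with_vertex_locks([v2, v1], lambda: "ok") == "ok"
assert with_vertex_locks([v1, v2], lambda: "ok") == "ok"
```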
[jira] [Commented] (TEZ-3216) Support for more precise partition stats in VertexManagerEvent
[ https://issues.apache.org/jira/browse/TEZ-3216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15326650#comment-15326650 ] Bikas Saha commented on TEZ-3216: - /cc [~rajesh.balamohan] in case he is interested in this optimization. > Support for more precise partition stats in VertexManagerEvent > -- > > Key: TEZ-3216 > URL: https://issues.apache.org/jira/browse/TEZ-3216 > Project: Apache Tez > Issue Type: Improvement >Reporter: Ming Ma >Assignee: Ming Ma > Attachments: TEZ-3216.patch > > > Follow up on TEZ-3206 discussion, at least for some use cases, more accurate > partition stats will be useful for DataMovementEvent routing. Maybe we can > provide a config option to allow apps to choose the more accurate partition > stats over RoaringBitmap. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-3291) Optimize splits grouping when locality information is not available
[ https://issues.apache.org/jira/browse/TEZ-3291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15326649#comment-15326649 ] Bikas Saha commented on TEZ-3291: - Why is the numLoc=1 check only in the size < min case? A comment before the code, explaining the above workaround, would be useful. Also a log statement. This may affect single-node cases because numLoc=1 in that case too. Is there any way we can find out if the splits are coming from an S3-like source and use that information instead? E.g. something similar to splitSizeEstimator that can look at the split and return whether its locations are potentially fake.
[jira] [Commented] (TEZ-3300) Tez UI: A wiki must be created with info about each page in Tez UI
[ https://issues.apache.org/jira/browse/TEZ-3300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15326638#comment-15326638 ] Bikas Saha commented on TEZ-3300: - Could pages to the wiki be linked directly from the UI page for quick access? > Tez UI: A wiki must be created with info about each page in Tez UI > -- > > Key: TEZ-3300 > URL: https://issues.apache.org/jira/browse/TEZ-3300 > Project: Apache Tez > Issue Type: Bug >Reporter: Sreenath Somarajapuram > > - It would be a page under Tez confluence > - Must be flexible enough to support different versions of Tez UI, and give > context based help. > - Add a section on understanding various errors displayed in the error-bar. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TEZ-3300) Tez UI: A wiki must be created with info about each page in Tez UI
[ https://issues.apache.org/jira/browse/TEZ-3300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15326638#comment-15326638 ] Bikas Saha edited comment on TEZ-3300 at 6/12/16 9:22 PM: -- Could pages to the wiki be linked directly from the corresponding UI pages for quick access? was (Author: bikassaha): Could pages to the wiki be linked directly from the UI page for quick access?
[jira] [Commented] (TEZ-3291) Optimize splits grouping when locality information is not available
[ https://issues.apache.org/jira/browse/TEZ-3291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15325748#comment-15325748 ] Bikas Saha commented on TEZ-3291: - [~rajesh.balamohan] Is the patch still WIP or ready for final review?
[jira] [Commented] (TEZ-3296) Tez job can hang if two vertices at the same root distance have different task requirements
[ https://issues.apache.org/jira/browse/TEZ-3296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15324989#comment-15324989 ] Bikas Saha commented on TEZ-3296: - Could you please help me understand the logic used to make these unique? I am sorry, I could not follow it from the code :) The minimum solution would be to break ties when needed, such that each vertex has a unique priority. Right now vertex depth from the root is proxying for the priority. Instead we could do a BFS on the DAG and assign priority based on the traversal. Or we could reuse the topological sort in the client (done during DAG submission) and assign that as the priority of the vertex.
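The BFS idea above sketches out simply. This is a hypothetical helper with illustrative names; the real change would live in the DAG/Vertex code on the Java side.

```python
from collections import deque

def assign_unique_priorities(dag, roots):
    """BFS from the root vertices; each vertex gets a distinct priority
    in visit order, so ties at the same depth are broken deterministically."""
    priority, seen, queue = {}, set(roots), deque(roots)
    next_p = 0
    while queue:
        v = queue.popleft()
        priority[v] = next_p
        next_p += 1
        for child in dag.get(v, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return priority

# Two maps feeding one reducer: the maps no longer share a priority.
dag = {"M1": ["R1"], "M2": ["R1"], "R1": []}
p = assign_unique_priorities(dag, ["M1", "M2"])
assert len(set(p.values())) == 3               # all priorities distinct
assert p["M1"] < p["R1"] and p["M2"] < p["R1"]  # upstream runs first
```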
[jira] [Commented] (TEZ-3297) Deadlock scenario in AM during ShuffleVertexManager auto reduce
[ https://issues.apache.org/jira/browse/TEZ-3297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15324981#comment-15324981 ] Bikas Saha commented on TEZ-3297: - I am not sure we can simply remove the lock since it may affect visibility. Also, the assumption that the task count won't change may be inaccurate in the future. With progressive creation of splits, the task count may change with time. Similarly, input/output specs are theoretically pluggable and different per task. Let's be cautious wrt these future features when fixing this issue, else we may forget about it later on. A deadlock could sometimes be better than wrong results :)
[jira] [Commented] (TEZ-3291) Optimize splits grouping when locality information is not available
[ https://issues.apache.org/jira/browse/TEZ-3291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15323264#comment-15323264 ] Bikas Saha commented on TEZ-3291: - I will take a quick look at the patch by EOD. Looks like the main issue was that there was a split size heuristic that needed an update to account for cases where locations are invalid. The patch is using distinctLocations=1 as a proxy for invalid locations. Unless this negatively affects a real single-node cluster scenario, this should be fine.
[jira] [Commented] (TEZ-3291) Optimize splits grouping when locality information is not available
[ https://issues.apache.org/jira/browse/TEZ-3291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15319263#comment-15319263 ] Bikas Saha commented on TEZ-3291: - Then that would be a bug to fix. Hopefully that's what the patch is doing.
[jira] [Commented] (TEZ-3291) Optimize splits grouping when locality information is not available
[ https://issues.apache.org/jira/browse/TEZ-3291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15318941#comment-15318941 ] Bikas Saha commented on TEZ-3291: - IIRC they should because localhost will be treated as a valid machine name and it will group as if all splits are on the same machine. The code itself does the same thing by adding a same bogus machine location name for all splits that have no location. Thereafter the code works identically for splits that have real locations and other that have fake locations. > Optimize splits grouping when locality information is not available > --- > > Key: TEZ-3291 > URL: https://issues.apache.org/jira/browse/TEZ-3291 > Project: Apache Tez > Issue Type: Improvement >Reporter: Rajesh Balamohan >Priority: Minor > Attachments: TEZ-3291.WIP.patch > > > There are scenarios where splits might not contain the location details. S3 > is an example, where all splits would have "localhost" for the location > details. In such cases, curent split computation does not go through the > rack local and allow-small groups optimizations and ends up creating small > number of splits. Depending on clusters this can end creating long running > map jobs. > Example with hive: > == > 1. Inventory table in tpc-ds dataset is partitioned and is relatively a small > table. > 2. With query-22, hive requests with the original splits count as 52 and > overall length of splits themselves is around 12061817 bytes. > {{tez.grouping.min-size}} was set to 16 MB. > 3. In tez splits grouping, this ends up creating a single split with 52+ > files be processed in the split. In clusters with split locations, this > would have landed up with multiple splits since {{allowSmallGroups}} would > have kicked in. > But in S3, since everything would have "localhost" all splits get added to > single group. This makes things a lot worse. > 4. Depending on the dataset and the format, this can be problematic. 
For > instance, file open calls and random seeks can be expensive in S3. > 5. In this case, 52 files have to be opened and processed by single task in > sequential fashion. Had it been processed by multiple tasks, response time > would have drastically reduced. > E.g log details > {noformat} > 2016-06-01 13:48:08,353 [INFO] [InputInitializer {Map 2} #0] > |split.TezMapredSplitsGrouper|: Grouping splits in Tez > 2016-06-01 13:48:08,353 [INFO] [InputInitializer {Map 2} #0] > |split.TezMapredSplitsGrouper|: Desired splits: 110 too large. Desired > splitLength: 109652 Min splitLength: 16777216 New desired splits: 1 Total > length: 12061817 Original splits: 52 > 2016-06-01 13:48:08,354 [INFO] [InputInitializer {Map 2} #0] > |split.TezMapredSplitsGrouper|: Desired numSplits: 1 lengthPerGroup: 12061817 > numLocations: 1 numSplitsPerLocation: 52 numSplitsInGroup: 52 totalLength: > 12061817 numOriginalSplits: 52 . Grouping by length: true count: false > 2016-06-01 13:48:08,354 [INFO] [InputInitializer {Map 2} #0] > |split.TezMapredSplitsGrouper|: Number of splits desired: 1 created: 1 > splitsProcessed: 52 > {noformat} > Alternate options: > == > 1. Force Hadoop to provide bogus locations for S3. But not sure, if that > would be accepted anytime soon. Ref: HADOOP-12878 > 2. Set {{tez.grouping.min-size}} to very very low value. But should the end > user always be doing this on query to query basis? > 3. When {{(lengthPerGroup < "tez.grouping.min-size")}}, recompute > desiredNumSplits only when number of distinct locations in the splits is > 1. > This would force more number of splits to be generated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
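The comment above describes how the grouper normalizes missing locality by giving all location-less splits one shared bogus location. A minimal sketch of that normalization, with invented names (`SplitInfo`, `FAKE_LOCATION`) standing in for Tez's actual split-grouping types:

```java
import java.util.*;

// Illustrative sketch, not Tez's real API: splits without locations get one
// shared fake location, so downstream grouping logic treats splits with real
// and fake locations identically, as described in the comment above.
public class SplitLocationNormalizer {
    static final String FAKE_LOCATION = "__fake_location__";

    public static class SplitInfo {
        final long length;
        final String[] locations;
        public SplitInfo(long length, String[] locations) {
            this.length = length;
            this.locations = locations;
        }
    }

    // Replace null/empty location arrays with the shared fake location.
    public static String[] effectiveLocations(SplitInfo split) {
        if (split.locations == null || split.locations.length == 0) {
            return new String[] { FAKE_LOCATION };
        }
        return split.locations;
    }

    // Count distinct locations across all splits; a result of 1 signals the
    // degenerate "no real locality" case discussed in the JIRA (e.g. S3).
    public static int distinctLocationCount(List<SplitInfo> splits) {
        Set<String> locs = new HashSet<>();
        for (SplitInfo s : splits) {
            locs.addAll(Arrays.asList(effectiveLocations(s)));
        }
        return locs.size();
    }
}
```

A distinct-location count of 1 is exactly the condition that alternate option 3 in the description keys off.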
[jira] [Commented] (TEZ-3291) Optimize splits grouping when locality information is not available
[ https://issues.apache.org/jira/browse/TEZ-3291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15316870#comment-15316870 ] Bikas Saha commented on TEZ-3291: - Since the data fits within the max size for a grouped split its creating 1 split. Whats the issue here? > Optimize splits grouping when locality information is not available > --- > > Key: TEZ-3291 > URL: https://issues.apache.org/jira/browse/TEZ-3291 > Project: Apache Tez > Issue Type: Improvement >Reporter: Rajesh Balamohan >Priority: Minor > Attachments: TEZ-3291.WIP.patch > > > There are scenarios where splits might not contain the location details. S3 > is an example, where all splits would have "localhost" for the location > details. In such cases, curent split computation does not go through the > rack local and allow-small groups optimizations and ends up creating small > number of splits. Depending on clusters this can end creating long running > map jobs. > Example with hive: > == > 1. Inventory table in tpc-ds dataset is partitioned and is relatively a small > table. > 2. With query-22, hive requests with the original splits count as 52 and > overall length of splits themselves is around 12061817 bytes. > {{tez.grouping.min-size}} was set to 16 MB. > 3. In tez splits grouping, this ends up creating a single split with 52+ > files be processed in the split. In clusters with split locations, this > would have landed up with multiple splits since {{allowSmallGroups}} would > have kicked in. > But in S3, since everything would have "localhost" all splits get added to > single group. This makes things a lot worse. > 4. Depending on the dataset and the format, this can be problematic. For > instance, file open calls and random seeks can be expensive in S3. > 5. In this case, 52 files have to be opened and processed by single task in > sequential fashion. Had it been processed by multiple tasks, response time > would have drastically reduced. 
> E.g log details > {noformat} > 2016-06-01 13:48:08,353 [INFO] [InputInitializer {Map 2} #0] > |split.TezMapredSplitsGrouper|: Grouping splits in Tez > 2016-06-01 13:48:08,353 [INFO] [InputInitializer {Map 2} #0] > |split.TezMapredSplitsGrouper|: Desired splits: 110 too large. Desired > splitLength: 109652 Min splitLength: 16777216 New desired splits: 1 Total > length: 12061817 Original splits: 52 > 2016-06-01 13:48:08,354 [INFO] [InputInitializer {Map 2} #0] > |split.TezMapredSplitsGrouper|: Desired numSplits: 1 lengthPerGroup: 12061817 > numLocations: 1 numSplitsPerLocation: 52 numSplitsInGroup: 52 totalLength: > 12061817 numOriginalSplits: 52 . Grouping by length: true count: false > 2016-06-01 13:48:08,354 [INFO] [InputInitializer {Map 2} #0] > |split.TezMapredSplitsGrouper|: Number of splits desired: 1 created: 1 > splitsProcessed: 52 > {noformat} > Alternate options: > == > 1. Force Hadoop to provide bogus locations for S3. But not sure, if that > would be accepted anytime soon. Ref: HADOOP-12878 > 2. Set {{tez.grouping.min-size}} to very very low value. But should the end > user always be doing this on query to query basis? > 3. When {{(lengthPerGroup < "tez.grouping.min-size")}}, recompute > desiredNumSplits only when number of distinct locations in the splits is > 1. > This would force more number of splits to be generated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
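Alternate option 3 in the description can be sketched as a small heuristic: only clamp the desired split count up to `tez.grouping.min-size` when the splits have more than one distinct location. This is a simplification of `TezMapredSplitsGrouper`'s real arithmetic, under the assumption that the single-location case should honor the requested parallelism:

```java
// Hedged sketch of the proposed heuristic, not Tez's actual implementation.
public class DesiredSplitsHeuristic {
    public static int desiredNumSplits(long totalLength, int requestedSplits,
                                       long minGroupSize, int distinctLocations) {
        long lengthPerGroup = totalLength / Math.max(1, requestedSplits);
        if (lengthPerGroup < minGroupSize && distinctLocations > 1) {
            // Normal clusters: don't create groups smaller than min-size.
            return (int) Math.max(1, totalLength / minGroupSize);
        }
        // Single-location (e.g. S3 "localhost") case: keep the requested
        // parallelism instead of collapsing everything into one group.
        return Math.max(1, requestedSplits);
    }
}
```

With the numbers from the quoted log (total length 12061817, 110 desired splits, 16 MB min-size, 1 location), this would keep 110 splits instead of collapsing to 1.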
[jira] [Commented] (TEZ-3271) Provide mapreduce failures.maxpercent equivalent
[ https://issues.apache.org/jira/browse/TEZ-3271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1536#comment-1536 ] Bikas Saha commented on TEZ-3271: - It would help if there were a bit more detail on what the objective is here. > Provide mapreduce failures.maxpercent equivalent > > > Key: TEZ-3271 > URL: https://issues.apache.org/jira/browse/TEZ-3271 > Project: Apache Tez > Issue Type: New Feature >Reporter: Jonathan Eagles >Assignee: Jonathan Eagles > Attachments: TEZ-3271.1.patch, TEZ-3271.2.patch, TEZ-3271.3.patch > > > mapreduce.map.failures.maxpercent > mapreduce.reduce.failures.maxpercent -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-3274) Vertex with MRInput and shuffle input does not respect slow start
[ https://issues.apache.org/jira/browse/TEZ-3274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15302549#comment-15302549 ] Bikas Saha commented on TEZ-3274: - There probably isn't. We could use this one. Or, if you need an urgent point fix in this jira, some scheduling heuristics could optionally be added to RootInputInitializer. Though I am not sure what exactly is happening: since these tasks also read data from HDFS, why would we not want them to start asap if there is spare capacity? Slow start also effectively tries to start tasks as soon as possible (in fact sooner than its inputs are ready, so I am not sure why it was called slow start when it could have been called eager start :) ). > Vertex with MRInput and shuffle input does not respect slow start > - > > Key: TEZ-3274 > URL: https://issues.apache.org/jira/browse/TEZ-3274 > Project: Apache Tez > Issue Type: Bug >Reporter: Jonathan Eagles > > Vertices with shuffle input and MRInput choose RootInputVertexManager (and > not ShuffleVertexManager) and start containers and tasks immediately. In this > scenario, resources can be wasted since they do not respect > tez.shuffle-vertex-manager.min-src-fraction > tez.shuffle-vertex-manager.max-src-fraction. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
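The slow-start behavior discussed in this thread can be pictured as a linear ramp between the min and max source fractions. This is an illustration of the idea behind `tez.shuffle-vertex-manager.min-src-fraction` and `tez.shuffle-vertex-manager.max-src-fraction`, not Tez's exact scheduling code:

```java
// Minimal sketch: the fraction of downstream tasks to schedule grows
// linearly as completed source tasks move from minFraction to maxFraction.
public class SlowStartPolicy {
    public static int tasksToSchedule(int totalTasks, int srcTotal, int srcCompleted,
                                      float minFraction, float maxFraction) {
        float completed = srcTotal == 0 ? 1.0f : (float) srcCompleted / srcTotal;
        if (completed < minFraction) {
            return 0; // nothing scheduled before the min threshold
        }
        if (completed >= maxFraction) {
            return totalTasks; // everything scheduled at/after the max threshold
        }
        // Linear ramp between min and max.
        float ramp = (completed - minFraction) / (maxFraction - minFraction);
        return Math.min(totalTasks, (int) Math.ceil(ramp * totalTasks));
    }
}
```

The bug in this JIRA is that RootInputVertexManager never consults such a policy, so all tasks are scheduled immediately.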
[jira] [Commented] (TEZ-3274) Vertex with MRInput and shuffle input does not respect slow start
[ https://issues.apache.org/jira/browse/TEZ-3274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15300753#comment-15300753 ] Bikas Saha commented on TEZ-3274: - This is a known limitation. The ideal solution is to split the VertexManager from a monolith into a composition of vertex modifiers and schedulers. The root input manager is a modifier. Auto-reduce is a modifier. Slow start is a scheduler. > Vertex with MRInput and shuffle input does not respect slow start > - > > Key: TEZ-3274 > URL: https://issues.apache.org/jira/browse/TEZ-3274 > Project: Apache Tez > Issue Type: Bug >Reporter: Jonathan Eagles > > Vertices with shuffle input and MRInput choose RootInputVertexManager (and > not ShuffleVertexManager) and start containers and tasks immediately. In this > scenario, resources can be wasted since they do not respect > tez.shuffle-vertex-manager.min-src-fraction > tez.shuffle-vertex-manager.max-src-fraction. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2950) Poor performance of UnorderedPartitionedKVWriter
[ https://issues.apache.org/jira/browse/TEZ-2950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15291917#comment-15291917 ] Bikas Saha commented on TEZ-2950: - bq. 2. Rely on pipelined shuffle to avoid the final merge. Per an old discussion with [~rajesh.balamohan], avoiding the final merge is independent of pipelined shuffle and could be enabled without it (this needs a code change though). Perhaps that is what you allude to in point 4. > Poor performance of UnorderedPartitionedKVWriter > > > Key: TEZ-2950 > URL: https://issues.apache.org/jira/browse/TEZ-2950 > Project: Apache Tez > Issue Type: Bug >Reporter: Rohini Palaniswamy >Assignee: Kuhu Shukla > Attachments: TEZ-2950.001_prelim.patch > > > Came across a job which was taking a long time in > UnorderedPartitionedKVWriter.mergeAll. It was decompressing and reading data > from spill files (8500 spills) and then writing the final compressed merge > file. Why do we need spill files for UnorderedPartitionedKVWriter? Why not > just buffer and keep directly writing to the final file which will save a lot > of time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-3222) Reduce messaging overhead for auto-reduce parallelism case
[ https://issues.apache.org/jira/browse/TEZ-3222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15290228#comment-15290228 ] Bikas Saha commented on TEZ-3222: - Thanks for the update! And sorry for the delayed response. {code}@@ -78,10 +78,10 @@ public class BroadcastEdgeManager extends EdgeManagerPluginOnDemand { } @Override - public EventRouteMetadata routeCompositeDataMovementEventToDestination( + public CompositeEventRouteMetadata routeCompositeDataMovementEventToDestination( int sourceTaskIndex, int destinationTaskIndex) throws Exception { -return commonRouteMeta[sourceTaskIndex]; +return CompositeEventRouteMetadata.create(1, sourceTaskIndex, 0); }{code} This should probably used the same caching logic instead creating new objects. {code} @@ -360,8 +360,8 @@ public class ShuffleVertexManager extends VertexManagerPlugin { partitionRange = basePartitionRange; } - return EventRouteMetadata.create(partitionRange, targetIndicesToSend, - sourceIndices[destinationTaskIndex]); + return CompositeEventRouteMetadata.create(partitionRange, targetIndicesToSend[0], + sourceIndices[destinationTaskIndex][0]); }{code} This is not clear to me. The main reason for array type in EventRouteMetadata is this auto-reduce edge manager case where a single source CDME expands to multiple DMEs for the same destination task where the expansion number is the number of partitions coalesced during auto-reduce. Hence its not clear how passing the first element in the array would work. If the above is true then perhaps we could look at adding EventRouteMetadata at a member of CDME cloned from the source for the destination. 
And in the destination, the CDME with route metadata gets expanded into DMEs in the same manner as the following code in Edge (which could be moved into a helper method on CDME) {code}- int numEvents = routeMeta.getNumEvents(); - int[] sourceIndices = routeMeta.getSourceIndices(); - int[] targetIndices = routeMeta.getTargetIndices(); - while (numEventsDone < numEvents && listSize++ < listMaxSize) { -DataMovementEvent e = compEvent.expand(sourceIndices[numEventsDone], -targetIndices[numEventsDone]); -numEventsDone++; -TezEvent tezEventToSend = new TezEvent(e, tezEvent.getSourceInfo(), -tezEvent.getEventReceivedTime()); -tezEventToSend.setDestinationInfo(destinationMetaInfo); -listToAdd.add(tezEventToSend); - }{code} This would also keep the API unchanged for the edge plugin. Does the above sound correct to you? I am looking at this code after a while and I may have gotten it all wrong :) The code change in all inputs look quite similar to each other. Any potential for common methods? > Reduce messaging overhead for auto-reduce parallelism case > -- > > Key: TEZ-3222 > URL: https://issues.apache.org/jira/browse/TEZ-3222 > Project: Apache Tez > Issue Type: Bug >Reporter: Jonathan Eagles >Assignee: Jonathan Eagles > Attachments: TEZ-3222.1.patch, TEZ-3222.2.patch, TEZ-3222.3.patch, > TEZ-3222.4.patch > > > A dag with 15k x 1000k vertex may auto-reduce to 15k x 1. And while the data > size is appropriate for 1 task attempt, this results in an increase in task > attempt message processing of 1000x. > This jira aims to reduce the message processing in the auto-reduced task > while keeping the amount of message processing in the AM the same or less. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
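The expansion loop quoted above from Edge can be summarized in a self-contained form: a composite data movement event fans out into one plain event per (source index, target index) pair carried by the route metadata. The class names here (`RouteMeta`, `Event`) are simplified stand-ins for Tez's `EventRouteMetadata` and `DataMovementEvent`:

```java
import java.util.*;

// Sketch of composite-event expansion for one destination task. In the
// auto-reduce case the index arrays have one entry per coalesced partition,
// which is why a single composite event fans out to multiple events.
public class CompositeEventExpander {
    public static class RouteMeta {
        final int[] sourceIndices;
        final int[] targetIndices;
        public RouteMeta(int[] src, int[] tgt) {
            this.sourceIndices = src;
            this.targetIndices = tgt;
        }
    }

    public static class Event {
        public final int sourceIndex;
        public final int targetIndex;
        Event(int s, int t) { sourceIndex = s; targetIndex = t; }
    }

    // Expand the composite event into per-pair events, mirroring the loop
    // in Edge that the comment above suggests moving into a helper on CDME.
    public static List<Event> expand(RouteMeta meta) {
        List<Event> out = new ArrayList<>();
        for (int i = 0; i < meta.sourceIndices.length; i++) {
            out.add(new Event(meta.sourceIndices[i], meta.targetIndices[i]));
        }
        return out;
    }
}
```

Doing this expansion in the destination task rather than in the AM is what keeps the edge-plugin API unchanged while cutting AM-side message processing.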
[jira] [Commented] (TEZ-3242) Reduce bytearray copy with TezEvent Serialization and deserialization
[ https://issues.apache.org/jira/browse/TEZ-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15280723#comment-15280723 ] Bikas Saha commented on TEZ-3242: - lgtm. > Reduce bytearray copy with TezEvent Serialization and deserialization > - > > Key: TEZ-3242 > URL: https://issues.apache.org/jira/browse/TEZ-3242 > Project: Apache Tez > Issue Type: Improvement >Reporter: Rohini Palaniswamy >Assignee: Rohini Palaniswamy > Fix For: 0.7.2, 0.8.4 > > Attachments: TEZ-3242-1.patch > > > Byte arrays are created for serializing protobuf messages and parsing them > which creates lot of garbage when we have lot of events. > {code} > java.lang.OutOfMemoryError: Java heap space > at java.util.Arrays.copyOf(Arrays.java:3236) > at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118) > at > java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) > at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153) > at java.io.DataOutputStream.write(DataOutputStream.java:107) > at java.io.FilterOutputStream.write(FilterOutputStream.java:97) > at > org.apache.tez.runtime.api.impl.TezEvent.serializeEvent(TezEvent.java:197) > at org.apache.tez.runtime.api.impl.TezEvent.write(TezEvent.java:268) > at > org.apache.tez.runtime.api.impl.TezHeartbeatResponse.write(TezHeartbeatResponse.java:95) > at > org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:202) > at > org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:128) > at org.apache.hadoop.io.ObjectWritable.write(ObjectWritable.java:82) > at org.apache.hadoop.ipc.Server.setupResponse(Server.java:2496) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-3244) Allow overlap of input and output memory when they are not concurrent
[ https://issues.apache.org/jira/browse/TEZ-3244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15274621#comment-15274621 ] Bikas Saha commented on TEZ-3244: - Nice idea! This definitely works for the case where the processor blocks on all the inputs before it outputs anything. Not sure if other users like Hive always have that behavior; e.g. it could block on one input and then stream through the other input while streaming to the output. Does this need an API on the processor that says which mode to use, because in the same job some vertices could need all inputs and outputs in parallel while others could allow inputs first and then outputs? How do we mitigate the risk, present in any such approach, that not giving memory to an input or output when it needs it causes it to malfunction or crash due to starvation? > Allow overlap of input and output memory when they are not concurrent > - > > Key: TEZ-3244 > URL: https://issues.apache.org/jira/browse/TEZ-3244 > Project: Apache Tez > Issue Type: Bug >Reporter: Jason Lowe >Assignee: Jason Lowe > Attachments: TEZ-3244.001.patch > > > For cases when memory for inputs and outputs are not needed simultaneously it > would be more efficient to allow inputs to use the memory normally set aside > for outputs and vice-versa. -- This message was sent by Aatlassian JIRA (v6.3.4#6332)
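The idea in this JIRA, together with the concern raised in the comment, can be sketched as a simple memory ledger that lets inputs borrow the output budget only while the phases are declared non-concurrent. All names below are illustrative; this is not Tez's memory-distributor API:

```java
// Hedged sketch: a ledger that overlaps input and output memory budgets
// when the processor's input and output phases do not run concurrently.
public class OverlappingMemoryLedger {
    private final long inputBudget;
    private final long outputBudget;
    private boolean phasesConcurrent; // set per-vertex, e.g. via a processor API

    public OverlappingMemoryLedger(long inputBudget, long outputBudget,
                                   boolean phasesConcurrent) {
        this.inputBudget = inputBudget;
        this.outputBudget = outputBudget;
        this.phasesConcurrent = phasesConcurrent;
    }

    // Memory available to inputs: their own budget, plus the output budget
    // when inputs and outputs are guaranteed not to run at the same time.
    public long availableForInputs() {
        return phasesConcurrent ? inputBudget : inputBudget + outputBudget;
    }

    // One mitigation for the starvation risk raised above: if the processor
    // turns out to stream to outputs while still reading, fall back to the
    // strict split so outputs keep their reserved memory.
    public void markConcurrent() {
        phasesConcurrent = true;
    }

    public long availableForOutputs() {
        return outputBudget; // outputs always retain at least their own share
    }
}
```

The open question from the comment remains: whether the mode is declared statically per vertex or detected at runtime.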
[jira] [Commented] (TEZ-3239) ShuffleVertexManager recovery issue when auto parallelism is enabled
[ https://issues.apache.org/jira/browse/TEZ-3239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15267913#comment-15267913 ] Bikas Saha commented on TEZ-3239: - Barring a bug, this should not be happening in the new recovery design. Thats because after a vertex has been reconfigured, the new AM attempt will start a NoOp Vertex Manager. > ShuffleVertexManager recovery issue when auto parallelism is enabled > > > Key: TEZ-3239 > URL: https://issues.apache.org/jira/browse/TEZ-3239 > Project: Apache Tez > Issue Type: Bug >Reporter: Ming Ma > > Repro: > * Enable {{tez.shuffle-vertex-manager.enable.auto-parallel}}. > * kill the Tez AM container after the job has reached to the point that VM > has reconfigured the Edge. > * The new Tez AM attempt will fail to the following error. > {noformat} > org.apache.tez.dag.api.TezUncheckedException: Atleast 1 bipartite source > should exist > at > org.apache.tez.dag.library.vertexmanager.ShuffleVertexManager.onVertexStarted(ShuffleVertexManager.java:497) > at > org.apache.tez.dag.app.dag.impl.VertexManager$VertexManagerEventOnVertexStarted.invoke(VertexManager.java:589) > at > org.apache.tez.dag.app.dag.impl.VertexManager$VertexManagerEvent$1.run(VertexManager.java:658) > at > org.apache.tez.dag.app.dag.impl.VertexManager$VertexManagerEvent$1.run(VertexManager.java:653) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > {noformat} > That is because the edge routing type changed to {{DataMovementType.CUSTOM}} > after reconfiguration. Allowing {{DataMovementType.CUSTOM}} in the following > check seems to fix the issue. > {noformat} > if (entry.getValue().getDataMovementType() == > DataMovementType.SCATTER_GATHER) { > bipartiteSources++; > } > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
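The fix suggested in the issue description can be sketched directly: count an edge as a bipartite source if it is SCATTER_GATHER or was rewritten to CUSTOM by auto-parallelism before the AM restarted. The enum and loop shape mirror the quoted ShuffleVertexManager snippet but are simplified here:

```java
// Simplified sketch of the bipartite-source check with the proposed fix.
public class BipartiteSourceCounter {
    public enum DataMovementType { SCATTER_GATHER, CUSTOM, BROADCAST, ONE_TO_ONE }

    public static int countBipartiteSources(Iterable<DataMovementType> edgeTypes) {
        int bipartiteSources = 0;
        for (DataMovementType t : edgeTypes) {
            // CUSTOM is accepted so a recovered, reconfigured edge still counts.
            if (t == DataMovementType.SCATTER_GATHER || t == DataMovementType.CUSTOM) {
                bipartiteSources++;
            }
        }
        return bipartiteSources;
    }
}
```

Per the comment above, the newer recovery design should avoid this path entirely by starting a no-op vertex manager for already-reconfigured vertices.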
[jira] [Commented] (TEZ-3203) DAG hangs when one of the upstream vertices has zero tasks
[ https://issues.apache.org/jira/browse/TEZ-3203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15261376#comment-15261376 ] Bikas Saha commented on TEZ-3203: - Now, looking at the full code based on the findbugs report, I think I don't know what I am talking about :). The last patch is not needed. Patch number 2 from Jason is good to go. > DAG hangs when one of the upstream vertices has zero tasks > -- > > Key: TEZ-3203 > URL: https://issues.apache.org/jira/browse/TEZ-3203 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.7.0 >Reporter: Jason Lowe >Assignee: Jason Lowe >Priority: Critical > Attachments: TEZ-3203.001.patch, TEZ-3203.002.patch, TEZ-3203.3.patch > > > A DAG hangs during execution if it has a vertex with multiple inputs and one > of those upstream vertices has zero tasks and is using ShuffleVertexManager. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-3203) DAG hangs when one of the upstream vertices has zero tasks
[ https://issues.apache.org/jira/browse/TEZ-3203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15261136#comment-15261136 ] Bikas Saha commented on TEZ-3203: - Uploaded new patch. Credit for the jira and patch goes to Jason entirely. > DAG hangs when one of the upstream vertices has zero tasks > -- > > Key: TEZ-3203 > URL: https://issues.apache.org/jira/browse/TEZ-3203 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.7.0 >Reporter: Jason Lowe >Assignee: Jason Lowe >Priority: Critical > Attachments: TEZ-3203.001.patch, TEZ-3203.002.patch, TEZ-3203.3.patch > > > A DAG hangs during execution if it has a vertex with multiple inputs and one > of those upstream vertices has zero tasks and is using ShuffleVertexManager. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-3203) DAG hangs when one of the upstream vertices has zero tasks
[ https://issues.apache.org/jira/browse/TEZ-3203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-3203: Attachment: TEZ-3203.3.patch > DAG hangs when one of the upstream vertices has zero tasks > -- > > Key: TEZ-3203 > URL: https://issues.apache.org/jira/browse/TEZ-3203 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.7.0 >Reporter: Jason Lowe >Assignee: Jason Lowe >Priority: Critical > Attachments: TEZ-3203.001.patch, TEZ-3203.002.patch, TEZ-3203.3.patch > > > A DAG hangs during execution if it has a vertex with multiple inputs and one > of those upstream vertices has zero tasks and is using ShuffleVertexManager. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-3203) DAG hangs when one of the upstream vertices has zero tasks
[ https://issues.apache.org/jira/browse/TEZ-3203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15261125#comment-15261125 ] Bikas Saha commented on TEZ-3203: - My bad. I should have been clearer. The following would be safer than removing the pendingTasks check altogether, to handle the (potentially impossible) case where pending tasks is still not initialized from its -1 value. I can make the change and post the final patch. {code} if (numBipartiteSourceTasksCompleted == totalNumBipartiteSourceTasks && numPendingTasks >= 0) { {code} > DAG hangs when one of the upstream vertices has zero tasks > -- > > Key: TEZ-3203 > URL: https://issues.apache.org/jira/browse/TEZ-3203 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.7.0 >Reporter: Jason Lowe >Assignee: Jason Lowe >Priority: Critical > Attachments: TEZ-3203.001.patch, TEZ-3203.002.patch > > > A DAG hangs during execution if it has a vertex with multiple inputs and one > of those upstream vertices has zero tasks and is using ShuffleVertexManager. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
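The guard discussed in this thread can be isolated as a small predicate: ramp up the remaining tasks only once all bipartite source tasks are done AND the pending-task count has been initialized (it starts at -1 before the managed vertex is configured). The surrounding class here is invented; the condition mirrors the snippet in the comment:

```java
// Sketch of the safer ramp-up condition from the comment above.
public class RampUpGuard {
    public static boolean shouldRampUp(int numBipartiteSourceTasksCompleted,
                                       int totalNumBipartiteSourceTasks,
                                       int numPendingTasks) {
        // ">= 0" (rather than "> 0") still ramps up when zero tasks are
        // pending -- the zero-task upstream case that caused the hang --
        // while skipping the uninitialized -1 state.
        return numBipartiteSourceTasksCompleted == totalNumBipartiteSourceTasks
            && numPendingTasks >= 0;
    }
}
```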
[jira] [Commented] (TEZ-2104) A CrossProductEdge which produces synthetic cross-product parallelism
[ https://issues.apache.org/jira/browse/TEZ-2104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15260681#comment-15260681 ] Bikas Saha commented on TEZ-2104: - bq. Sorry for the inconsistency. Slow start only make sense for partitioned case; for non-partitioned case, we launch a task only if its input is ready. Why so for non-partitioned? bq. While #connection is a issue in very large scale, having a grouping layer may not make it more scalable. Because everyone gets data from grouping nodes and grouping nodes may not have enough network bandwidth. +1. In my experience, aggregate trees are useful when there is a massive data reduction expected using the intermediate aggregation/combine operators. If not then the downside of redundant data copying is likely not useful compared to a carefully orchestrated sequence of connections that ensures limited and uniformly distributed load on data sources. Of course, not saying that we already have any heuristics to ensure limited and uniform load on sources now :P. If not, then that would be something to consider under this scenario because it significantly increases the connections compared to the current edges. TEZ-3209 would be good to have as something that could be used in the shuffle edge or cross edge. Wondering if its related to the cross edge idea of having filters that prune unwanted partitions, in the sense of removing or merging partitions - ie logical partition management, seems to be a unifying idea between them. On that note, if we figure out that some partitions are not needed then will we not create any tasks for them? I.e. this information is calculated up front before determining tasks? Is this available statically at compile time (provided to VM) or needed runtime information (calculated in VM)? 
> A CrossProductEdge which produces synthetic cross-product parallelism > - > > Key: TEZ-2104 > URL: https://issues.apache.org/jira/browse/TEZ-2104 > Project: Apache Tez > Issue Type: New Feature >Reporter: Gopal V >Assignee: Zhiyuan Yang > Labels: gsoc, gsoc2015, hadoop, hive, java, tez > Attachments: Cartesian product edge design.2.pdf, Cross product edge > design.pdf > > > Instead of producing duplicate data for the synthetic cross-product, to fit > into partitions, the amount of net IO can be vastly reduced by a special > purpose cross-product data movement edge. > The Shuffle edge routes each partition's output to a single reducer, while > the cross-product edge routes it into a matrix of reducers without actually > duplicating the disk data. > A partitioning scheme with 3 partitions on the lhs and rhs of a join > operation can be routed into 9 reducers by performing a cross-product similar > to > (1,2,3) x (a,b,c) = [(1,a), (1,b), (1,c), (2,a), (2,b) ...] > This turns a single task cross-product model into a distributed cross product. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
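The routing arithmetic implied by the description can be sketched directly: with L partitions on the lhs and R on the rhs, the (i, j) partition pair maps to the single reducer index i * R + j, so 3 x 3 partitions feed 9 reducers without duplicating source data. Method names are illustrative, not from the actual edge implementation:

```java
// Sketch of cross-product edge routing between partition pairs and reducers.
public class CrossProductRouting {
    // Map an (lhsPartition, rhsPartition) pair to its destination task index.
    public static int destinationTask(int lhsPartition, int rhsPartition,
                                      int numRhsPartitions) {
        return lhsPartition * numRhsPartitions + rhsPartition;
    }

    // Inverse mapping: which lhs partition a destination task reads.
    public static int lhsPartitionOf(int destinationTask, int numRhsPartitions) {
        return destinationTask / numRhsPartitions;
    }

    // Inverse mapping: which rhs partition a destination task reads.
    public static int rhsPartitionOf(int destinationTask, int numRhsPartitions) {
        return destinationTask % numRhsPartitions;
    }
}
```

Each source partition's data is read by a whole row (or column) of the reducer matrix, which is why the comment above worries about connection counts and load on sources.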
[jira] [Commented] (TEZ-3232) Disable randomFailingInputs in testFaulttolerance to unblock other tests
[ https://issues.apache.org/jira/browse/TEZ-3232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15258790#comment-15258790 ] Bikas Saha commented on TEZ-3232: - lgtm > Disable randomFailingInputs in testFaulttolerance to unblock other tests > - > > Key: TEZ-3232 > URL: https://issues.apache.org/jira/browse/TEZ-3232 > Project: Apache Tez > Issue Type: Bug >Reporter: Hitesh Shah >Assignee: Hitesh Shah > Attachments: TEZ-3232.1.patch > > > The randomFailingInputs test causes the AM to hit an error condition and fail > other tests. For now it should be disabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-3219) Allow service plugins to define log locations link for remotely run task attempts
[ https://issues.apache.org/jira/browse/TEZ-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15256846#comment-15256846 ] Bikas Saha commented on TEZ-3219: - Is there any issue in having the YARN based plugins provide the existing info via the new APIs instead of special casing them in the core code. This way the handling of all plugins is identical and makes the flow consistent and better for debugging and other issues. {code}-if (containerId != null && nodeHttpAddress != null) { - final String containerIdStr = containerId.toString(); - inProgressLogsUrl = nodeHttpAddress - + "/" + "node/containerlogs" - + "/" + containerIdStr - + "/" + this.appContext.getUser(); +if (getVertex().getServicePluginInfo().getContainerLauncherName().equals( + TezConstants.getTezYarnServicePluginName()) +|| getVertex().getServicePluginInfo().getContainerLauncherName().equals( + TezConstants.getTezUberServicePluginName())) { + if (containerId != null && nodeHttpAddress != null) { +final String containerIdStr = containerId.toString(); +inProgressLogsUrl = nodeHttpAddress ++ "/" + "node/containerlogs" ++ "/" + containerIdStr ++ "/" + this.appContext.getUser(); + } +} else { + inProgressLogsUrl = appContext.getTaskCommunicatorManager().getInProgressLogsUrl( + getVertex().getTaskCommunicatorIdentifier(), + attemptId, containerNodeId); }{code} > Allow service plugins to define log locations link for remotely run task > attempts > -- > > Key: TEZ-3219 > URL: https://issues.apache.org/jira/browse/TEZ-3219 > Project: Apache Tez > Issue Type: Bug >Reporter: Hitesh Shah >Assignee: Hitesh Shah > Fix For: 0.9.0, 0.8.4 > > Attachments: TEZ-3219.1.patch, TEZ-3219.2.patch, TEZ-3219.3.patch, > TEZ-3219.4.patch, TEZ-3219.5.patch > > > Today log links are generated based on the assumption that they are running > in yarn containers. For LLAP-like service plugin runs, the log links are > incorrect. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
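The uniform flow suggested in the comment above — every task communicator plugin, including the YARN-based ones, answering the same log-URL call so core code needs no special cases — can be sketched as follows. The interface and the YARN-style implementation are illustrative, not Tez's actual plugin API; the URL shape is taken from the quoted diff:

```java
// Sketch of a uniform log-URL plugin interface with a YARN-style provider.
public class LogUrlPlugins {
    public interface TaskLogUrlProvider {
        // Return null if no log URL can be produced for this attempt.
        String getInProgressLogsUrl(String containerId, String nodeHttpAddress,
                                    String user);
    }

    // YARN-style provider reproducing the URL shape from the quoted diff;
    // core code would call this uniformly instead of special-casing YARN.
    public static class YarnLogUrlProvider implements TaskLogUrlProvider {
        @Override
        public String getInProgressLogsUrl(String containerId,
                                           String nodeHttpAddress, String user) {
            if (containerId == null || nodeHttpAddress == null) {
                return null;
            }
            return nodeHttpAddress + "/node/containerlogs/" + containerId + "/" + user;
        }
    }
}
```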
[jira] [Comment Edited] (TEZ-3222) Reduce messaging overhead for auto-reduce parallelism case
[ https://issues.apache.org/jira/browse/TEZ-3222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252975#comment-15252975 ] Bikas Saha edited comment on TEZ-3222 at 4/21/16 11:12 PM: --- ShuffleVertexManager, in theory, is a user land object and packaged in tez-runtime-library (along with other user land Inputs and Outputs). Hence, leaking ShuffleVertexManager into the framework DAG engine is crossing the border between user and system. I am afraid we may need to think about a different approach. {code}-if (routeMeta != null) { +if (edgeManagerOnDemand instanceof CustomShuffleEdgeManager) { + EnhancedDataMovementEvent edme = compEvent.expandEnhanced(srcTaskIndex, taskIndex, edgeManagerOnDemand.getNumDestinationTaskPhysicalInputs(0) / edgeManagerOnDemand.getContext().getSourceVertexNumTasks(), edgeManagerOnDemand.getNumDestinationTaskPhysicalInputs(taskIndex) / {code} Throttling of events being fetched by the input such that we dont get everything at once alleviated some issues. Is that related to this jira? If yes, what is the current jira trying to fix beyond the throttling mitigation? Just to throw some light on the criticality. was (Author: bikassaha): ShuffleVertexManager, in theory, is a user land object and packaged in tez-runtime-library (along with other user land Inputs and Outputs). Hence, leaking ShuffleVertexManager into the framework DAG engine is crossing the border between user and system. I am afraid we may need to think about a different approach. {code}-if (routeMeta != null) { +if (edgeManagerOnDemand instanceof CustomShuffleEdgeManager) { + EnhancedDataMovementEvent edme = compEvent.expandEnhanced(srcTaskIndex, taskIndex, edgeManagerOnDemand.getNumDestinationTaskPhysicalInputs(0) / edgeManagerOnDemand.getContext().getSourceVertexNumTasks(), edgeManagerOnDemand.getNumDestinationTaskPhysicalInputs(taskIndex) / {code}. 
Throttling of events being fetched by the input such that we dont get everything at once alleviated some issues. Is that related to this jira? If yes, what is the current jira trying to fix beyond the throttling mitigation? Just to throw some light on the criticality. > Reduce messaging overhead for auto-reduce parallelism case > -- > > Key: TEZ-3222 > URL: https://issues.apache.org/jira/browse/TEZ-3222 > Project: Apache Tez > Issue Type: Bug >Reporter: Jonathan Eagles >Assignee: Jonathan Eagles > Attachments: TEZ-3222.1.patch > > > A dag with 15k x 1000k vertex may auto-reduce to 15k x 1. And while the data > size is appropriate for 1 task attempt, this results in an increase in task > attempt message processing of 1000x. > This jira aims to reduce the message processing in the auto-reduced task > while keeping the amount of message processing in the AM the same or less. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-3222) Reduce messaging overhead for auto-reduce parallelism case
[ https://issues.apache.org/jira/browse/TEZ-3222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252975#comment-15252975 ] Bikas Saha commented on TEZ-3222: - ShuffleVertexManager, in theory, is a user land object and packaged in tez-runtime-library (along with other user land Inputs and Outputs). Hence, leaking ShuffleVertexManager into the framework DAG engine is crossing the border between user and system. I am afraid we may need to think about a different approach. {code}-if (routeMeta != null) { +if (edgeManagerOnDemand instanceof CustomShuffleEdgeManager) { + EnhancedDataMovementEvent edme = compEvent.expandEnhanced(srcTaskIndex, taskIndex, edgeManagerOnDemand.getNumDestinationTaskPhysicalInputs(0) / edgeManagerOnDemand.getContext().getSourceVertexNumTasks(), edgeManagerOnDemand.getNumDestinationTaskPhysicalInputs(taskIndex) / {code}. Throttling of events being fetched by the input such that we dont get everything at once alleviated some issues. Is that related to this jira? If yes, what is the current jira trying to fix beyond the throttling mitigation? Just to throw some light on the criticality. > Reduce messaging overhead for auto-reduce parallelism case > -- > > Key: TEZ-3222 > URL: https://issues.apache.org/jira/browse/TEZ-3222 > Project: Apache Tez > Issue Type: Bug >Reporter: Jonathan Eagles >Assignee: Jonathan Eagles > Attachments: TEZ-3222.1.patch > > > A dag with 15k x 1000k vertex may auto-reduce to 15k x 1. And while the data > size is appropriate for 1 task attempt, this results in an increase in task > attempt message processing of 1000x. > This jira aims to reduce the message processing in the auto-reduced task > while keeping the amount of message processing in the AM the same or less. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-3203) DAG hangs when one of the upstream vertices has zero tasks
[ https://issues.apache.org/jira/browse/TEZ-3203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15231460#comment-15231460 ] Bikas Saha commented on TEZ-3203: - Good catch! Maybe we can get away with removing the numPendingTasks check here. I am worried that doing it earlier may be susceptible to calling scheduleTasks() multiple times. {code}if (numBipartiteSourceTasksCompleted == totalNumBipartiteSourceTasks && numPendingTasks > 0) { LOG.info("All source tasks assigned. " + "Ramping up " + numPendingTasks + " remaining tasks for vertex: " + getContext().getVertexName()); schedulePendingTasks(numPendingTasks, 1); return; }{code} > DAG hangs when one of the upstream vertices has zero tasks > -- > > Key: TEZ-3203 > URL: https://issues.apache.org/jira/browse/TEZ-3203 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.7.0 >Reporter: Jason Lowe >Assignee: Jason Lowe >Priority: Critical > Attachments: TEZ-3203.001.patch > > > A DAG hangs during execution if it has a vertex with multiple inputs and one > of those upstream vertices has zero tasks and is using ShuffleVertexManager. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
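The concern in the comment above (that moving the completion check earlier could trigger scheduleTasks() more than once) is essentially an idempotence problem. A minimal sketch of a one-shot guard, using illustrative names rather than actual ShuffleVertexManager members:

```java
// A minimal, hypothetical sketch of a one-shot scheduling guard; these are
// illustrative names, not actual ShuffleVertexManager members.
class RampUpGuard {
  private boolean rampedUp = false; // set once the full ramp-up has fired
  private int scheduleCalls = 0;    // how many times scheduling actually ran

  // Returns true only on the first call where all source tasks are complete,
  // so a full schedulePendingTasks() cannot be triggered twice even if the
  // completion condition is re-evaluated on later source-task events.
  boolean tryScheduleAll(int completedSourceTasks, int totalSourceTasks) {
    if (completedSourceTasks == totalSourceTasks && !rampedUp) {
      rampedUp = true;
      scheduleCalls++;
      return true;
    }
    return false;
  }

  int getScheduleCalls() {
    return scheduleCalls;
  }

  public static void main(String[] args) {
    RampUpGuard g = new RampUpGuard();
    System.out.println(g.tryScheduleAll(5, 10));  // false: not all complete
    System.out.println(g.tryScheduleAll(10, 10)); // true: ramp up once
    System.out.println(g.tryScheduleAll(10, 10)); // false: already ramped up
  }
}
```

With such a guard, the check need not depend on numPendingTasks being nonzero to stay safe against repeated completion events.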
[jira] [Commented] (TEZ-3198) Shuffle failures for the trailing task in a vertex are often fatal to the entire DAG
[ https://issues.apache.org/jira/browse/TEZ-3198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15230773#comment-15230773 ] Bikas Saha commented on TEZ-3198: - Yes. Looks like our defaults can be better for real life workloads. > Shuffle failures for the trailing task in a vertex are often fatal to the > entire DAG > > > Key: TEZ-3198 > URL: https://issues.apache.org/jira/browse/TEZ-3198 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.7.0, 0.8.2 >Reporter: Jason Lowe >Priority: Critical > > I've seen an increasing number of cases where a single-node failure caused > the whole Tez DAG to fail. These scenarios are common in that they involve > the last task of a vertex attempting to complete a shuffle where all the peer > tasks have already finished shuffling. The last task's attempt encounters > errors shuffling one of its inputs and keeps reporting it to the AM. > Eventually the attempt decides it must be the cause of the shuffle error and > fails. The subsequent attempts all do the same thing, and eventually we hit > the task max attempts limit and fail the vertex and DAG. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-3161) Allow task to report different kinds of errors - fatal / kill
[ https://issues.apache.org/jira/browse/TEZ-3161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227749#comment-15227749 ] Bikas Saha commented on TEZ-3161: - Does a fatal error affect the recovery code path? E.g. the fatal error got stored but the DAG failure did not get stored. What happens in recovery? Should the DAG fail after recovery because the task's fatal error was recovered? Likely yes, but does it work? Please ignore this comment in case it's already covered. > Allow task to report different kinds of errors - fatal / kill > - > > Key: TEZ-3161 > URL: https://issues.apache.org/jira/browse/TEZ-3161 > Project: Apache Tez > Issue Type: Improvement >Reporter: Siddharth Seth >Assignee: Siddharth Seth > Fix For: 0.8.3 > > Attachments: TEZ-3161.1.txt, TEZ-3161.2.txt, TEZ-3161.3.txt, > TEZ-3161.4.txt, TEZ-3161.5.txt, TEZ-3161.6.txt > > > In some cases, task failures will be the same across all attempts - e.g. > exceeding memory utilization on an operation. In this case, there's no point > in running another attempt of the same task. > There are other cases where a task may want to mark itself as KILLED - i.e. a > temporary error. An example of this is pipelined shuffle. > Tez should allow both operations. > cc [~vikram.dixit], [~rajesh.balamohan] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-3198) Shuffle failures for the trailing task in a vertex are often fatal to the entire DAG
[ https://issues.apache.org/jira/browse/TEZ-3198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227137#comment-15227137 ] Bikas Saha commented on TEZ-3198: - Yeah. Looks like a gap in that heuristic. Maybe a test for this case would help when we make the next heuristic update. > Shuffle failures for the trailing task in a vertex are often fatal to the > entire DAG > > > Key: TEZ-3198 > URL: https://issues.apache.org/jira/browse/TEZ-3198 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.7.0, 0.8.2 >Reporter: Jason Lowe >Priority: Critical > Fix For: 0.7.1, 0.8.3 > > > I've seen an increasing number of cases where a single-node failure caused > the whole Tez DAG to fail. These scenarios are common in that they involve > the last task of a vertex attempting to complete a shuffle where all the peer > tasks have already finished shuffling. The last task's attempt encounters > errors shuffling one of its inputs and keeps reporting it to the AM. > Eventually the attempt decides it must be the cause of the shuffle error and > fails. The subsequent attempts all do the same thing, and eventually we hit > the task max attempts limit and fail the vertex and DAG. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-3193) Deadlock in AM during task commit request
[ https://issues.apache.org/jira/browse/TEZ-3193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15221046#comment-15221046 ] Bikas Saha commented on TEZ-3193: - This is probably a leftover from the removal of such reverse calls. There were more of them, and some were removed by making sure that such objects/members are available locally to the TaskAttemptImpl (from the Task passed in via the constructor) instead of calling back into the task to get them. Hence, the task location hint and taskSpec could be passed in via the constructor and referenced locally. Doing this helps other future scenarios as well. If the TA location hint is passed in via the constructor then it could be made different for each attempt. E.g. remove the machine for v.1 from the location hint of v.2 for a speculative execution so that the speculated attempt does not end up on the same machine. There is a jira open for this. Similarly, change the spec of v.1 to have higher memory than the default for that vertex because v.0 died with OOM. > Deadlock in AM during task commit request > - > > Key: TEZ-3193 > URL: https://issues.apache.org/jira/browse/TEZ-3193 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.7.1, 0.8.2 >Reporter: Jason Lowe >Priority: Blocker > > The AM can deadlock between TaskImpl and TaskAttemptImpl. Stacktrace and > details in a followup comment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
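The constructor-injection idea described in the comment above (computing per-attempt state up front and passing it in, rather than the attempt calling back into its parent task) can be sketched as follows; all class and method names here are hypothetical, not actual Tez classes:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: derive a per-attempt location hint before constructing
// the attempt, instead of the attempt calling back into the task for it.
class AttemptLocationHints {
  // Drop the machines used by earlier attempts (e.g. the node running v.1)
  // so a speculative attempt v.2 does not land on the same machine.
  static List<String> hintForNextAttempt(List<String> baseHint, List<String> nodesToAvoid) {
    List<String> hint = new ArrayList<>(baseHint);
    hint.removeAll(nodesToAvoid);
    return hint;
  }

  public static void main(String[] args) {
    List<String> base = List.of("node1", "node2", "node3");
    // Speculative attempt avoids the machine running the original attempt.
    System.out.println(hintForNextAttempt(base, List.of("node1"))); // [node2, node3]
  }
}
```

The same pattern extends to the taskSpec: a retry attempt could be constructed with a larger memory request when the prior attempt died with OOM.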
[jira] [Commented] (TEZ-2442) Support DFS based shuffle in addition to HTTP shuffle
[ https://issues.apache.org/jira/browse/TEZ-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208891#comment-15208891 ] Bikas Saha commented on TEZ-2442: - We typically use fs instead of dfs and DistributedFileSystem is actually the name of the HDFS impl of the FileSystem API. > Support DFS based shuffle in addition to HTTP shuffle > - > > Key: TEZ-2442 > URL: https://issues.apache.org/jira/browse/TEZ-2442 > Project: Apache Tez > Issue Type: Improvement >Affects Versions: 0.5.3 >Reporter: Kannan Rajah >Assignee: Kannan Rajah > Attachments: HDFS_based_shuffle_v2.pdf, Tez Shuffle using DFS.pdf, > hdfs_broadcast_hack.txt, tez_hdfs_shuffle.patch > > > In Tez, Shuffle is a mechanism by which intermediate data can be shared > between stages. Shuffle data is written to local disk and fetched from any > remote node using HTTP. A DFS like MapR file system can support writing this > shuffle data directly to its DFS using a notion of local volumes and retrieve > it using HDFS API from remote node. The current Shuffle implementation > assumes local data can only be managed by LocalFileSystem. So it uses > RawLocalFileSystem and LocalDirAllocator. If we can remove this assumption > and introduce an abstraction to manage local disks, then we can reuse most of > the shuffle logic (store, sort) and inject a HDFS API based retrieval instead > of HTTP. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2442) Support DFS based shuffle in addition to HTTP shuffle
[ https://issues.apache.org/jira/browse/TEZ-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1520#comment-1520 ] Bikas Saha commented on TEZ-2442: - IIRC, this is the same for both kinds of shuffle, because consumers can fetch and merge spills in a pipelined manner as they receive the DME for each spilled output. The physical fetch method (HTTP or FS) is likely not relevant. [~rajesh.balamohan] can correct me if this is inaccurate. > Support DFS based shuffle in addition to HTTP shuffle > - > > Key: TEZ-2442 > URL: https://issues.apache.org/jira/browse/TEZ-2442 > Project: Apache Tez > Issue Type: Improvement >Affects Versions: 0.5.3 >Reporter: Kannan Rajah >Assignee: Kannan Rajah > Attachments: HDFS_based_shuffle_v2.pdf, Tez Shuffle using DFS.pdf, > hdfs_broadcast_hack.txt, tez_hdfs_shuffle.patch > > > In Tez, Shuffle is a mechanism by which intermediate data can be shared > between stages. Shuffle data is written to local disk and fetched from any > remote node using HTTP. A DFS like MapR file system can support writing this > shuffle data directly to its DFS using a notion of local volumes and retrieve > it using HDFS API from remote node. The current Shuffle implementation > assumes local data can only be managed by LocalFileSystem. So it uses > RawLocalFileSystem and LocalDirAllocator. If we can remove this assumption > and introduce an abstraction to manage local disks, then we can reuse most of > the shuffle logic (store, sort) and inject a HDFS API based retrieval instead > of HTTP. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2442) Support DFS based shuffle in addition to HTTP shuffle
[ https://issues.apache.org/jira/browse/TEZ-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15207668#comment-15207668 ] Bikas Saha commented on TEZ-2442: - It's important to keep in mind that for significant perf gains, the final merge of the output could be avoided, so the output would live as separate files. It would be good for the design to allow for this improvement in the future, i.e. allow multiple final output files to be written. [~rajesh.balamohan] > Support DFS based shuffle in addition to HTTP shuffle > - > > Key: TEZ-2442 > URL: https://issues.apache.org/jira/browse/TEZ-2442 > Project: Apache Tez > Issue Type: Improvement >Affects Versions: 0.5.3 >Reporter: Kannan Rajah >Assignee: Kannan Rajah > Attachments: HDFS_based_shuffle_v2.pdf, Tez Shuffle using DFS.pdf, > hdfs_broadcast_hack.txt, tez_hdfs_shuffle.patch > > > In Tez, Shuffle is a mechanism by which intermediate data can be shared > between stages. Shuffle data is written to local disk and fetched from any > remote node using HTTP. A DFS like MapR file system can support writing this > shuffle data directly to its DFS using a notion of local volumes and retrieve > it using HDFS API from remote node. The current Shuffle implementation > assumes local data can only be managed by LocalFileSystem. So it uses > RawLocalFileSystem and LocalDirAllocator. If we can remove this assumption > and introduce an abstraction to manage local disks, then we can reuse most of > the shuffle logic (store, sort) and inject a HDFS API based retrieval instead > of HTTP. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TEZ-2442) Support DFS based shuffle in addition to HTTP shuffle
[ https://issues.apache.org/jira/browse/TEZ-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15207647#comment-15207647 ] Bikas Saha edited comment on TEZ-2442 at 3/23/16 12:59 AM: --- Should this config instead be something that specifies the class name used to do the final read/write or the filesystem scheme to use instead of hard coding hdfs? Then we could specify RawLocalImpl/HDFSImpl/WASBImpl/S3Impl or local/hdfs/wasb/s3. Of course that would depend on the impl :) was (Author: bikassaha): Should this config instead be something that specifies the class name used to do the final read/write or the filesystem scheme to use instead of hard coding hdfs? Then we could specify RawLocalImpl/HDFSImpl/WASBImpl/S3Impl or local/hdfs/wasb/s3 > Support DFS based shuffle in addition to HTTP shuffle > - > > Key: TEZ-2442 > URL: https://issues.apache.org/jira/browse/TEZ-2442 > Project: Apache Tez > Issue Type: Improvement >Affects Versions: 0.5.3 >Reporter: Kannan Rajah >Assignee: Kannan Rajah > Attachments: HDFS_based_shuffle_v2.pdf, Tez Shuffle using DFS.pdf, > hdfs_broadcast_hack.txt, tez_hdfs_shuffle.patch > > > In Tez, Shuffle is a mechanism by which intermediate data can be shared > between stages. Shuffle data is written to local disk and fetched from any > remote node using HTTP. A DFS like MapR file system can support writing this > shuffle data directly to its DFS using a notion of local volumes and retrieve > it using HDFS API from remote node. The current Shuffle implementation > assumes local data can only be managed by LocalFileSystem. So it uses > RawLocalFileSystem and LocalDirAllocator. If we can remove this assumption > and introduce an abstraction to manage local disks, then we can reuse most of > the shuffle logic (store, sort) and inject a HDFS API based retrieval instead > of HTTP. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2442) Support DFS based shuffle in addition to HTTP shuffle
[ https://issues.apache.org/jira/browse/TEZ-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15207647#comment-15207647 ] Bikas Saha commented on TEZ-2442: - Should this config instead be something that specifies the class name used to do the final read/write or the filesystem scheme to use instead of hard coding hdfs? Then we could specify RawLocalImpl/HDFSImpl/WASBImpl/S3Impl or local/hdfs/wasb/s3 > Support DFS based shuffle in addition to HTTP shuffle > - > > Key: TEZ-2442 > URL: https://issues.apache.org/jira/browse/TEZ-2442 > Project: Apache Tez > Issue Type: Improvement >Affects Versions: 0.5.3 >Reporter: Kannan Rajah >Assignee: Kannan Rajah > Attachments: HDFS_based_shuffle_v2.pdf, Tez Shuffle using DFS.pdf, > hdfs_broadcast_hack.txt, tez_hdfs_shuffle.patch > > > In Tez, Shuffle is a mechanism by which intermediate data can be shared > between stages. Shuffle data is written to local disk and fetched from any > remote node using HTTP. A DFS like MapR file system can support writing this > shuffle data directly to its DFS using a notion of local volumes and retrieve > it using HDFS API from remote node. The current Shuffle implementation > assumes local data can only be managed by LocalFileSystem. So it uses > RawLocalFileSystem and LocalDirAllocator. If we can remove this assumption > and introduce an abstraction to manage local disks, then we can reuse most of > the shuffle logic (store, sort) and inject a HDFS API based retrieval instead > of HTTP. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
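The scheme-keyed configuration suggested in the comment above could look roughly like the sketch below. The property name "tez.shuffle.fs.scheme" and the impl class names are invented for illustration; they are not real Tez configuration keys:

```java
import java.util.Map;

// Hedged sketch: resolve the shuffle filesystem implementation from a config
// value instead of hard-coding hdfs. All names here are hypothetical.
class ShuffleFsResolver {
  // Hypothetical mapping from filesystem scheme to an implementation class.
  private static final Map<String, String> IMPLS = Map.of(
      "local", "RawLocalImpl",
      "hdfs", "HDFSImpl",
      "wasb", "WASBImpl",
      "s3", "S3Impl");

  // Default to local disk when the (hypothetical) key is unset.
  static String resolve(Map<String, String> conf) {
    String scheme = conf.getOrDefault("tez.shuffle.fs.scheme", "local");
    String impl = IMPLS.get(scheme);
    if (impl == null) {
      throw new IllegalArgumentException("Unknown shuffle fs scheme: " + scheme);
    }
    return impl;
  }

  public static void main(String[] args) {
    System.out.println(resolve(Map.of("tez.shuffle.fs.scheme", "wasb"))); // WASBImpl
    System.out.println(resolve(Map.of())); // RawLocalImpl (default)
  }
}
```

In practice, Hadoop's FileSystem abstraction already resolves implementations by URI scheme, so a config carrying just the scheme (local/hdfs/wasb/s3) may be enough.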
[jira] [Commented] (TEZ-3181) History parser : Handle invalid/unsupported history event types gracefully
[ https://issues.apache.org/jira/browse/TEZ-3181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15205820#comment-15205820 ] Bikas Saha commented on TEZ-3181: - I understand. But after that, when this incomplete data is passed to 0.8 analyzers, how can we expect them to work correctly? My concern is that consumers of this data may not handle such dropped data and instead depend on the parser to ensure that the data is valid. Dropping this event would make the data invalid. Perhaps instead of dropping it, we could translate it into something that makes sense on the 0.8 side, but that would need versioning via TEZ-3179. Does this make sense or am I missing something? :) I am not sure how making the parser succeed would be an end goal by itself since the parsed data is going to be consumed by analyzers. > History parser : Handle invalid/unsupported history event types gracefully > -- > > Key: TEZ-3181 > URL: https://issues.apache.org/jira/browse/TEZ-3181 > Project: Apache Tez > Issue Type: Bug >Reporter: Rajesh Balamohan >Assignee: Rajesh Balamohan > Attachments: TEZ-3181.1.patch > > > TEZ-2581 changed/renamed some of HistoryEventType. This causes the parser to > throw an exception when trying to parse 0.7.x ATS data with the 0.8.x parser.
> {noformat} > Exception in thread "main" java.lang.IllegalArgumentException: No enum > constant > org.apache.tez.dag.history.HistoryEventType.VERTEX_PARALLELISM_UPDATED >at java.lang.Enum.valueOf(Enum.java:238) >at > org.apache.tez.dag.history.HistoryEventType.valueOf(HistoryEventType.java:21) >at > org.apache.tez.history.parser.datamodel.VertexInfo.<init>(VertexInfo.java:117) >at > org.apache.tez.history.parser.datamodel.VertexInfo.create(VertexInfo.java:159) >at > org.apache.tez.history.parser.ATSFileParser.processVertices(ATSFileParser.java:98) >at > org.apache.tez.history.parser.ATSFileParser.parseATSZipFile(ATSFileParser.java:202) >at > org.apache.tez.history.parser.ATSFileParser.getDAGData(ATSFileParser.java:70) > {noformat} > Long term fix is to have versioning support (TEZ-3179) in ATS data. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-3181) History parser : Handle invalid/unsupported history event types gracefully
[ https://issues.apache.org/jira/browse/TEZ-3181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15205794#comment-15205794 ] Bikas Saha commented on TEZ-3181: - Do we need this? Could we use the 0.7 parser for 0.7 jobs and the 0.8 parser for 0.8 jobs? My concern is that we use the 0.8 parser, ignore some fields that are needed while parsing, and then the analyzers will fail or, worse, produce wrong results. This could be because analyzers are expecting a certain structure and that has changed in the data. > History parser : Handle invalid/unsupported history event types gracefully > -- > > Key: TEZ-3181 > URL: https://issues.apache.org/jira/browse/TEZ-3181 > Project: Apache Tez > Issue Type: Bug >Reporter: Rajesh Balamohan >Assignee: Rajesh Balamohan > Attachments: TEZ-3181.1.patch > > > TEZ-2581 changed/renamed some of HistoryEventType. This causes the parser to > throw an exception when trying to parse 0.7.x ATS data with the 0.8.x parser. > {noformat} > Exception in thread "main" java.lang.IllegalArgumentException: No enum > constant > org.apache.tez.dag.history.HistoryEventType.VERTEX_PARALLELISM_UPDATED >at java.lang.Enum.valueOf(Enum.java:238) >at > org.apache.tez.dag.history.HistoryEventType.valueOf(HistoryEventType.java:21) >at > org.apache.tez.history.parser.datamodel.VertexInfo.<init>(VertexInfo.java:117) >at > org.apache.tez.history.parser.datamodel.VertexInfo.create(VertexInfo.java:159) >at > org.apache.tez.history.parser.ATSFileParser.processVertices(ATSFileParser.java:98) >at > org.apache.tez.history.parser.ATSFileParser.parseATSZipFile(ATSFileParser.java:202) >at > org.apache.tez.history.parser.ATSFileParser.getDAGData(ATSFileParser.java:70) > {noformat} > Long term fix is to have versioning support (TEZ-3179) in ATS data. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
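The "graceful handling" this jira asks for amounts to tolerating event-type names that the current enum no longer defines, instead of letting Enum.valueOf throw as in the stack trace above. A minimal sketch with a stand-in enum (not the real org.apache.tez HistoryEventType):

```java
// Minimal sketch of graceful event-type parsing: return null for names the
// current enum no longer defines instead of letting Enum.valueOf throw.
// EventType is a stand-in, not the real Tez HistoryEventType.
class SafeEventParse {
  enum EventType { VERTEX_CONFIGURE_DONE, TASK_STARTED }

  // A 0.7-era name such as VERTEX_PARALLELISM_UPDATED yields null, letting
  // the caller decide whether to skip the event or translate it.
  static EventType parseOrNull(String name) {
    try {
      return EventType.valueOf(name);
    } catch (IllegalArgumentException e) {
      return null;
    }
  }

  public static void main(String[] args) {
    System.out.println(parseOrNull("TASK_STARTED"));               // TASK_STARTED
    System.out.println(parseOrNull("VERTEX_PARALLELISM_UPDATED")); // null
  }
}
```

As the comments note, whether the caller should skip such an event or translate it into a 0.8-era equivalent is the real design question, and translation would need the versioning proposed in TEZ-3179.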
[jira] [Commented] (TEZ-3168) Provide a more predictable approach for total resource guidance for wave/split calculation
[ https://issues.apache.org/jira/browse/TEZ-3168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15200619#comment-15200619 ] Bikas Saha commented on TEZ-3168: - For all of the problems with queue capacity, IMO cluster capacity is a more stable metric to look at. Logically, the data is distributed across the cluster, so the split calculation should account for that dispersion. This also solves the current immediate problem of creating too-small splits. Essentially the job wants to run tasks across all cluster nodes. The queue capacity determines how the job gets waves/windows of tasks that move around the cluster to read that data locally. > Provide a more predictable approach for total resource guidance for > wave/split calculation > --- > > Key: TEZ-3168 > URL: https://issues.apache.org/jira/browse/TEZ-3168 > Project: Apache Tez > Issue Type: Bug >Reporter: Hitesh Shah >Assignee: Hitesh Shah > Attachments: TEZ-3168.wip.2.patch, TEZ-3168.wip.patch > > > Currently, Tez uses headroom for checking total available resources. This is > flaky as it ends up causing the split count to be determined by a point in > time lookup at what is available in the cluster. A better approach would be > either the queue size or even cluster size to get a more predictable count. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-3164) Surface error histograms from the AM
Bikas Saha created TEZ-3164: --- Summary: Surface error histograms from the AM Key: TEZ-3164 URL: https://issues.apache.org/jira/browse/TEZ-3164 Project: Apache Tez Issue Type: Improvement Reporter: Bikas Saha Job tasks are constantly probing the cluster. So if there are some issues in the cluster, jobs would be the first to notice them. If we can surface these observations to the user, we could quickly identify cluster issues. Let's say a set of bad machines got added to the cluster and tasks started seeing shuffle errors from those machines. This can slow down or hang the job. If the AM can surface increased error counts from source and destination machines, then that could pinpoint the bad machines, versus having to arrive at those machines from first principles and log searching. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
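The per-machine error counting TEZ-3164 describes can be sketched as a simple histogram keyed by host; the class and method names below are hypothetical, not an actual Tez API:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative sketch of the histogram idea: the AM counts shuffle-error
// reports per host so outlier machines stand out. Names are hypothetical.
class ShuffleErrorHistogram {
  private final Map<String, Integer> errorsByHost = new HashMap<>();

  // Called whenever a task reports a shuffle error against a source host.
  void recordError(String host) {
    errorsByHost.merge(host, 1, Integer::sum);
  }

  // Hosts reported more often than the threshold are candidate bad machines.
  List<String> hostsAbove(int threshold) {
    return errorsByHost.entrySet().stream()
        .filter(e -> e.getValue() > threshold)
        .map(Map.Entry::getKey)
        .sorted()
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    ShuffleErrorHistogram h = new ShuffleErrorHistogram();
    h.recordError("badnode1");
    h.recordError("badnode1");
    h.recordError("goodnode");
    System.out.println(h.hostsAbove(1)); // [badnode1]
  }
}
```

A real implementation would also need decay or windowing so transient errors don't permanently blacklist a node, but the aggregation itself is this simple.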
[jira] [Commented] (TEZ-3085) In session mode, the credentials passed via the Tez client constructor is not available to all the tasks
[ https://issues.apache.org/jira/browse/TEZ-3085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181432#comment-15181432 ] Bikas Saha commented on TEZ-3085: - Yes. And it looks like it's already mentioned in the first comment of this jira. > In session mode, the credentials passed via the Tez client constructor is not > available to all the tasks > > > Key: TEZ-3085 > URL: https://issues.apache.org/jira/browse/TEZ-3085 > Project: Apache Tez > Issue Type: Bug >Reporter: Vinoth Sathappan > > The credentials passed through the Tez client constructor isn't available for > the tasks in session mode. > TezClient(String name, TezConfiguration tezConf, > @Nullable Map localResources, > @Nullable Credentials credentials) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-3085) In session mode, the credentials passed via the Tez client constructor is not available to all the tasks
[ https://issues.apache.org/jira/browse/TEZ-3085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181338#comment-15181338 ] Bikas Saha commented on TEZ-3085: - IIRC, didn't we recently start passing AM credentials to the DAG? > In session mode, the credentials passed via the Tez client constructor is not > available to all the tasks > > > Key: TEZ-3085 > URL: https://issues.apache.org/jira/browse/TEZ-3085 > Project: Apache Tez > Issue Type: Bug >Reporter: Vinoth Sathappan > > The credentials passed through the Tez client constructor isn't available for > the tasks in session mode. > TezClient(String name, TezConfiguration tezConf, > @Nullable Map localResources, > @Nullable Credentials credentials) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1210) TezClientUtils.localizeDagPlanAsText() needs to be fixed for session mode
[ https://issues.apache.org/jira/browse/TEZ-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180180#comment-15180180 ] Bikas Saha commented on TEZ-1210: - The DAGPlan is downloaded as a local resource to be used to run the DAG. In session mode the AM is already running and can accept the DAGPlan over RPC. In non-session mode the AM is going to be launched (and there is no connection between it and the client) and thus the DAGPlan needs to be provided indirectly as a YARN LocalResource via HDFS. > TezClientUtils.localizeDagPlanAsText() needs to be fixed for session mode > - > > Key: TEZ-1210 > URL: https://issues.apache.org/jira/browse/TEZ-1210 > Project: Apache Tez > Issue Type: Bug >Reporter: Bikas Saha >Assignee: Alexander Pivovarov > Labels: newbie > Fix For: 0.5.2 > > Attachments: TEZ-1210.1.patch, TEZ-1210.2.patch > > > It writes the dagPlan in text form to the same location. Either it should not > be invoked in session mode or it should be written with a differentiating prefix. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-3149) Tez-tools: Add username in DagInfo
[ https://issues.apache.org/jira/browse/TEZ-3149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172648#comment-15172648 ] Bikas Saha commented on TEZ-3149: - lgtm. A backport to 0.7 would be good. Thanks! > Tez-tools: Add username in DagInfo > -- > > Key: TEZ-3149 > URL: https://issues.apache.org/jira/browse/TEZ-3149 > Project: Apache Tez > Issue Type: Improvement >Reporter: Rajesh Balamohan > Attachments: TEZ-3149.1.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-3014) OOM during Shuffle in JDK 8
[ https://issues.apache.org/jira/browse/TEZ-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15171245#comment-15171245 ] Bikas Saha commented on TEZ-3014: - [~jeagles] [~jlowe] Is this still an issue? OOM + JDK 8. If not, then we could close this. > OOM during Shuffle in JDK 8 > --- > > Key: TEZ-3014 > URL: https://issues.apache.org/jira/browse/TEZ-3014 > Project: Apache Tez > Issue Type: Bug >Reporter: Bikas Saha > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2580) Remove VertexManagerPlugin#setVertexParallelism with VertexManagerPlugin#reconfigureVertex
[ https://issues.apache.org/jira/browse/TEZ-2580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15171239#comment-15171239 ] Bikas Saha commented on TEZ-2580: - We can only change this if dependent projects like Hive stop using it or else they will fail to compile. Not sure if they have done that. > Remove VertexManagerPlugin#setVertexParallelism with > VertexManagerPlugin#reconfigureVertex > -- > > Key: TEZ-2580 > URL: https://issues.apache.org/jira/browse/TEZ-2580 > Project: Apache Tez > Issue Type: Bug >Reporter: Bikas Saha >Assignee: Tsuyoshi Ozawa >Priority: Blocker > Attachments: TEZ-2580.001.patch > > > This was deprecated in 0.7. Should be replaced with reconfigureVertex() - > change of name - to make it consistent with other reconfigureVertex() API's. > Should be done just close to release to enabled Hive to continue to build/use > master of Tez. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-3124) Running task hangs due to missing event to initialize input in recovery
[ https://issues.apache.org/jira/browse/TEZ-3124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15160058#comment-15160058 ] Bikas Saha commented on TEZ-3124: - So in this case the task needed an event to start, and so it hung. If the init-generated events list is legitimately empty, then the task will not hang and overall we will not hang. > Running task hangs due to missing event to initialize input in recovery > --- > > Key: TEZ-3124 > URL: https://issues.apache.org/jira/browse/TEZ-3124 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.8.2 >Reporter: Jeff Zhang >Assignee: Jeff Zhang > Labels: Recovery > Fix For: 0.8.3 > > Attachments: TEZ-3124-1.patch, TEZ-3124-2.patch, TEZ-3124-3.patch, > TEZ-3124-4.patch, TEZ-3124-5.patch, a.log > > > {noformat} > 2016-02-09 04:48:42 Starting to run new task attempt: > attempt_1454993155302_0001_1_00_61_3 > /attempt_1454993155302_0001_1_00_61 > 2016-02-09 04:48:43,196 [INFO] [I/O Setup 0 Initialize: {MRInput}] > |input.MRInput|: MRInput using newmapreduce API=true, split via event=true, > numPhysicalInputs=1 > 2016-02-09 04:48:43,200 [INFO] [I/O Setup 0 Initialize: {MRInput}] > |input.MRInputLegacy|: MRInput MRInputLegacy deferring initialization > 2016-02-09 04:48:43,333 [INFO] [TezChild] > |runtime.LogicalIOProcessorRuntimeTask|: Initialized processor > 2016-02-09 04:48:43,333 [INFO] [TezChild] > |runtime.LogicalIOProcessorRuntimeTask|: Waiting for 2 initializers to finish > 2016-02-09 04:48:43,333 [INFO] [TezChild] > |runtime.LogicalIOProcessorRuntimeTask|: Waiting for 1 initializers to finish > 2016-02-09 04:48:43,333 [INFO] [TezChild] > |runtime.LogicalIOProcessorRuntimeTask|: All initializers finished > 2016-02-09 04:48:43,345 [INFO] [TezChild] |resources.MemoryDistributor|: > InitialRequests=[MRInput:INPUT:0:org.apache.tez.mapreduce.input.MRInputLegacy], > > [ireduce1:OUTPUT:1802502144:org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput] > 2016-02-09 04:48:43,559 [INFO] [TezChild] 
> |resources.WeightedScalingMemoryDistributor|: > ScaleRatiosUsed=[PARTITIONED_UNSORTED_OUTPUT:1][UNSORTED_OUTPUT:1][UNSORTED_INPUT:1][SORTED_OUTPUT:12][SORTED_MERGED_INPUT:12][PROCESSOR:1][OTHER:1] > 2016-02-09 04:48:43,563 [INFO] [TezChild] > |resources.WeightedScalingMemoryDistributor|: InitialReservationFraction=0.3, > AdditionalReservationFractionForIOs=0.03, > finalReserveFractionUsed=0.32996 > 2016-02-09 04:48:43,564 [INFO] [TezChild] > |resources.WeightedScalingMemoryDistributor|: Scaling Requests. NumRequests: > 2, numScaledRequests: 13, TotalRequested: 1802502144, TotalRequestedScaled: > 1.663848132923077E9, TotalJVMHeap: 2577399808, TotalAvailable: 1726857871, > TotalRequested/TotalJVMHeap:0.70 > 2016-02-09 04:48:43,564 [INFO] [TezChild] |resources.MemoryDistributor|: > Allocations=[MRInput:org.apache.tez.mapreduce.input.MRInputLegacy:INPUT:0:0], > [ireduce1:org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput:OUTPUT:1802502144:1726857871] > 2016-02-09 04:48:43,564 [INFO] [TezChild] > |runtime.LogicalIOProcessorRuntimeTask|: Starting Inputs/Outputs > 2016-02-09 04:48:43,572 [INFO] [I/O Setup 1 Start: {MRInput}] > |runtime.LogicalIOProcessorRuntimeTask|: Started Input with src edge: MRInput > 2016-02-09 04:48:43,572 [INFO] [TezChild] > |runtime.LogicalIOProcessorRuntimeTask|: Input: MRInput being auto started by > the framework. 
Subsequent instances will not be auto-started > 2016-02-09 04:48:43,573 [INFO] [TezChild] > |runtime.LogicalIOProcessorRuntimeTask|: Num IOs determined for AutoStart: 1 > 2016-02-09 04:48:43,574 [INFO] [TezChild] > |runtime.LogicalIOProcessorRuntimeTask|: Waiting for 1 IOs to start > 2016-02-09 04:48:43,574 [INFO] [TezChild] > |runtime.LogicalIOProcessorRuntimeTask|: AutoStartComplete > 2016-02-09 04:48:43,583 [INFO] [TezChild] |task.TaskRunner2Callable|: Running > task, taskAttemptId=attempt_1454993155302_0001_1_00_61_3 > 2016-02-09 04:48:43,583 [INFO] [TezChild] |map.MapProcessor|: Running map: > attempt_1454993155302_0001_1_00_61_3_10001 > 2016-02-09 04:48:43,675 [INFO] [TezChild] |impl.ExternalSorter|: ireduce1 > using: memoryMb=1646, keySerializerClass=class > org.apache.hadoop.io.IntWritable, > valueSerializerClass=org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer@5f143de6, > comparator=org.apache.hadoop.io.IntWritable$Comparator@ec52d1f, > partitioner=org.apache.tez.mapreduce.partition.MRPartitioner, > serialization=org.apache.hadoop.io.serializer.WritableSerialization > 2016-02-09 04:48:43,686 [INFO] [TezChild] |impl.PipelinedSorter|: Setting up > PipelinedSorter for ireduce1: , UsingHashComparator=false > 2016-02-09 04:48:45,093 [INFO] [TezChild] |impl.PipelinedSorter|: Newly >
[jira] [Commented] (TEZ-3124) Running task hangs due to missing event to initialize input in recovery
[ https://issues.apache.org/jira/browse/TEZ-3124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15160059#comment-15160059 ] Bikas Saha commented on TEZ-3124: - lgtm. +1. Thanks! > Running task hangs due to missing event to initialize input in recovery > --- > > Key: TEZ-3124 > URL: https://issues.apache.org/jira/browse/TEZ-3124 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.8.2 >Reporter: Jeff Zhang >Assignee: Jeff Zhang > Labels: Recovery > Fix For: 0.8.3 > > Attachments: TEZ-3124-1.patch, TEZ-3124-2.patch, TEZ-3124-3.patch, > TEZ-3124-4.patch, TEZ-3124-5.patch, a.log > > > {noformat} > 2016-02-09 04:48:42 Starting to run new task attempt: > attempt_1454993155302_0001_1_00_61_3 > /attempt_1454993155302_0001_1_00_61 > 2016-02-09 04:48:43,196 [INFO] [I/O Setup 0 Initialize: {MRInput}] > |input.MRInput|: MRInput using newmapreduce API=true, split via event=true, > numPhysicalInputs=1 > 2016-02-09 04:48:43,200 [INFO] [I/O Setup 0 Initialize: {MRInput}] > |input.MRInputLegacy|: MRInput MRInputLegacy deferring initialization > 2016-02-09 04:48:43,333 [INFO] [TezChild] > |runtime.LogicalIOProcessorRuntimeTask|: Initialized processor > 2016-02-09 04:48:43,333 [INFO] [TezChild] > |runtime.LogicalIOProcessorRuntimeTask|: Waiting for 2 initializers to finish > 2016-02-09 04:48:43,333 [INFO] [TezChild] > |runtime.LogicalIOProcessorRuntimeTask|: Waiting for 1 initializers to finish > 2016-02-09 04:48:43,333 [INFO] [TezChild] > |runtime.LogicalIOProcessorRuntimeTask|: All initializers finished > 2016-02-09 04:48:43,345 [INFO] [TezChild] |resources.MemoryDistributor|: > InitialRequests=[MRInput:INPUT:0:org.apache.tez.mapreduce.input.MRInputLegacy], > > [ireduce1:OUTPUT:1802502144:org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput] > 2016-02-09 04:48:43,559 [INFO] [TezChild] > |resources.WeightedScalingMemoryDistributor|: > 
ScaleRatiosUsed=[PARTITIONED_UNSORTED_OUTPUT:1][UNSORTED_OUTPUT:1][UNSORTED_INPUT:1][SORTED_OUTPUT:12][SORTED_MERGED_INPUT:12][PROCESSOR:1][OTHER:1] > 2016-02-09 04:48:43,563 [INFO] [TezChild] > |resources.WeightedScalingMemoryDistributor|: InitialReservationFraction=0.3, > AdditionalReservationFractionForIOs=0.03, > finalReserveFractionUsed=0.32996 > 2016-02-09 04:48:43,564 [INFO] [TezChild] > |resources.WeightedScalingMemoryDistributor|: Scaling Requests. NumRequests: > 2, numScaledRequests: 13, TotalRequested: 1802502144, TotalRequestedScaled: > 1.663848132923077E9, TotalJVMHeap: 2577399808, TotalAvailable: 1726857871, > TotalRequested/TotalJVMHeap:0.70 > 2016-02-09 04:48:43,564 [INFO] [TezChild] |resources.MemoryDistributor|: > Allocations=[MRInput:org.apache.tez.mapreduce.input.MRInputLegacy:INPUT:0:0], > [ireduce1:org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput:OUTPUT:1802502144:1726857871] > 2016-02-09 04:48:43,564 [INFO] [TezChild] > |runtime.LogicalIOProcessorRuntimeTask|: Starting Inputs/Outputs > 2016-02-09 04:48:43,572 [INFO] [I/O Setup 1 Start: {MRInput}] > |runtime.LogicalIOProcessorRuntimeTask|: Started Input with src edge: MRInput > 2016-02-09 04:48:43,572 [INFO] [TezChild] > |runtime.LogicalIOProcessorRuntimeTask|: Input: MRInput being auto started by > the framework. 
Subsequent instances will not be auto-started > 2016-02-09 04:48:43,573 [INFO] [TezChild] > |runtime.LogicalIOProcessorRuntimeTask|: Num IOs determined for AutoStart: 1 > 2016-02-09 04:48:43,574 [INFO] [TezChild] > |runtime.LogicalIOProcessorRuntimeTask|: Waiting for 1 IOs to start > 2016-02-09 04:48:43,574 [INFO] [TezChild] > |runtime.LogicalIOProcessorRuntimeTask|: AutoStartComplete > 2016-02-09 04:48:43,583 [INFO] [TezChild] |task.TaskRunner2Callable|: Running > task, taskAttemptId=attempt_1454993155302_0001_1_00_61_3 > 2016-02-09 04:48:43,583 [INFO] [TezChild] |map.MapProcessor|: Running map: > attempt_1454993155302_0001_1_00_61_3_10001 > 2016-02-09 04:48:43,675 [INFO] [TezChild] |impl.ExternalSorter|: ireduce1 > using: memoryMb=1646, keySerializerClass=class > org.apache.hadoop.io.IntWritable, > valueSerializerClass=org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer@5f143de6, > comparator=org.apache.hadoop.io.IntWritable$Comparator@ec52d1f, > partitioner=org.apache.tez.mapreduce.partition.MRPartitioner, > serialization=org.apache.hadoop.io.serializer.WritableSerialization > 2016-02-09 04:48:43,686 [INFO] [TezChild] |impl.PipelinedSorter|: Setting up > PipelinedSorter for ireduce1: , UsingHashComparator=false > 2016-02-09 04:48:45,093 [INFO] [TezChild] |impl.PipelinedSorter|: Newly > allocated block size=1725956096, index=0, Number of buffers=1, > currentAllocatableMemory=0, currentBufferSize=1725956096, total=1725956096 >
[jira] [Commented] (TEZ-3124) Running task hangs due to missing event to initialize input in recovery
[ https://issues.apache.org/jira/browse/TEZ-3124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15159946#comment-15159946 ] Bikas Saha commented on TEZ-3124: - Then the fix should be restricted to not logging VertexInitializedEvent if shouldSkipInit is true. Initializing initGeneratedEvents to the old value might have a side effect when shouldSkipInit is false: in that case init will run again, and initGeneratedEvents could contain both old recovered events and new init-generated events. Can this happen? Even if not, why add the side effect of initializing initGeneratedEvents? Your explanation makes sense for the fix; my concern is the change to initGeneratedEvents. Orthogonally, initGeneratedEvents could be empty even after init. This is valid. Will that be a problem? Asking because in this case we got hung because the vertex initialized event had empty initGeneratedEvents. > Running task hangs due to missing event to initialize input in recovery > --- > > Key: TEZ-3124 > URL: https://issues.apache.org/jira/browse/TEZ-3124 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.8.2 >Reporter: Jeff Zhang >Assignee: Jeff Zhang > Labels: Recovery > Fix For: 0.8.3 > > Attachments: TEZ-3124-1.patch, TEZ-3124-2.patch, TEZ-3124-3.patch, TEZ-3124-4.patch, a.log
[jira] [Commented] (TEZ-3102) Fetch failure of a speculated task causes job hang
[ https://issues.apache.org/jira/browse/TEZ-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15159835#comment-15159835 ] Bikas Saha commented on TEZ-3102: - +1. I think testTaskSucceedAndRetroActiveFailure() should be covering the new code changes in the success attempt code path. In the small chance that it's not, would you please update the test. Thanks! > Fetch failure of a speculated task causes job hang > -- > > Key: TEZ-3102 > URL: https://issues.apache.org/jira/browse/TEZ-3102 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.7.0 >Reporter: Jason Lowe >Assignee: Jason Lowe >Priority: Critical > Attachments: TEZ-3102.001.patch, TEZ-3102.002.patch > > > If a task speculates then succeeds, one task will be marked successful and > the other killed. Then if the task retroactively fails due to fetch failures > the Tez AM will fail to reschedule another task. This results in a hung job. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TEZ-3102) Fetch failure of a speculated task causes job hang
[ https://issues.apache.org/jira/browse/TEZ-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15159835#comment-15159835 ] Bikas Saha edited comment on TEZ-3102 at 2/23/16 11:09 PM: --- +1. I think testTaskSucceedAndRetroActiveFailure() should already be covering the new code changes in the success attempt code path. In the small chance that it's not, would you please update the test. Thanks! was (Author: bikassaha): +1. I think testTaskSucceedAndRetroActiveFailure() should be covering the new code changes in the success attempt code path. In the small chance that it's not, would you please update the test. Thanks! > Fetch failure of a speculated task causes job hang > -- > > Key: TEZ-3102 > URL: https://issues.apache.org/jira/browse/TEZ-3102 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.7.0 >Reporter: Jason Lowe >Assignee: Jason Lowe >Priority: Critical > Attachments: TEZ-3102.001.patch, TEZ-3102.002.patch > > > If a task speculates then succeeds, one task will be marked successful and > the other killed. Then if the task retroactively fails due to fetch failures > the Tez AM will fail to reschedule another task. This results in a hung job. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-3124) Running task hangs due to missing event to initialize input in recovery
[ https://issues.apache.org/jira/browse/TEZ-3124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15159548#comment-15159548 ] Bikas Saha commented on TEZ-3124: - Let's say shouldSkipInit() is false because VertexInitializedEvent != null but ConfigurationDoneEvent == null. So we will rerun init, and then we will log another VertexInitializedEvent. Right? In that case, how will the next AM attempt handle multiple VertexInitializedEvents? If we are doing init again, then that process will add new items into initGeneratedEvents. So we should not be restoring older initGeneratedEvents into the new object, or else the new object will have more items than necessary. So I am not sure what is broken and how the fix is working. Could you please help by pointing out the exact sequence of events that causes the issue? Thanks! > Running task hangs due to missing event to initialize input in recovery > --- > > Key: TEZ-3124 > URL: https://issues.apache.org/jira/browse/TEZ-3124 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.8.2 >Reporter: Jeff Zhang >Assignee: Jeff Zhang > Labels: Recovery > Fix For: 0.8.3 > > Attachments: TEZ-3124-1.patch, TEZ-3124-2.patch, TEZ-3124-3.patch, TEZ-3124-4.patch, a.log
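The duplication concern raised in the TEZ-3124 comment above can be sketched with a toy example. The list and event names here are illustrative stand-ins, not the actual Tez fields or event classes:

```java
import java.util.ArrayList;
import java.util.List;

public class RecoveryEventsDemo {
    // Restore recovered events into the live list, then simulate init rerunning
    // and regenerating the same event: the list ends up with a duplicate entry.
    static List<String> restoreThenRerunInit(List<String> recovered) {
        List<String> initGeneratedEvents = new ArrayList<>(recovered); // restore old value
        initGeneratedEvents.add("inputDataInformationEvent"); // rerun of init regenerates it
        return initGeneratedEvents;
    }

    public static void main(String[] args) {
        List<String> events = restoreThenRerunInit(List.of("inputDataInformationEvent"));
        System.out.println(events.size()); // 2: one recovered copy plus one regenerated copy
    }
}
```

This is why restoring the old list is only safe when init is actually skipped; if init reruns, the restored events are counted twice.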
[jira] [Commented] (TEZ-2962) Use per partition stats in shuffle vertex manager auto parallelism
[ https://issues.apache.org/jira/browse/TEZ-2962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15159414#comment-15159414 ] Bikas Saha commented on TEZ-2962: - The downside of partition stats is that the values are approximate, in buckets of 1MB/10MB/100MB etc. So a 100MB stat could imply 900MB of actual data, which makes respecting a max data size per task tricky. > Use per partition stats in shuffle vertex manager auto parallelism > -- > > Key: TEZ-2962 > URL: https://issues.apache.org/jira/browse/TEZ-2962 > Project: Apache Tez > Issue Type: Bug >Reporter: Bikas Saha >Priority: Critical > > The original code used output size sent by completed tasks. Recently per > partition stats have been added that provide granular information. Using > partition stats may be more accurate and also remove the duplicate counting > of data size in partition stats and per task overall. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
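The bucketing caveat from the TEZ-2962 comment can be made concrete with a small sketch. The bucket boundaries and method name below are assumptions for illustration; the real Tez partition-stats encoding may use different boundaries:

```java
public class PartitionStatsBucketing {
    static final long MB = 1024L * 1024L;

    // Report a size as the floor of assumed order-of-magnitude buckets.
    static long bucketFloor(long actualBytes) {
        long[] buckets = {1 * MB, 10 * MB, 100 * MB, 1024 * MB};
        long floor = 0;
        for (long b : buckets) {
            if (actualBytes >= b) {
                floor = b;
            }
        }
        return floor;
    }

    public static void main(String[] args) {
        // 900MB of real partition output is still reported in the 100MB bucket,
        // so a planner enforcing "max data size per task" from the reported
        // stat can underestimate the true input by roughly 9x.
        System.out.println(bucketFloor(900 * MB) / MB); // prints 100
    }
}
```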
[jira] [Commented] (TEZ-3126) Log reason for not reducing parallelism
[ https://issues.apache.org/jira/browse/TEZ-3126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15158125#comment-15158125 ] Bikas Saha commented on TEZ-3126: - lgtm > Log reason for not reducing parallelism > --- > > Key: TEZ-3126 > URL: https://issues.apache.org/jira/browse/TEZ-3126 > Project: Apache Tez > Issue Type: Bug >Reporter: Jonathan Eagles >Assignee: Jonathan Eagles >Priority: Critical > Attachments: TEZ-3126.1.patch, TEZ-3126.2.patch > > > For example, when reducing parallelism from 36 to 22. The basePartitionRange > will be 1 and will not re-configure the vertex. > {code:java|title=ShuffleVertexManager#determineParallelismAndApply|borderStyle=dashed|bgColor=lightgrey} > int desiredTaskParallelism = > (int)( > (expectedTotalSourceTasksOutputSize+desiredTaskInputDataSize-1)/ > desiredTaskInputDataSize); > if(desiredTaskParallelism < minTaskParallelism) { > desiredTaskParallelism = minTaskParallelism; > } > > if(desiredTaskParallelism >= currentParallelism) { > return true; > } > > // most shufflers will be assigned this range > basePartitionRange = currentParallelism/desiredTaskParallelism; > > if (basePartitionRange <= 1) { > // nothing to do if range is equal 1 partition. shuffler does it by > default > return true; > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
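The 36-to-22 case in the TEZ-3126 description comes down to the integer division in the quoted snippet; a minimal standalone sketch (class name hypothetical):

```java
public class BasePartitionRangeDemo {
    // Mirrors the integer math from ShuffleVertexManager#determineParallelismAndApply.
    static int basePartitionRange(int currentParallelism, int desiredTaskParallelism) {
        return currentParallelism / desiredTaskParallelism;
    }

    public static void main(String[] args) {
        // 36 current tasks, 22 desired: 36 / 22 == 1, so the vertex is not reconfigured
        System.out.println(basePartitionRange(36, 22)); // prints 1
        // the range only exceeds 1 once the reduction is at least 2x
        System.out.println(basePartitionRange(36, 18)); // prints 2
    }
}
```

Logging this computed range, as the patch proposes, makes the otherwise silent "nothing to do" early return visible to users.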
[jira] [Commented] (TEZ-3131) Support a way to override test_root_dir for FaultToleranceTestRunner
[ https://issues.apache.org/jira/browse/TEZ-3131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15157572#comment-15157572 ] Bikas Saha commented on TEZ-3131: - Sure. Please go ahead. +1. > Support a way to override test_root_dir for FaultToleranceTestRunner > > > Key: TEZ-3131 > URL: https://issues.apache.org/jira/browse/TEZ-3131 > Project: Apache Tez > Issue Type: Bug >Reporter: Hitesh Shah >Assignee: Hitesh Shah >Priority: Minor > Attachments: TEZ-3131.1.patch > > > The path is hardcoded. For regression testing, it will be useful if it can be > overridden via command-line if needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)