[ https://issues.apache.org/jira/browse/TEZ-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304797#comment-14304797 ]
Jeff Zhang commented on TEZ-391:
--------------------------------

bq. We should probably name it GroupOutputEdge to be symmetric with GroupInputEdge.
I still think ShareOutputEdge is more suitable. For GroupInputEdge there are multiple inputs from upstream vertices, and we group them together into a GroupInput. For ShareOutputEdge there is actually only one output from the upstream vertex. So from a semantic perspective I think ShareOutputEdge is better. Besides, VertexImpl already has a concept of SharedOutput (VertexImpl::addSharedOutputs) for output to a data sink. I think it would be much better to rename that kind of output to GroupOutput.

bq. Is the design suggesting that a group output edge expand into standard edges with additional metadata at the source vertex which will enable its TezChild to provide a single output to its tasks even though there are multiple consumers?
Yes. TezChild has only one output, but it would send multiple events to the AM based on the additional metadata about the share edge.

bq. What happens to fault tolerance? If a destination vertex reports an error about a shared source then what should happen in other destination vertices that are sharing that source?
The upstream vertex will get the InputReadErrorEvent and would send the InputFailedEvent to both downstream vertices. In theory this should not be a problem. But you are right, I need to highlight these cases and verify them in unit tests.

bq. Related: When an output of a task is marked bad then it sends an InputFailed event to its destination tasks. This happens in the AM and needs to be sent to all destination tasks of a shared output. So the AM routing would need to take into account shared outputs for this case.
The AM knows the standard edges that are expanded from the share edge, so all the downstream vertices will get the InputFailed event.

bq. Can it happen that a VertexGroup is connected to another VertexGroup? What use case would that be?
Good question.
This case would be two unions joined together, with one of them being the replicated side. In this case the edges between the vertex groups would be both GroupInputEdge and ShareOutputEdge. I need to look into it more deeply.

{code}
a = load 'file:///tmp/input' as (x:int, y:chararray);
b = load 'file:///tmp/input' as (y:chararray, x:int);
c = union onschema a, b;
d = load 'file:///tmp/input1' as (x:int, z:chararray);
e = load 'file:///tmp/input2' as (x:int, z:chararray);
f = union onschema d, e;
g = join c by x, f by x using 'replicated';
store g into 'file:///tmp/output';
{code}

Besides, I am wondering whether it is necessary to expose GroupInputEdge/ShareOutputEdge as public API. The user just needs to create an edge by connecting one Vertex/VertexGroup to another Vertex/VertexGroup (2 by 2 cases):
* If the destination is a vertex group, its members share a single copy of the output from the source, no matter whether the source is a vertex or a vertex group.
* If the source is a vertex group, the destination uses the merged input from the source, no matter whether the destination is a vertex or a vertex group.

> SharedEdge - Support for passing same output from a vertex as input to two different vertices
> ---------------------------------------------------------------------------------------------
>
>                 Key: TEZ-391
>                 URL: https://issues.apache.org/jira/browse/TEZ-391
>             Project: Apache Tez
>          Issue Type: Sub-task
>            Reporter: Rohini Palaniswamy
>            Assignee: Jeff Zhang
>         Attachments: Shared Edge Design.pdf, TEZ-391-WIP-1.patch, TEZ-391-WIP-2.patch, TEZ-391-WIP-3.patch
>
>
> We need this for a lot of use cases: cases where multi-query is turned off, and optimizing unions. Currently those are BROADCAST or ONE-ONE edges and we write the output multiple times.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
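
For reference, the "2 by 2" rule proposed in the comment above can be written as a small decision function. This is only a sketch of the proposal, not Tez code: the class name EdgeKindSketch, the Semantics enum, and the classify method are hypothetical names made up for illustration, and the real API might look quite different.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch of the proposed rule; none of these names exist in Tez.
final class EdgeKindSketch {

    // Illustrative labels for the two behaviours discussed in the comment.
    enum Semantics { MERGED_INPUT, SHARED_OUTPUT }

    // Derives the edge semantics purely from the kinds of the two endpoints.
    // A vertex group as source implies merged (grouped) input at the destination;
    // a vertex group as destination implies one shared copy of the source output.
    static Set<Semantics> classify(boolean sourceIsGroup, boolean destinationIsGroup) {
        Set<Semantics> result = EnumSet.noneOf(Semantics.class);
        if (sourceIsGroup) {
            result.add(Semantics.MERGED_INPUT);
        }
        if (destinationIsGroup) {
            result.add(Semantics.SHARED_OUTPUT);
        }
        return result;
    }

    public static void main(String[] args) {
        boolean[] kinds = {false, true};
        // Enumerate all four endpoint combinations.
        for (boolean sourceIsGroup : kinds) {
            for (boolean destinationIsGroup : kinds) {
                String source = sourceIsGroup ? "VertexGroup" : "Vertex";
                String destination = destinationIsGroup ? "VertexGroup" : "Vertex";
                System.out.println(source + " -> " + destination + ": "
                        + classify(sourceIsGroup, destinationIsGroup));
            }
        }
    }
}
```

Under this reading, the VertexGroup-to-VertexGroup case (the union-join-union Pig script) would carry both behaviours at once, which matches the remark that the edge there would be both a GroupInputEdge and a ShareOutputEdge.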