> same box, data could be piped between two; when they are on separate
> boxes, data can be sent over the network.

The data transfer modes are logically identical, whether the consumer is local 
or remote.

The physical transfer either goes over the network (for remote) or reads 
directly off the output of the previous task (for local).

This is not a record-by-record FIFO stream; the data is materialized into 
chunks (pipelined shuffle) and the chunks are moved across.
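
To make the distinction concrete, here is a minimal Python sketch (not Tez code) of a producer that buffers records and hands over whole chunks instead of one record at a time. The names `chunked` and `chunk_size` are made up for illustration, and the sketch leaves out everything else the real shuffle does.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunked(records: Iterable[T], chunk_size: int) -> Iterator[List[T]]:
    """Group records into materialized chunks (hypothetical helper, not a Tez API)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    buffer: List[T] = []
    for record in records:
        buffer.append(record)
        if len(buffer) == chunk_size:
            # A full chunk is handed over as one unit.
            yield buffer
            buffer = []
    if buffer:
        # The last chunk may be smaller than chunk_size.
        yield buffer
```

The consumer only ever sees complete chunks, so it can start on the first chunk while later ones are still being produced.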

> (2) An edge property where same output of a vertex
> could be consumed by multiple down-stream vertices.

Do you mean vertices or tasks? 

At runtime each vertex expands into multiple tasks, which multiplies the 
number of physical connections.

The broadcast edge is the same data going to different tasks of the same 
vertex, as in a hash-join.
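
As a rough illustration of how the expansion into tasks multiplies connections, here is a back-of-the-envelope Python model. The movement names mirror Tez's data movement types (ONE_TO_ONE, BROADCAST, SCATTER_GATHER), but the function `physical_connections` itself is hypothetical and only counts task-to-task links.

```python
def physical_connections(movement: str, src_tasks: int, dst_tasks: int) -> int:
    """Count task-to-task links for one logical edge (illustrative model only)."""
    if src_tasks <= 0 or dst_tasks <= 0:
        raise ValueError("task counts must be positive")
    if movement == "ONE_TO_ONE":
        # Task i of the source feeds task i of the destination.
        if src_tasks != dst_tasks:
            raise ValueError("one-to-one needs equal task counts")
        return src_tasks
    if movement in ("BROADCAST", "SCATTER_GATHER"):
        # Broadcast: every destination task reads the full output of every source task.
        # Scatter-gather: every destination task reads one partition from every source task.
        # The link count is the same; only what flows over each link differs.
        return src_tasks * dst_tasks
    raise ValueError(f"unknown movement type: {movement}")
```

For example, a single logical broadcast edge from 3 source tasks to 4 destination tasks becomes 12 physical connections in this model.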

Moving the same data to more than one logical vertex is not supported today, 
but the "edge manager plugin" is a user plugin, so a Tez user can implement 
their own edges & outputs in user code to route data however they want.
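
A toy Python model of that idea, assuming nothing about the real plugin interface: routing is just a user-supplied function from a source task to the destinations that should receive its output. All names here (`route_output`, `fan_out_to_both`, `vertex_a`, `vertex_b`) are invented for the sketch.

```python
from typing import Callable, Dict, List

# Hypothetical routing rule: source task index -> destination vertex names.
Route = Callable[[int], List[str]]


def route_output(chunks_by_task: Dict[int, List[str]], route: Route) -> Dict[str, List[str]]:
    """Deliver each source task's chunks to every destination the rule names."""
    delivered: Dict[str, List[str]] = {}
    for task_index in sorted(chunks_by_task):
        for vertex in route(task_index):
            delivered.setdefault(vertex, []).extend(chunks_by_task[task_index])
    return delivered


def fan_out_to_both(task_index: int) -> List[str]:
    """Send the same output to two logical vertices, regardless of source task."""
    return ["vertex_a", "vertex_b"]
```

Swapping in a different `Route` function changes where the data goes without touching the producer, which is the point of keeping routing in user code.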

>  (3) An edge property where one node could generate
> two different data streams; one stream goes to one sub-sequent

Yup. See, for example, the way Pig uses Tez.

http://www.slideshare.net/Hadoop_Summit/pig-on-tez-low-latency-data-processing-with-big-data/19

or in Hive multi-inserts

<http://www.slideshare.net/Hadoop_Summit/w-235phall1pandey/16>

Cheers,
Gopal


