> same box, data could be piped between two; when they are on separate > boxes, data can be sent over the network.
The data transfer modes are logically identical, whether the consumer is local or remote. The actual physical transfer follows either network (for remote) or reading directly off the output of the previous task (for local). This is not a FIFO stream record-by-record, but materialized into chunks (pipelined shuffle) and moved across. > (2) An edge property where same output of a vertex > could be consumed by multiple down-stream vertices. Do you mean vertices or tasks? The runtime physical explosion of vertex into tasks produces an expansion in the number of physical connections. The broadcast edge is the same data going to different tasks of the same vertex, like a hash-join. In case of wanting to move the same data to more than one logical vertex, that does not exist today but the "edge manager plugin" is a user plugin, so a Tez user can implement their own edges & outputs in user code to route data however they want. > (3) An edge property where one node could generate > two different data streams; one stream goes to one sub-sequent Yup. See for example the way PIG uses Tez. http://www.slideshare.net/Hadoop_Summit/pig-on-tez-low-latency-data-processing-with-big-data/19 or in Hive multi-inserts <http://www.slideshare.net/Hadoop_Summit/w-235phall1pandey/16> Cheers, Gopal
