Thank you so much, this is a really well described answer :) On Sun, Jan 31, 2016 at 7:57 PM Hitesh Shah <[email protected]> wrote:
> There are 3 types defined as you have noticed: > > persisted_reliable: assumes a vertex output is stored in a reliable store > like HDFS. This states that if the node on which the task ran disappears, > the output is still available. > persisted: vertex output stored on local disk where the task ran. > ephemeral: vertex output stored in task memory. > > From a data transmission point of view, all data is always transmitted > over network unless there is a case where the downstream task is running on > same machine as the task that generated the output. In that case, it can > read from local disk if needed. > > You are right that the in-memory support is not built out so a co-located > task potentially reading from another task’s memory is therefore not > supported today. The network channel requires a bit more explanation. > Generally, all data is persisted to disk. This means that data transferred > over the network is first written locally and then eventually pulled/pushed > to a downstream task as needed. This does not mean that all the data needs > to be generated first before being sent downstream. Data can be still > generated in “chunks” and then sent downstream as when as a chunk becomes > available. ( this functionality is internally called “pipelined shuffle” if > you end up searching through the code/jiras ). However, again to clarify, > there is no pure streaming support yet where data is kept in memory and > pushed downstream. > > To add, the in-memory approach requires changing Tez to have a different > fault tolerance model to be applied and it also needs more cluster capacity > to ensure that both upstream and downstream tasks can run concurrently. Do > you see this as a requirement for something that you are planning to use > Tez for? > > thanks > — Hitesh > > On Jan 31, 2016, at 12:30 AM, Gal Vinograd <[email protected]> wrote: > > > Hey, > > > > I read though tez-api code and noticed that PERSISTED is the only stable > data source type. Does that mean that data isn't transmitted between > vertices though in-memory or network channels? > > > > Thanks :) > >
