Hi all,

According to the VLDB '09 paper, the split operator and all of its downstream operators run in memory, with no blocking in between. However, the source code (version 0.7) shows that an MR job actually ends when it reaches the split operator, and multiple new MR jobs are then created, one per branch. This write-once, read-many approach differs from the in-memory approach described in the paper. Has Pig changed its strategy for split, or is there still an in-memory version of split that I haven't found?
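
To make the distinction concrete, here is a toy Python sketch of the two strategies as I understand them. It is not Pig code: the function names are made up for illustration, and a local temp file stands in for the intermediate output that a finished MR job would write.

```python
import json
import os
import tempfile


def split_in_memory(records, predicates):
    """Single pass: each record is routed straight to every matching
    branch, with no intermediate file (the behaviour the paper describes)."""
    branches = [[] for _ in predicates]
    for rec in records:
        for i, pred in enumerate(predicates):
            if pred(rec):
                branches[i].append(rec)
    return branches


def split_via_materialization(records, predicates):
    """Write the input once, then have each branch re-read it separately
    (roughly what I see in 0.7: one job ends, one new job per branch)."""
    fd, path = tempfile.mkstemp(suffix=".jsonl")
    try:
        # "First job": materialize the data feeding the split.
        with os.fdopen(fd, "w") as f:
            for rec in records:
                f.write(json.dumps(rec) + "\n")
        # "Follow-up jobs": every branch scans the whole file again.
        branches = []
        for pred in predicates:
            with open(path) as f:
                branches.append([r for r in map(json.loads, f) if pred(r)])
        return branches
    finally:
        os.remove(path)
```

Both functions return the same branches; the only difference is whether the data is written out and re-read once per branch, which is the extra I/O I am asking about.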
Thanks,
-Gang