Hi,

In Spark's in-memory execution, when nothing is cached, elements are
processed in an iterator-based streaming style [
http://www.slideshare.net/liancheng/dtcc-14-spark-runtime-internals?next_slideshow=1
]

I have two questions:


   1. if elements are read one line at a time from HDFS (disk) and then
   transformed according to the RDD operations, how is this efficient?
   2. which class in the Spark source does this? I'm expecting a loop of
   roughly this shape:

           for (line <- iterator_over_a_partition)
               apply_transformation(read_hdfs_line(line))

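To make the question concrete, here is a minimal sketch of the streaming
style I mean, using plain Scala iterators and no Spark at all (all names and
data are made up for illustration): each "transformation" just wraps the
parent iterator, so elements flow through one at a time and no intermediate
collection is materialized.

```scala
object IteratorPipeline {
  // Stand-in for reading the lines of one HDFS partition (hypothetical data).
  def readPartition(): Iterator[String] =
    Iterator("1", "2", "3")

  // Each step lazily wraps the previous iterator; nothing runs yet.
  def pipeline(): Iterator[Int] =
    readPartition()
      .map(_.toInt)       // analogous to rdd.map
      .filter(_ % 2 == 1) // analogous to rdd.filter

  def main(args: Array[String]): Unit = {
    // Elements are only read and transformed when the iterator is consumed.
    println(pipeline().toList) // List(1, 3)
  }
}
```

Is this, roughly, what Spark does per partition, and if so, where in the
source does the per-element loop live?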

Thanks,
