Hi,

I'm learning Spark, and I think there is room to optimize the current
streaming implementation. Correct me if I'm wrong.

The current streaming implementation puts the data of each batch into
memory (as an RDD), but that does not seem necessary in every case.

For example, if I want to count the lines that contain the word "Spark",
I just need to map every line to check whether it contains the word, and
then reduce with a sum function. After the map, there is no reason to
keep the line in memory any longer.
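
Here is a minimal sketch of the kind of job I mean (the socket source,
host, and port are just placeholders I picked for illustration):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SparkLineCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SparkLineCount")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // Placeholder source for illustration.
    val lines  = ssc.socketTextStream("localhost", 9999)
    val counts = lines
      .map(line => if (line.contains("Spark")) 1L else 0L) // line unused after this
      .reduce(_ + _)                                       // per-batch sum

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}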

That is to say, if the DStream only has map and/or reduce operations on
it, it is not necessary to keep all the batch data in memory. Something
like a pipeline should be enough, as in the sketch below.
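
Concretely, I imagine evaluation along these lines (just plain Scala to
illustrate the idea, not an existing Spark API), where each line is
consumed and dropped as soon as it has been mapped, so the batch is
never materialized:

def countMatches(lines: Iterator[String]): Long =
  lines.foldLeft(0L) { (acc, line) =>
    acc + (if (line.contains("Spark")) 1L else 0L) // line is dropped right here
  }

countMatches(Iterator("Spark rocks", "hello")) // => 1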

Would it be difficult to implement this on top of the current implementation?

Thanks.

---
Bin Wang
