Hello!

Spark Streaming supports HDFS as an input source, as well as Akka actor
receivers and TCP socket receivers.

For my use case I think it's probably more convenient to read the data
directly from actors, because I already need to set up a multi-node Akka
cluster (on the same nodes that Spark runs on) and write some actors to
perform some parallel operations. Writing actor receivers that consume the
results of my business-logic actors and feed them into Spark is pretty
seamless. Note that the actors generate a large amount of data (a few GB to
tens of GB).

The other option would be to set up HDFS on the same cluster as Spark, write
the data from the actors to HDFS, and then use HDFS as the input source for
Spark Streaming. Does this result in better performance due to data locality
(with HDFS data replication turned on)? I think performance should be almost
the same with actors, since Spark workers local to the worker actors should
get the data fast. I assume some optimization like this is already done?

I suppose the only benefit of HDFS would be better fault tolerance, and
the ability to checkpoint and recover even if the master fails.

Cheers,
Nilesh


