Hello! Spark Streaming supports HDFS as an input source, as well as Akka actor receivers and TCP socket receivers.
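(For reference, the HDFS and socket variants are wired up roughly like this; just a minimal sketch against the 1.x-era StreamingContext API, with made-up host names and paths. A sketch of the actor variant is at the end of this mail.)

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("InputSourceSketch")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // HDFS directory source: every new file that lands in the directory
    // is read into the stream's next batch.
    val fileLines = ssc.textFileStream("hdfs://namenode:8020/streaming/in")

    // Raw TCP socket source: one line of text per record, pulled by a single receiver.
    val socketLines = ssc.socketTextStream("stream-host", 9999)

    fileLines.union(socketLines).count().print()
    ssc.start()
    ssc.awaitTermination()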
For my use case I think it's probably more convenient to read the data directly from actors, because I already need to set up a multi-node Akka cluster (on the same nodes that Spark runs on) and write actors to perform some parallel operations. Writing actor receivers that consume the results of my business-logic actors and feed them into Spark is pretty seamless. Note that the actors generate a large amount of data (a few GBs to tens of GBs).

The other option would be to set up HDFS on the same cluster as Spark, have the actors write their data to HDFS, and then use HDFS as the input source for Spark Streaming. Would that give better performance thanks to data locality (with HDFS data replication turned on)? I think performance should be almost the same with actors, since Spark workers local to the worker actors should get the data fast; I assume Spark already does some locality optimization like this? I suppose the only benefits of HDFS would be better fault tolerance, and the ability to checkpoint and recover even if the master fails.

Cheers,
Nilesh
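P.S. For concreteness, here is a minimal sketch of the kind of actor receiver I mean, assuming the ActorHelper-based API from the Spark 1.x line (the class name, message type and stream name are just placeholders):

    import akka.actor.{Actor, Props}
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.receiver.ActorHelper

    // The business-logic actors send their results to this actor; store() hands
    // each record to Spark's block manager on the worker hosting the receiver.
    class ResultReceiver extends Actor with ActorHelper {
      def receive = {
        case record: String => store(record)
      }
    }

    val conf = new SparkConf().setAppName("ActorReceiverSketch")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // Spark schedules the receiver actor on one of the workers; downstream
    // transformations then run on the blocks it stored.
    val results = ssc.actorStream[String](Props[ResultReceiver], "result-receiver")
    results.count().print()

    ssc.start()
    ssc.awaitTermination()

The business-logic actors would only need to send their results to this receiver, with no extra hop through disk, which is why the approach feels seamless to me.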