Performance of Akka or TCP Socket input sources vs HDFS: Data locality in Spark Streaming

2014-06-10 Thread Nilesh Chakraborty
Hello!

Spark Streaming supports HDFS as an input source, as well as Akka actor
receivers and TCP socket receivers.

For my use case I think it's probably more convenient to read the data
directly from actors, because I already need to set up a multi-node Akka
cluster (on the same nodes that Spark runs on) and write some actors to
perform parallel operations. Writing actor receivers that consume the
results of my business-logic actors and feed them into Spark would be pretty
seamless. Note that the actors generate a large amount of data (a few GBs to
tens of GBs).
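
To make it concrete, something like this is what I have in mind, using the
actor receiver API from Spark 1.x (a rough sketch only; the result type,
actor name and batch interval are just placeholders):

    import akka.actor.{Actor, Props}
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.receiver.ActorHelper

    // Placeholder for whatever the business-logic actors produce.
    case class BusinessLogicResult(payload: String)

    // Receiver actor: everything passed to store() ends up in the DStream.
    class ResultReceiver extends Actor with ActorHelper {
      def receive = {
        case r: BusinessLogicResult => store(r.payload)
      }
    }

    val conf = new SparkConf().setAppName("ActorIngest")
    val ssc = new StreamingContext(conf, Seconds(10))

    // The business-logic actors send their results to this receiver's path.
    val results = ssc.actorStream[String](Props[ResultReceiver], "ResultReceiver")
    results.count().print()

    ssc.start()
    ssc.awaitTermination()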

The other option would be to set up HDFS on the same cluster as Spark, write
the data from the actors to HDFS, and then use HDFS as the input source for
Spark Streaming. Does this result in better performance due to data locality
(with HDFS data replication turned on)? I would think performance should be
almost the same with actors, since Spark workers local to the worker actors
should get the data fast, and I assume some optimization like this is
already done?

I suppose the only benefit of HDFS would be better fault tolerance, and
the ability to checkpoint and recover even if the master fails.
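
For comparison, the HDFS route would look roughly like this (paths below are
made up): the actors write files into a directory, textFileStream picks up
each new file as part of a batch, and checkpointing to HDFS is what allows
recovery after a driver failure:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("HdfsIngest")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Metadata checkpoints go to HDFS so a restarted driver can recover.
    ssc.checkpoint("hdfs://namenode:8020/checkpoints/streaming")

    // Every new file the actors finish writing here becomes part of a batch.
    val lines = ssc.textFileStream("hdfs://namenode:8020/ingest/actor-output")
    lines.count().print()

    ssc.start()
    ssc.awaitTermination()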

Cheers,
Nilesh





Re: Performance of Akka or TCP Socket input sources vs HDFS: Data locality in Spark Streaming

2014-06-10 Thread Michael Cutler
Hey Nilesh,

Great to hear you're using Spark Streaming. In my opinion the crux of your
question comes down to what you want to do with the data in the future,
and/or whether there is utility in using it from more than one
Spark/Streaming job.

1). *One-time use, fire and forget* - as you rightly point out, hooking up
to the Akka actors makes sense if the usefulness of the data is short-lived
and you don't need the ability to readily go back into archived data.

2). *Fault tolerance & multiple uses* - consider using a message queue like
Apache Kafka [1]: write messages from your Akka actors into a Kafka topic
with multiple partitions and replication, then use Spark Streaming job(s)
to read from Kafka (see the sketch after this list).  You can tune Kafka to
keep the last *N* days of data online, so if your Spark Streaming job dies
it can pick up at the point where it left off.

3). *Keep indefinitely* - files in HDFS, 'nuff said.
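
Here is a minimal sketch of option (2) using the spark-streaming-kafka
helper (the ZooKeeper addresses, consumer group id and topic name below are
invented for illustration):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setAppName("KafkaIngest")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Map of topic -> number of receiver threads for that topic.
    val stream = KafkaUtils.createStream(
      ssc, "zk1:2181,zk2:2181", "clickstream-job", Map("clickstream" -> 4))

    // Each record is a (key, value) pair; we only use the value here.
    stream.map(_._2).count().print()

    ssc.start()
    ssc.awaitTermination()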

We're currently using (2) Kafka & (3) HDFS to process around 400M web
clickstream events a week.  Everything is written into Kafka and kept
'online' for 7 days, and also written out to HDFS in compressed
date-sequential files.

We use several Spark Streaming jobs to process the real-time events
straight from Kafka.  Kafka supports multiple consumers, so each job sees
its own view of the message queue and all its events.  If any of the
Streaming jobs dies or is restarted, it continues consuming from Kafka from
the last processed message without affecting any of the other consumer
processes.
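
To illustrate the multiple-consumers point: each job simply passes its own
consumer group id, so its offsets are tracked independently in ZooKeeper
(the group ids below are made up, and in practice each stream would live in
its own Streaming application):

    // Same topic, different group ids: each job keeps its own offsets,
    // so one being restarted never disturbs the other.
    val dashboards = KafkaUtils.createStream(
      ssc, "zk1:2181", "realtime-dashboards", Map("clickstream" -> 4))
    val fraud = KafkaUtils.createStream(
      ssc, "zk1:2181", "fraud-detection", Map("clickstream" -> 4))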

Best,

MC


[1] http://kafka.apache.org/


