Spark and Morphlines, parallelization, multithreading

dgoldenberg Wed, 18 Mar 2015 15:21:11 -0700

Still a Spark noob grappling with the concepts...

I'm trying to grok the idea of integrating something like the Morphlines
pipelining library with Spark (or Spark Streaming). The Kite/Morphlines doc
states that "runtime executes all commands of a given morphline in the same
thread...  there are no queues, no handoffs among threads, no context
switches and no serialization between commands, which minimizes performance
overheads."


Further: "There is no need for a morphline to manage multiple processes,
nodes, or threads because this is already addressed by host systems such as
MapReduce, Flume, Spark or Storm."

My question is, how exactly does Spark manage the parallelization and
multithreading aspects of RDD processing?  As I understand it, each
collection of data is split into partitions and each partition is sent over
to a slave machine to perform computations. So, for each data partition, how
many processes are created? And for each process, how many threads?

Knowing that would help me understand how to structure the following:

    JavaPairInputDStream<String, String> messages =
        KafkaUtils.createDirectStream(
            jssc,
            String.class,
            String.class,
            StringDecoder.class,
            StringDecoder.class,
            kafkaParams,
            topicsSet);

                ....................

    JavaDStream<String> messageBodies = messages.map(
        new Function<Tuple2<String, String>, String>() {
            @Override
            public String call(Tuple2<String, String> tuple2) {
                // Keep only the message value; the Kafka key is dropped.
                return tuple2._2();
            }
        });

Would I want to create a morphline in a 'messages.foreachRDD' block and then
invoke the morphline on each message body?
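
One way to picture the structure being asked about is the plain-Java sketch
below. It rests on an assumption that is not stated in the docs quoted above:
that each partition is handled by one task running on one thread, and that a
worker runs several such tasks concurrently from a thread pool. Under that
assumption, a single-threaded pipeline would be built once per partition and
fed that partition's records in a loop. PartitionSketch, Pipeline,
processPartition and processAll are made-up names for illustration; no Spark
or Morphlines APIs are used, and the pipeline body is a placeholder.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PartitionSketch {

    /** Hypothetical stand-in for a single-threaded pipeline such as a morphline. */
    static final class Pipeline {
        // Remember which thread built the pipeline so misuse is detected.
        private final Thread owner = Thread.currentThread();

        String process(String body) {
            if (Thread.currentThread() != owner) {
                throw new IllegalStateException("pipeline used from another thread");
            }
            // Placeholder transformation; a real pipeline would run its commands here.
            return body.trim().toUpperCase(Locale.ROOT);
        }
    }

    /** Models one task: one partition, one thread, one pipeline instance. */
    static List<String> processPartition(List<String> partition) {
        Pipeline pipeline = new Pipeline(); // built once per partition, not per record
        List<String> out = new ArrayList<>();
        for (String body : partition) {
            out.add(pipeline.process(body));
        }
        return out;
    }

    /** Models a worker that runs up to 'cores' partition tasks at the same time. */
    static List<List<String>> processAll(List<List<String>> partitions, int cores)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (List<String> partition : partitions) {
                futures.add(pool.submit(() -> processPartition(partition)));
            }
            List<List<String>> results = new ArrayList<>();
            for (Future<List<String>> future : futures) {
                results.add(future.get()); // results stay in partition order
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<List<String>> partitions = Arrays.asList(
                Arrays.asList(" a ", "b"),
                Arrays.asList("c"));
        System.out.println(processAll(partitions, 2));
    }
}
```

In this model the parallelism comes entirely from the host running many
partition tasks at once, which is the arrangement the Morphlines quote
describes; the pipeline itself never needs its own threads.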

What will Spark be doing behind the scenes as far as multiple processes and
multiple threads? Should I rely on it to optimize performance with multiple
threads and not worry about plugging in a multi-threaded pipelining engine?

Thanks.



