Still a Spark noob grappling with the concepts... I'm trying to grok the idea of integrating something like the Morphlines pipelining library with Spark (or Spark Streaming). The Kite/Morphlines doc states that the "runtime executes all commands of a given morphline in the same thread... there are no queues, no handoffs among threads, no context switches and no serialization between commands, which minimizes performance overheads."
Further: "There is no need for a morphline to manage multiple processes, nodes, or threads because this is already addressed by host systems such as MapReduce, Flume, Spark or Storm."

My question is: how exactly does Spark manage the parallelization and multithreading aspects of RDD processing? As I understand it, each collection of data is split into partitions, and each partition is sent over to a slave machine to perform computations. So, for each data partition, how many processes are created? And for each process, how many threads?

Knowing that would help me understand how to structure the following:

    JavaPairInputDStream<String, String> messages = KafkaUtils.createDirectStream(
        jssc, String.class, String.class,
        StringDecoder.class, StringDecoder.class,
        kafkaParams, topicsSet);
    ....................
    JavaDStream<String> messageBodies = messages.map(
        new Function<Tuple2<String, String>, String>() {
            @Override
            public String call(Tuple2<String, String> tuple2) {
                return tuple2._2();
            }
        });

Would I want to create a morphline in a 'messages.foreachRDD' block, then invoke the morphline on each messageBody? What will Spark be doing behind the scenes as far as multiple processes and multiple threads? Should I rely on it to optimize performance with multiple threads and not worry about plugging in a multi-threaded pipelining engine?

Thanks.
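For concreteness, here is a minimal sketch of the pattern the question is circling around. It assumes the commonly documented Spark model: one task per partition, with each task executing on a single thread inside an executor JVM that may run several tasks at once. This is plain Java and deliberately does not use Spark or Morphlines. The thread pool stands in for an executor, and `buildPipeline`, `processPartition` and `processAll` are hypothetical names invented for the illustration. In Spark itself the analogous hook would presumably be `mapPartitions` or `foreachPartition`, so that the single-threaded pipeline is built once per partition rather than once per record.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class PerPartitionPipelineSketch {

    // Hypothetical stand-in for a compiled morphline: a chain of commands
    // that runs entirely on the calling thread, with no queues or handoffs.
    static Function<String, String> buildPipeline() {
        Function<String, String> trim = String::trim;
        Function<String, String> upper = String::toUpperCase;
        return trim.andThen(upper);
    }

    // Handles one partition on the calling thread: the pipeline is built
    // once, then applied to every record of that partition in order.
    static List<String> processPartition(List<String> partition) {
        Function<String, String> pipeline = buildPipeline();
        List<String> out = new ArrayList<>();
        for (String record : partition) {
            out.add(pipeline.apply(record));
        }
        return out;
    }

    // Simulates the host system: one task per partition, submitted to a
    // fixed-size thread pool (the assumed analogue of an executor's cores).
    static List<List<String>> processAll(List<List<String>> partitions, int threads)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (List<String> partition : partitions) {
                futures.add(pool.submit(() -> processPartition(partition)));
            }
            // Collect results in partition order, waiting for each task.
            List<List<String>> results = new ArrayList<>();
            for (Future<List<String>> future : futures) {
                results.add(future.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<List<String>> partitions = Arrays.asList(
                Arrays.asList(" a ", "b"),
                Arrays.asList("c ", " d"));
        System.out.println(processAll(partitions, 2)); // prints [[A, B], [C, D]]
    }
}
```

The point of the sketch is that parallelism comes only from the host (the pool here), while each pipeline instance stays single-threaded and is never shared between tasks, which is the division of labour the Morphlines doc describes.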