Hi, I'm looking at a data ingestion implementation that streams data out of Kafka with Spark Streaming and then uses a multi-threaded pipeline engine to process the data in each partition. Have folks looked at ways of speeding up this type of ingestion?
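
To make the setup concrete, here is a minimal sketch of the shape of the job, assuming the Kafka direct stream API (the kafka010 integration). The broker address, topic name, batch interval, and the process() stub are placeholders, not our actual configuration:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

    object DocIngest {
      // Stand-in for the real work: fetch the document and extract its text.
      def process(docRef: String): Unit = ()

      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("doc-ingest")
        val ssc  = new StreamingContext(conf, Seconds(10))  // batch interval is a guess

        val kafkaParams = Map[String, Object](
          "bootstrap.servers"  -> "broker:9092",             // placeholder broker
          "key.deserializer"   -> classOf[StringDeserializer],
          "value.deserializer" -> classOf[StringDeserializer],
          "group.id"           -> "doc-ingest",
          "auto.offset.reset"  -> "latest")

        val stream = KafkaUtils.createDirectStream[String, String](
          ssc, PreferConsistent, Subscribe[String, String](Seq("docs"), kafkaParams))

        stream.foreachRDD { rdd =>
          // One Spark task per Kafka partition; each task walks its partition.
          rdd.foreachPartition { records =>
            records.foreach(rec => process(rec.value))
          }
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }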
Let's say the main part of the ingest process is fetching documents from somewhere and performing text extraction on them. Is this type of processing best done by expressing the pipeline stages as Spark RDD transformations, or by just kicking off a multi-threaded pipeline inside each partition? Or is a multi-threaded pipeliner per partition a decent strategy, with the performance coming from running in clustered mode? Rough sketches of both alternatives follow.
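
Here is a rough sketch of the two alternatives I'm weighing, reusing `stream` from the sketch above. fetch(), extractText(), and store() are hypothetical stand-ins for our real stages, and the pool size and timeout are just illustrative numbers:

    import java.util.concurrent.Executors
    import scala.concurrent.duration._
    import scala.concurrent.{Await, ExecutionContext, Future}

    // Hypothetical stand-ins for the real pipeline stages.
    def fetch(ref: String): Array[Byte] = Array.emptyByteArray
    def extractText(doc: Array[Byte]): String = ""
    def store(text: String): Unit = ()

    // Alternative 1: express the stages as chained RDD transformations.
    // Spark fuses these narrow transformations into a single task per
    // partition, so each record flows through all three stages in turn;
    // parallelism comes from the partition count, not from the stages.
    stream.foreachRDD { rdd =>
      rdd.map(rec => fetch(rec.value))
         .map(doc => extractText(doc))
         .foreach(text => store(text))
    }

    // Alternative 2: a thread pool inside each partition, so slow,
    // I/O-bound fetches overlap instead of running one at a time.
    stream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val pool = Executors.newFixedThreadPool(8)  // pool size is a tuning knob
        implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
        try {
          // toList forces the lazy iterator so every future actually starts.
          val inFlight = records.map { rec =>
            Future(store(extractText(fetch(rec.value))))
          }.toList
          inFlight.foreach(f => Await.result(f, 10.minutes))
        } finally pool.shutdown()
      }
    }

My understanding is that Alternative 1 keeps the code simple and lets Spark's scheduler provide the parallelism (one task per partition), while Alternative 2 only pays off if the fetch is I/O-bound enough that overlapping requests within a partition beats simply raising the partition count. Is that the right way to think about it?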