Hi,

I'm looking at a data ingestion implementation which streams data out of
Kafka with Spark Streaming, then uses a multi-threaded pipeline engine to
process the data in each partition.  Have folks looked at ways of speeding
up this type of ingestion?

Let's say the main part of the ingest process is fetching documents from
somewhere and performing text extraction on them. Is this type of processing
best done by expressing the pipelining with Spark RDD transformations or by
just kicking off a multi-threaded pipeline?

Or, is using a multi-threaded pipeline per partition a decent strategy, with
the performance coming from running in clustered mode?
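To make the second option concrete, here is a minimal sketch of the
per-partition multi-threaded variant I have in mind (plain Python, with a
list standing in for the iterator Spark hands to foreachPartition, and
hypothetical fetch_document/extract_text stubs in place of the real network
fetch and text extraction):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_document(ref):
    # hypothetical stub: in the real job this would be an I/O-bound fetch
    return "<doc %s>" % ref

def extract_text(doc):
    # hypothetical stub: in the real job this would be text extraction
    return doc.strip("<>")

def process_partition(refs, workers=4):
    # Fetch concurrently within one partition so the I/O-bound fetches
    # overlap, then extract text from each fetched document.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        docs = pool.map(fetch_document, refs)
        return [extract_text(d) for d in docs]

# a plain list stands in for one partition's iterator
print(process_partition(["a", "b", "c"]))
```

The alternative would be to express the same two stages as chained RDD
transformations (a map for the fetch, a map for the extraction) and let
Spark's partition-level parallelism do the work instead of an explicit
thread pool.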



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-speed-up-data-ingestion-with-Spark-tp22859.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
