Hi, I'm looking at a data ingestion implementation that streams data out of Kafka with Spark Streaming and then uses a multi-threaded pipeline engine to process the data in each partition. Have folks looked at ways of speeding up this type of ingestion?
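
To make the setup concrete, here is a minimal sketch of the shape of the job, assuming the Kafka direct stream API (the kafka010 integration). The broker address, topic name, batch interval, and the process() stub are placeholders, not our actual configuration:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

    object DocIngest {
      // Stand-in for the real work: fetch the document and extract its text.
      def process(docRef: String): Unit = ()

      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("doc-ingest")
        val ssc  = new StreamingContext(conf, Seconds(10))  // batch interval is a guess

        val kafkaParams = Map[String, Object](
          "bootstrap.servers"  -> "broker:9092",             // placeholder broker
          "key.deserializer"   -> classOf[StringDeserializer],
          "value.deserializer" -> classOf[StringDeserializer],
          "group.id"           -> "doc-ingest",
          "auto.offset.reset"  -> "latest")

        val stream = KafkaUtils.createDirectStream[String, String](
          ssc, PreferConsistent, Subscribe[String, String](Seq("docs"), kafkaParams))

        stream.foreachRDD { rdd =>
          // One Spark task per Kafka partition; each task walks its partition.
          rdd.foreachPartition { records =>
            records.foreach(rec => process(rec.value))
          }
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }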
Let's say the main part of the ingest process is fetching documents from somewhere and performing text extraction on them. Is this type of processing best done by expressing the pipeline stages as Spark RDD transformations, or by just kicking off a multi-threaded pipeline inside each partition? Or is a multi-threaded pipeliner per partition a decent strategy, with the performance coming from running in clustered mode? Rough sketches of both alternatives follow.
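
Here is a rough sketch of the two alternatives I'm weighing, reusing `stream` from the sketch above. fetch(), extractText(), and store() are hypothetical stand-ins for our real stages, and the pool size and timeout are just illustrative numbers:

    import java.util.concurrent.Executors
    import scala.concurrent.duration._
    import scala.concurrent.{Await, ExecutionContext, Future}

    // Hypothetical stand-ins for the real pipeline stages.
    def fetch(ref: String): Array[Byte] = Array.emptyByteArray
    def extractText(doc: Array[Byte]): String = ""
    def store(text: String): Unit = ()

    // Alternative 1: express the stages as chained RDD transformations.
    // Spark fuses these narrow transformations into a single task per
    // partition, so each record flows through all three stages in turn;
    // parallelism comes from the partition count, not from the stages.
    stream.foreachRDD { rdd =>
      rdd.map(rec => fetch(rec.value))
         .map(doc => extractText(doc))
         .foreach(text => store(text))
    }

    // Alternative 2: a thread pool inside each partition, so slow,
    // I/O-bound fetches overlap instead of running one at a time.
    stream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val pool = Executors.newFixedThreadPool(8)  // pool size is a tuning knob
        implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
        try {
          // toList forces the lazy iterator so every future actually starts.
          val inFlight = records.map { rec =>
            Future(store(extractText(fetch(rec.value))))
          }.toList
          inFlight.foreach(f => Await.result(f, 10.minutes))
        } finally pool.shutdown()
      }
    }

My understanding is that Alternative 1 keeps the code simple and lets Spark's scheduler provide the parallelism (one task per partition), while Alternative 2 only pays off if the fetch is I/O-bound enough that overlapping requests within a partition beats simply raising the partition count. Is that the right way to think about it?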