I am assuming the pullSymbolFromYahoo function opens a connection to the Yahoo API with some token passed in. With the code provided so far, if you have 2000 symbols it will open 2000 new connections and make 2000 API calls! Connection objects can't (and shouldn't) be serialized and sent to executors; they should instead be created on the executors.
The philosophy below is nicely documented in the Spark Streaming guide; look at "Design Patterns for using foreachRDD":
https://spark.apache.org/docs/latest/streaming-programming-guide.html#output-operations-on-dstreams

case class Symbol(symbol: String, sector: String)
case class Tick(symbol: String, sector: String, open: Double, close: Double)

// assume symbolRdd is an RDD of Symbol (a Dataset/DataFrame can be converted to an RDD)
symbolRdd.foreachPartition(partition => {
  // this code runs at the executor
  // open one connection here, shared by all records in the partition
  val connectionToYahoo = new HTTPConnection()
  partition.foreach(k =>
    pullSymbolFromYahoo(k.symbol, k.sector, connectionToYahoo)
  )
})

With the above code, if the dataset has 10 partitions (2000 symbols), only 10 connections will be opened even though it still makes 2000 API calls. You should also look at how you send requests and receive results for a large number of symbols: because of the amount of parallelism Spark provides, you might run into rate limits on the APIs. If you are sending symbols in bulk, the above pattern is also very useful.

thanks
Vijay
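Since the parallelism can trip API rate limits, one option is to throttle calls within each partition. A minimal sketch: the Throttle class below is a hypothetical helper (not from the original post), and the commented usage assumes the same pullSymbolFromYahoo / HTTPConnection names as above.

```scala
// Minimal fixed-interval throttle: spaces successive calls at least
// minIntervalMs apart. Hypothetical helper, shown only as a sketch.
class Throttle(minIntervalMs: Long) {
  private var lastCall = 0L
  // Block until at least minIntervalMs has passed since the previous call.
  def acquire(): Unit = synchronized {
    val waitMs = lastCall + minIntervalMs - System.currentTimeMillis()
    if (waitMs > 0) Thread.sleep(waitMs)
    lastCall = System.currentTimeMillis()
  }
}

// Usage inside foreachPartition (sketch, assuming the names from the post):
// symbolRdd.foreachPartition { partition =>
//   val connectionToYahoo = new HTTPConnection() // one connection per partition
//   val throttle = new Throttle(200)             // at most ~5 calls/sec per partition
//   partition.foreach { k =>
//     throttle.acquire()
//     pullSymbolFromYahoo(k.symbol, k.sector, connectionToYahoo)
//   }
// }
```

Note that the limit applies per partition, so with 10 partitions running in parallel the aggregate request rate is roughly 10x the per-partition rate; size minIntervalMs accordingly.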