Thanks Vijay! This is very clear.

On Tue, Feb 20, 2018 at 12:47 AM, vijay.bvp <bvpsa...@gmail.com> wrote:
> I am assuming the pullSymbolFromYahoo function opens a connection to the
> Yahoo API with some token passed. In the code provided so far, if you have
> 2000 symbols it will make 2000 new connections and 2000 API calls!
> Connection objects can't/shouldn't be serialized and sent to executors;
> they should instead be created at the executors.
>
> This philosophy is nicely documented for Spark Streaming; look at
> "Design Patterns for using foreachRDD":
> https://spark.apache.org/docs/latest/streaming-programming-guide.html#output-operations-on-dstreams
>
> case class Symbol(symbol: String, sector: String)
> case class Tick(symbol: String, sector: String, open: Double, close: Double)
>
> // assume symbolRdd is an RDD of Symbol; a Dataset/DataFrame can be
> // converted to an RDD
> symbolRdd.foreachPartition(partition => {
>   // this code runs at the executor
>   // open a connection here
>   val connectionToYahoo = new HTTPConnection()
>
>   partition.foreach(k =>
>     pullSymbolFromYahoo(k.symbol, k.sector, connectionToYahoo)
>   )
> })
>
> With the above code, if the dataset has 10 partitions (2000 symbols), only
> 10 connections will be opened even though it still makes 2000 API calls.
>
> You should also think about how you send requests and receive results for
> a large number of symbols: because of the amount of parallelism Spark
> provides, you might run into rate limits on the API. If the API lets you
> send symbols in bulk, the above pattern is also very useful for that.
>
> thanks
> Vijay
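For anyone following the thread, the connection-per-partition idea can be sketched in plain Scala without a Spark dependency. This is only an illustration of the counting argument above: `FakeConnection` is a hypothetical stand-in for the real HTTP client, and `grouped()` stands in for Spark's RDD partitioning.

```scala
// Plain-Scala sketch of the per-partition connection pattern.
// FakeConnection and the grouped() "partitions" are assumptions for
// illustration, not Spark APIs.
case class Symbol(symbol: String, sector: String)

class FakeConnection {
  // pretend "API call": returns a dummy quote for the symbol
  def fetch(sym: String): Double = sym.length.toDouble
}

object PartitionPatternDemo {
  // returns (connections opened, API calls made)
  def run(numSymbols: Int, numPartitions: Int): (Int, Int) = {
    val symbols = (1 to numSymbols).map(i => Symbol(s"SYM$i", "tech"))
    var connectionsOpened = 0
    var apiCalls = 0
    // one "connection" per partition, reused for every symbol in it
    symbols.grouped(numSymbols / numPartitions).foreach { partition =>
      connectionsOpened += 1
      val conn = new FakeConnection
      partition.foreach { s => apiCalls += 1; conn.fetch(s.symbol) }
    }
    (connectionsOpened, apiCalls)
  }

  def main(args: Array[String]): Unit = {
    val (conns, calls) = run(2000, 10)
    println(s"connections=$conns apiCalls=$calls") // connections=10 apiCalls=2000
  }
}
```

With 2000 symbols in 10 partitions this opens 10 connections while still making 2000 calls, which is exactly the ratio Vijay describes.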