Have a look at the various overloads of PairDStreamFunctions.updateStateByKey (http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.dstream.PairDStreamFunctions). It lets you maintain running state in memory across batches. (You can persist the state to a database or files periodically if you want.) Use an in-memory data structure such as a hash map of SKU-to-price values and update it as needed on each batch. One of the overloads lets you specify a partitioner if you still need to control how the keys are sharded.
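
To make that concrete, here is a rough sketch, assuming the input is a DStream of (productId, (timestamp, price)) pairs named trades; the PriceState class, the helper names, and the prune-by-timestamp logic are only illustrations of the idea, not a definitive implementation. Note that updateStateByKey requires checkpointing to be enabled on the StreamingContext.

import org.apache.spark.HashPartitioner
import org.apache.spark.streaming.dstream.DStream

// (timestamp in ms, price) pairs seen for one product within the window
case class PriceState(recent: Vector[(Long, Double)])

// Update function: append this batch's trades and drop entries older than windowMs.
def updatePrices(windowMs: Long)(
    newTrades: Seq[(Long, Double)],
    state: Option[PriceState]): Option[PriceState] = {
  val now = System.currentTimeMillis()
  val kept = state.map(_.recent).getOrElse(Vector.empty) ++ newTrades
  val pruned = kept.filter { case (ts, _) => now - ts <= windowMs }
  if (pruned.isEmpty) None else Some(PriceState(pruned))
}

// trades: (productId, (timestamp, price)); requires ssc.checkpoint(...) to be set.
def alerts(trades: DStream[(String, (Long, Double))], numPartitions: Int): DStream[(String, Double)] = {
  val state = trades.updateStateByKey(
    updatePrices(180 * 1000L) _,            // keep roughly 180 s of prices per product
    new HashPartitioner(numPartitions))     // the same product id always lands on the same partition

  // Emit (productId, fractional increase) when the latest price is >= 10% above the window minimum.
  state.flatMap { case (sku, PriceState(recent)) =>
    val latest = recent.maxBy(_._1)._2
    val lowest = recent.map(_._2).min
    if (lowest > 0 && latest >= lowest * 1.10) Some(sku -> ((latest - lowest) / lowest)) else None
  }
}

Because the state DStream is partitioned by the supplied partitioner, each batch only shuffles the newly arrived records to their key's partition; the existing per-product state is not reshuffled.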
Also, I would be flexible about the 1-second batch interval. Is that really a mandatory requirement for this problem?

HTH,
dean

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com

On Mon, Aug 10, 2015 at 5:24 AM, sequoiadb <mailing-list-r...@sequoiadb.com> wrote:
> hi guys,
>
> I have a question about Spark Streaming.
> An application keeps sending transaction records into a Spark stream at about 50k tps.
> Each record represents a sale and includes customer id / product id / time / price columns.
>
> The application is required to monitor the change in price for each product. For example, if the price of a product increases 10% within 3 minutes, it sends an alert to the end user.
>
> The batch interval is required to be 1 second, and the window is somewhere between 180 and 300 seconds.
>
> The issue is that I have to compare the price of each transaction (about 10k distinct products in total) against the lowest/highest price for the same product over the past 180 seconds.
>
> That means that every second I have to loop through 50k transactions and compare prices for the same product across the full 180 seconds. So it seems I have to partition the calculation by product id, so that each worker only processes a certain list of products.
>
> For example, if I can make sure the same product id always goes to the same worker, there is no need to shuffle data between workers for each comparison. Otherwise, if each transaction has to be compared against all other RDDs across multiple workers, I guess it may not be fast enough for the requirement.
>
> Does anyone know how to assign the worker node for each transaction record based on its product id, in order to avoid a massive shuffle operation?
>
> If I simply make the product id the key and the price the value, reduceByKeyAndWindow may cause a massive shuffle and slow down the whole throughput. Am I correct?
>
> Thanks
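
On the reduceByKeyAndWindow concern raised above, here is a rough sketch of that variant, assuming a DStream of (productId, price) pairs named prices; the function and variable names are illustrative only. Supplying the same partitioner co-partitions the per-batch reductions by product id, which should keep the repeated windowed reduce from reshuffling already-seen data on every 1-second slide. Min/max has no inverse function, though, so the incremental (invReduceFunc) overload does not apply and the window is still re-reduced each slide, which is another argument for relaxing the 1-second interval if possible.

import org.apache.spark.HashPartitioner
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.DStream

// Track (min, max) price per product over a 180 s window, recomputed every 1 s.
def priceRange(prices: DStream[(String, Double)], numPartitions: Int): DStream[(String, (Double, Double))] =
  prices
    .mapValues(p => (p, p))
    .reduceByKeyAndWindow(
      (a: (Double, Double), b: (Double, Double)) => (math.min(a._1, b._1), math.max(a._2, b._2)),
      Seconds(180),
      Seconds(1),
      new HashPartitioner(numPartitions))   // keep each product id on a fixed partition

// Products whose price rose 10% or more within the window.
def alerts(ranges: DStream[(String, (Double, Double))]): DStream[(String, (Double, Double))] =
  ranges.filter { case (_, (lo, hi)) => lo > 0 && hi >= lo * 1.10 }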