Have a look at the various overloads of
PairDStreamFunctions.updateStateByKey (
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.dstream.PairDStreamFunctions).
It supports updating running state in memory. (You can persist the state to
a database/files periodically if you want.) Use an in-memory data structure
like a hash map with SKU-price key-values. Update the map as needed on each
batch. One of the overloads of this function lets you specify a partitioner
if you still need to control how keys are sharded.
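
Something along these lines, untested, assuming the input is a
DStream[(String, Double)] of (productId, price) pairs (prices,
numPartitions, and PriceRange are just placeholder names):

import org.apache.spark.HashPartitioner
import org.apache.spark.streaming.dstream.DStream

// Running min/max price per product id, held in Spark Streaming's keyed state.
case class PriceRange(min: Double, max: Double)

def trackPriceRange(prices: DStream[(String, Double)],
                    numPartitions: Int): DStream[(String, PriceRange)] = {

  // Invoked once per key per batch: fold this batch's prices into the state.
  val update = (newPrices: Seq[Double], state: Option[PriceRange]) =>
    if (newPrices.isEmpty) state
    else {
      val lo = newPrices.min
      val hi = newPrices.max
      Some(state match {
        case Some(PriceRange(mn, mx)) => PriceRange(math.min(mn, lo), math.max(mx, hi))
        case None                     => PriceRange(lo, hi)
      })
    }

  // The overload that takes a Partitioner keeps each product id on a fixed
  // partition, so state updates for a given key stay on the same worker
  // across batches.
  prices.updateStateByKey(update, new HashPartitioner(numPartitions))
}

Note that stateful operations like updateStateByKey need a checkpoint
directory set on the StreamingContext (ssc.checkpoint(...)), and the state
above is all-time rather than windowed; return None from the update
function when a key should be dropped.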

Also, I would be flexible about the 1-second batch interval. Is that really
a mandatory requirement for this problem?
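
For the 180-second min/max comparison in your question below, one possible
sketch (again untested, same assumed (productId, price) stream and
placeholder names) is reduceByKeyAndWindow with the same partitioner, so a
given product id always lands in the same partition:

import org.apache.spark.HashPartitioner
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.DStream

// (productId, price) -> (productId, (minPrice, maxPrice)) over a sliding
// 180-second window.
def windowedRange(prices: DStream[(String, Double)],
                  numPartitions: Int): DStream[(String, (Double, Double))] =
  prices
    .mapValues(p => (p, p))
    .reduceByKeyAndWindow(
      (a: (Double, Double), b: (Double, Double)) =>
        (math.min(a._1, b._1), math.max(a._2, b._2)),
      Seconds(180),                       // window length
      Seconds(1),                         // slide interval (your 1-second batch)
      new HashPartitioner(numPartitions)) // keep product ids pinned to partitions

Since min and max aren't invertible, the overload that takes an inverse
reduce function doesn't apply, so the window is re-reduced on every
1-second slide; that cost is one more reason to question whether the
1-second interval is really required.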

HTH,
dean


Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
<http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com

On Mon, Aug 10, 2015 at 5:24 AM, sequoiadb <mailing-list-r...@sequoiadb.com>
wrote:

> hi guys,
>
> i have a question about spark streaming.
> There’s an application that keeps sending transaction records into Spark
> Streaming at about 50k TPS.
> Each record represents a sale, with customer id / product id / time /
> price columns.
>
> The application is required to monitor the change in price of each
> product. For example, if the price of a product increases 10% within 3
> minutes, it should send an alert to the end user.
>
> The batch interval is required to be 1 second, and the window is somewhere
> between 180 and 300 seconds.
>
> The issue is that I have to compare the price of each transaction (about
> 10k different products in total) against the lowest/highest price for the
> same product over the entire past 180 seconds.
>
> That means that every second I have to loop through 50k transactions and
> compare each price against the same product’s prices across the full 180
> seconds. So it seems I have to partition the calculation by product id, so
> that each worker only processes a certain list of products.
>
> For example, if I can make sure the same product id always goes to the
> same worker, there is no need to shuffle data between workers for each
> comparison. Otherwise, if each transaction has to be compared against RDDs
> spread across multiple workers, I guess it may not be fast enough for the
> requirement.
>
> Does anyone know how to pin each transaction record to a specific worker
> node based on its product id, in order to avoid a massive shuffle operation?
>
> If I simply make the product id the key and the price the value,
> reduceByKeyAndWindow may cause a massive shuffle and slow down the whole
> throughput. Am I correct?
>
> Thanks
>
