Hi all,

I have a question about Spark Streaming.
An application keeps sending transaction records into a Spark stream at about
50k TPS. Each record represents a sale and includes customer id / product id /
time / price columns.

The application needs to monitor the price change of each product. For example,
if the price of a product increases by 10% within 3 minutes, it should send an
alert to the end user.

The batch interval is required to be 1 second, and the window is somewhere
between 180 and 300 seconds.
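
For context, this is roughly how I set up the context (just a sketch; the app
name and config are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // 1-second batch interval; the 180-second window is applied per operation later
    val conf = new SparkConf().setAppName("price-alert")
    val ssc  = new StreamingContext(conf, Seconds(1))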

The issue is that I have to compare the price of each transaction (about 10k
distinct products in total) against the lowest/highest price of the same
product over the entire past 180 seconds.
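
Here is a rough sketch of the computation I have in mind (the Txn case class,
the txns DStream, and the partition count are placeholders for whatever the
receiver actually produces; I'm keying everything by product id):

    import org.apache.spark.HashPartitioner
    import org.apache.spark.streaming.Seconds

    // hypothetical record type for the customer id / product id / time / price columns
    case class Txn(customerId: String, productId: String, time: Long, price: Double)

    // txns: DStream[Txn] built on the 1-second-batch StreamingContext above
    val partitioner = new HashPartitioner(64)   // partition count is a guess

    // (min, max) price per product over the last 180 seconds, recomputed every second
    val minMaxByProduct = txns
      .map(t => (t.productId, (t.price, t.price)))
      .reduceByKeyAndWindow(
        (a: (Double, Double), b: (Double, Double)) =>
          (math.min(a._1, b._1), math.max(a._2, b._2)),
        Seconds(180),   // window length
        Seconds(1),     // slide interval
        partitioner)

    // compare the latest batch against the windowed minimum and flag >= 10% increases
    val alerts = txns
      .map(t => (t.productId, t.price))
      .join(minMaxByProduct, partitioner)
      .filter { case (_, (price, (minPrice, _))) => price >= minPrice * 1.10 }

    alerts.print()

My worry is what the keyed/windowed steps above cost in shuffles at 50k TPS.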

That means every single second I have to loop through 50k transactions and
compare prices per product across the full 180 seconds. So it seems I have to
partition the computation by product id, so that each worker only processes a
fixed subset of products.

For example, if I can make sure the same product id always goes to the same
worker, there is no need to shuffle data between workers for each comparison.
Otherwise, if each transaction has to be compared against RDDs spread across
multiple workers, I guess it may not be fast enough to meet the requirement.

Does anyone know how to assign transaction records to worker nodes based on
product id, in order to avoid massive shuffle operations?

If I simply make product id the key and price the value, reduceByKeyAndWindow
may cause a massive shuffle and slow down the overall throughput. Am I correct?
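
To be concrete, this is the straightforward version I am worried about (no
explicit partitioner, so I assume the default hash partitioning kicks in on
every 1-second batch):

    // product id as key, price as value, max price over the 180-second window
    val maxByProduct = txns
      .map(t => (t.productId, t.price))
      .reduceByKeyAndWindow((a: Double, b: Double) => math.max(a, b),
        Seconds(180), Seconds(1))

I noticed reduceByKeyAndWindow also has an overload that takes a Partitioner,
but I am not sure whether supplying one actually avoids the per-batch shuffle
or just controls the number of partitions.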

Thanks
