I have already developed a Python script (not using Storm) that transforms a stream of millions of price-history records for different items (provided in one common CSV) and outputs a dedicated, enriched stream for each item in real time. The script aggregates each item's latest price with past data to compute moving averages and slopes over several timeframes (month/week/day/hour), and adds the latest data from the nearest items (neighbours). The goal is to feed price-prediction models. To manage the time-aggregated data and the nearest-neighbour data, I use a shared buffer of the recent data needed for aggregation, the latest computed values for each item, and some shared timestamp indexes.
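To make the "no recompute, delta average" point concrete, here is a minimal, simplified sketch of the kind of buffered O(1) update I mean (names are hypothetical and the real script tracks more than one timeframe; standard library only):

```python
from collections import deque

class RollingAverage:
    """Fixed-window moving average updated in O(1) per price.

    Instead of re-summing the whole window on each new price, we add the
    incoming price and subtract the one that falls out of the window
    (the "delta average" idea mentioned above).
    """

    def __init__(self, window):
        self.window = window
        self.buffer = deque()  # recent prices, kept only for the delta update
        self.total = 0.0

    def push(self, price):
        self.buffer.append(price)
        self.total += price
        if len(self.buffer) > self.window:
            # Drop the oldest price from the running sum: no recomputation.
            self.total -= self.buffer.popleft()
        return self.total / len(self.buffer)  # current moving average
```

The real script keeps one such buffer per item and per timeframe, which is exactly the shared state I am unsure how to place in a Storm topology.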
I am wondering whether I would really benefit from moving this script to Storm, and how. My first understanding of Storm is that I should:

- create a dedicated spout class to fetch the price data;
- create a dedicated bolt class to aggregate the data (moving averages / slopes / cross-aggregated data between items).

Where should I put the shared buffers and the data required to efficiently compute my time-aggregated and nearest-neighbour data? Will the topology hurt performance compared to in-memory data management? My current script, even though it is in Python, benefits greatly from efficient buffered computation (no recomputation, delta-updated averages, ...), minimal data manipulation, and minimal memory access and computation. Thank you for your advice. Xavier
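To make the question concrete, here is a plain-Python stand-in (no Storm dependency; class and field names are hypothetical) for where I imagine the per-item state would live, assuming the topology used a fields grouping on the item id so that all tuples for a given item reach the same bolt instance:

```python
class AggregatorBolt:
    """Plain-Python stand-in for a Storm aggregation bolt.

    Assumption: a fields grouping on item_id routes every tuple for a
    given item to the same bolt instance, so each instance can keep its
    own in-memory buffers with no sharing across processes.
    """

    def __init__(self):
        # Per-item state local to this bolt instance; a running count and
        # sum stand in for the real aggregation buffers and indexes.
        self.state = {}

    def process(self, item_id, timestamp, price):
        s = self.state.setdefault(item_id, {"count": 0, "total": 0.0})
        s["count"] += 1
        s["total"] += price
        s["last"] = (timestamp, price)
        # In a real Storm bolt this would be an emit(); here we return
        # the enriched value (item id, running average) directly.
        return item_id, s["total"] / s["count"]
```

What is unclear to me is the cross-item (nearest-neighbour) part: with this partitioning, neighbours of an item may land in different bolt instances, so their latest values would no longer be in shared memory.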