I have already developed a Python script (not using Storm) that transforms a stream of millions of price-history records for different items (provided in one common CSV) and outputs a dedicated, enriched stream for each item in real time. The script aggregates each item's latest price with past data to compute moving averages and slopes over several timeframes (month/week/day/hour), and adds the latest data from the nearest items (neighbours). The goal is to feed price-prediction models. To manage the time-aggregated data and the nearest-neighbour data, I use a shared buffer of the recent data needed for aggregation, the latest computed values for each item, and some shared timestamp indexes.
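To make the "no recompute, delta average" point concrete, here is a minimal, simplified sketch of the kind of buffered O(1) update I mean (names are hypothetical and the real script tracks more than one timeframe; standard library only):

```python
from collections import deque

class RollingAverage:
    """Fixed-window moving average updated in O(1) per price.

    Instead of re-summing the whole window on each new price, we add the
    incoming price and subtract the one that falls out of the window
    (the "delta average" idea mentioned above).
    """

    def __init__(self, window):
        self.window = window
        self.buffer = deque()  # recent prices, kept only for the delta update
        self.total = 0.0

    def push(self, price):
        self.buffer.append(price)
        self.total += price
        if len(self.buffer) > self.window:
            # Drop the oldest price from the running sum: no recomputation.
            self.total -= self.buffer.popleft()
        return self.total / len(self.buffer)  # current moving average
```

The real script keeps one such buffer per item and per timeframe, which is exactly the shared state I am unsure how to place in a Storm topology.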
I am wondering whether I would really benefit from moving this script to Storm, and how. My first understanding of Storm is that I should:

- create a dedicated spout class to fetch the price data;
- create a dedicated bolt class to aggregate the data (moving averages / slopes / cross-aggregated data between items).

Where should I put the shared buffers and the data required to efficiently compute my time-aggregated and nearest-neighbour data? Will the topology hurt performance compared to in-memory data management? My current script, even though it is in Python, benefits greatly from efficient buffered computation (no recomputation, delta-updated averages, ...), minimal data manipulation, and minimal memory access and computation. Thank you for your advice. Xavier
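To make the question concrete, here is a plain-Python stand-in (no Storm dependency; class and field names are hypothetical) for where I imagine the per-item state would live, assuming the topology used a fields grouping on the item id so that all tuples for a given item reach the same bolt instance:

```python
class AggregatorBolt:
    """Plain-Python stand-in for a Storm aggregation bolt.

    Assumption: a fields grouping on item_id routes every tuple for a
    given item to the same bolt instance, so each instance can keep its
    own in-memory buffers with no sharing across processes.
    """

    def __init__(self):
        # Per-item state local to this bolt instance; a running count and
        # sum stand in for the real aggregation buffers and indexes.
        self.state = {}

    def process(self, item_id, timestamp, price):
        s = self.state.setdefault(item_id, {"count": 0, "total": 0.0})
        s["count"] += 1
        s["total"] += price
        s["last"] = (timestamp, price)
        # In a real Storm bolt this would be an emit(); here we return
        # the enriched value (item id, running average) directly.
        return item_id, s["total"] / s["count"]
```

What is unclear to me is the cross-item (nearest-neighbour) part: with this partitioning, neighbours of an item may land in different bolt instances, so their latest values would no longer be in shared memory.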