We have a table that stores 25 features per item ID, along with a weight for
each feature. A correlation matrix can be constructed for every pair of
features based on co-occurrence. When a user inputs a feature, the features
correlated with it can be found with a self-join, which requires a single full
table scan. On large data this results in high latency (10+ seconds) because
of the I/O involved in the full table scan. My idea is to load the data into
an RDD and apply transformations and actions to find the correlated features
for each query.
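
To make the per-query lookup concrete, here is a rough sketch in plain Python
rather than Spark. The rows, feature names, and function names are all made up
for illustration, and it counts raw co-occurrences (ignoring the weights)
rather than computing a true correlation. The point is only that once the rows
are grouped in memory, a query touches just the items that contain the
requested feature instead of scanning the whole table.

```python
from collections import Counter, defaultdict

# Hypothetical rows: (item_id, feature, weight).
# The real table has 25 features per item; this sample is tiny on purpose.
ROWS = [
    (1, "red", 0.9), (1, "round", 0.4), (1, "sweet", 0.7),
    (2, "red", 0.8), (2, "sweet", 0.6),
    (3, "round", 0.5), (3, "green", 0.3),
]


def build_indexes(rows):
    """Group the rows once so that later queries never rescan the table."""
    items_by_feature = defaultdict(set)
    features_by_item = defaultdict(set)
    for item_id, feature, _weight in rows:
        items_by_feature[feature].add(item_id)
        features_by_item[item_id].add(feature)
    return items_by_feature, features_by_item


def co_occurring(feature, items_by_feature, features_by_item):
    """Count how often each other feature shares an item with `feature`."""
    counts = Counter()
    for item_id in items_by_feature.get(feature, ()):
        for other in features_by_item[item_id]:
            if other != feature:
                counts[other] += 1
    return counts


items_by_feature, features_by_item = build_indexes(ROWS)
result = co_occurring("red", items_by_feature, features_by_item)
```

In Spark I imagine the two indexes would correspond to grouped, cached RDDs,
but that mapping is my assumption, not something I have tested.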
I'm pretty sure Spark can do this sort of thing. Since I'm new, what I'm not
sure about is whether Spark is appropriate as a server application. For
instance, the driver application would have to load the RDD, then listen for
requests and return results, perhaps over a socket. Are there any libraries
that facilitate this sort of Spark server app? I understand how Spark can be
used to grab data, run algorithms, and put results back, but is it appropriate
as the engine of a server app, and what are the general patterns involved?
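
For illustration, the shape I have in mind is roughly the following, sketched
with Python's standard-library HTTP server instead of a raw socket. Everything
here is hypothetical: the `CO_OCCURRENCE` dict stands in for data held in a
cached RDD, and the `/correlated?feature=...` URL and `make_server` helper are
names I invented. It only shows the pattern of a long-running process that
loads data once and then answers many requests.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

# Hypothetical result set, loaded once at startup. In a Spark driver this
# would presumably be a cached RDD queried per request instead of a dict.
CO_OCCURRENCE = {
    "red": {"sweet": 2, "round": 1},
    "green": {"round": 1},
}


class QueryHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Expect requests such as /correlated?feature=red
        params = parse_qs(urlparse(self.path).query)
        feature = params.get("feature", [""])[0]
        # Unknown features return an empty JSON object.
        body = json.dumps(CO_OCCURRENCE.get(feature, {})).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # suppress per-request logging to keep the sketch quiet


def make_server(port=0):
    """Create the server; port 0 asks the OS for any free port."""
    return HTTPServer(("127.0.0.1", port), QueryHandler)
```

Calling `make_server(8080).serve_forever()` would keep the process alive and
answer queries until it is stopped, which is the role I am asking whether a
Spark driver can sensibly play.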