Hey guys, first of all, nice job with Spark ;)
I want to use Spark in the following setting and I am not completely sure what the best architecture would be, which is why I would like to ask for your opinion.

Job:
- read objects from an input stream; the input is a set of ids
- map the input ids to new ids, using real-time counts for the input ids from a counting store (e.g., a key-value store)
- increment the counts of the new ids and decrement the counts of the old ids in the counting store

Now I am not completely sure which kind of store I should use. The stores are too big to be serialized to all workers; rather, I'd like to have them somewhere centralized and accessible from each worker. I think an easy solution would be to use an (in-memory) database (e.g., memcached) and open a connection in each mapper. But I am not sure how to do that, because inside the map function a connection would be established for each incoming object, which would be a big overhead.

So my question is: how can I set up a connection to a db in a mapper without having to create it for each incoming object? Or is there an even simpler solution to my job description?

I appreciate your help.

Best,
Dirk
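To make concrete what I have in mind, here is a rough sketch in plain Python (no Spark) of the per-partition pattern I am hoping for: one connection opened per partition of ids instead of one per id. The `CountStore` class is a made-up stand-in for a central store like memcached, and the remap rule (append the current count to the id) is just an illustration, not my real mapping logic.

```python
class CountStore:
    """Hypothetical key-value counting store (stand-in for memcached)."""

    def __init__(self):
        self.counts = {}
        self.connections_opened = 0

    def connect(self):
        # In reality this would open a network connection to the store.
        self.connections_opened += 1
        return self

    def get(self, key):
        return self.counts.get(key, 0)

    def incr(self, key):
        self.counts[key] = self.counts.get(key, 0) + 1

    def decr(self, key):
        self.counts[key] = self.counts.get(key, 0) - 1


STORE = CountStore()  # pretend this lives on a central server


def remap_partition(ids):
    """Process one whole partition of ids with a single connection,
    the way I imagine something like Spark's mapPartitions calling it."""
    conn = STORE.connect()  # one connection per partition, not per id
    for old_id in ids:
        # Made-up remap rule: new id = old id plus its current count.
        new_id = f"{old_id}:{conn.get(old_id)}"
        conn.incr(new_id)   # increment count of the new id
        conn.decr(old_id)   # decrement count of the old id
        yield new_id


# In Spark this would correspond to rdd.mapPartitions(remap_partition);
# here I just drive it by hand over one partition:
result = list(remap_partition(["a", "b", "a"]))
```

The point is that `connect()` runs once per partition, so the per-object overhead I am worried about disappears; I just don't know whether this is the idiomatic way to express it in Spark.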