Hey guys,

First of all, nice job with Spark ;)

I want to use Spark in the following setting, and I am not completely sure
what the best architecture would be, so I would like to ask for your
opinion.

Job:

- read objects from an input stream
- the input is a set of ids
- map the input ids to new ids, using real-time counts for the input ids
taken from a counting store (e.g., a key-value store)
- increment the counts of the new ids and decrement the counts of the old
ids in the counting store
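To make the per-object logic concrete, here is a minimal sketch; a plain Python dict stands in for the counting store (in reality it would be an external store), and `remap` is a hypothetical function that picks the new id based on the current counts:

```python
def remap_and_count(object_ids, counts, remap):
    """Map input ids to new ids and update the counting store.

    counts: dict acting as the counting store (id -> count)
    remap:  hypothetical function choosing the new id for an input id,
            based on the current real-time counts
    """
    new_ids = set()
    for old_id in object_ids:
        new_id = remap(old_id, counts)                # real-time decision
        counts[new_id] = counts.get(new_id, 0) + 1    # increment new id
        counts[old_id] = counts.get(old_id, 0) - 1    # decrement old id
        new_ids.add(new_id)
    return new_ids
```
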


Now I am not completely sure which kind of store I should use. The stores
are too large to be serialized to every worker; I would rather keep them in
one central place that every worker can access. I think an easy solution
would be an (in-memory) database (e.g., memcached) with a connection opened
in each mapper. But I am not sure how to do that, because inside the map
function a connection would be established for each incoming object, which
would be a big overhead. So my question is: how can I set up a connection
to a db in a mapper without having to create it for each incoming object?
Or is there an even simpler solution for this job?
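For illustration, the pattern I am hoping for would look roughly like this: one connection for a whole batch of records instead of one per record. The client class below is a fake stand-in for a real memcached client (it only counts how often it is opened), and the batch function is a sketch of what I imagine would run once per worker or partition:

```python
connections_opened = 0

class FakeStoreClient:
    """Hypothetical stand-in for a real memcached client.

    It only records how many connections get opened, so the overhead
    of per-record vs. per-batch connections can be demonstrated.
    """
    def __init__(self, address):
        global connections_opened
        connections_opened += 1
        self.address = address

    def close(self):
        pass

def process_batch(records):
    """Open one connection for the whole batch and reuse it per record."""
    conn = FakeStoreClient("counting-store:11211")  # once per batch
    try:
        for record in records:
            yield record  # the real per-object logic would use conn here
    finally:
        conn.close()

# 100 records, but only a single connection is opened:
out = list(process_batch(range(100)))
```
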

I appreciate your help.

Best,
Dirk
