Beam Java users,

I've run into a few cases where I want to present a single thread-safe data
structure to all threads on a worker. Each time, I end up writing a good bit
of custom code: a synchronized method that creates the resource exactly
once, after which each thread holds its own reference to the singleton. I
don't have extensive experience with thread safety in Java, so it seems
likely I'm going to get this wrong.

Are there any best practices for state that is shared across threads? Any
prior art I can read up on?

The most concrete case I have in mind is loading a GeoIP database for doing
city lookups from IP addresses. We're using MaxMind's API, which allows
mapping a portion of memory to a file sitting on disk. We have a
synchronized method that checks whether the reader has been initialized [0];
if not, we copy the database file from GCS to local disk, build the
DatabaseReader instance, and return it. Other threads will see the
already-initialized reader and just get a reference to it instead.
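
To make the pattern concrete, here is a minimal sketch of the kind of
synchronized lazy initialization described above. It is not the code at [0]:
the class name SharedResource and the generic Supplier are hypothetical
stand-ins for the GCS copy and the DatabaseReader construction.

```java
import java.util.function.Supplier;

/** Lazily builds one shared instance per JVM and hands it to every thread. */
final class SharedResource<T> {
  // Hypothetical stand-in for the expensive setup (download file, build reader).
  private final Supplier<T> loader;
  // Guarded by "this"; only read or written inside the synchronized method.
  private T instance;

  SharedResource(Supplier<T> loader) {
    this.loader = loader;
  }

  /** The first caller runs the loader; later callers get the same instance. */
  synchronized T get() {
    if (instance == null) {
      // Assumes the loader never returns null; otherwise it would run again.
      instance = loader.get();
    }
    return instance;
  }
}
```

In this sketch, the holder would presumably live in a static field so that
every thread in the worker JVM goes through the same get() call, and the
returned object itself still has to be safe for concurrent use.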

This all appears to work, and it saves memory compared to each thread
maintaining its own DatabaseReader. But is there a safer or more built-in
way to do this? Am I missing relevant hooks in the Beam API that would make
this cleaner?

[0]
https://github.com/mozilla/gcp-ingestion/blob/master/ingestion-beam/src/main/java/com/mozilla/telemetry/decoder/GeoCityLookup.java#L95
