Hello everyone! I am working on a task in which every cluster node executing a Spark job needs access to a large external file. The file is the MaxMind GeoIP database, about 15 megabytes in size, and MaxMind's library reads from it constantly with random access. Of course, the file could simply be stored in HDFS, but random-access reads against HDFS would be quite inefficient.
Hadoop MapReduce has the DistributedCache mechanism dedicated to exactly this purpose: you specify files in HDFS that will be required during job execution, and they are copied to the worker nodes before the job starts, so tasks can efficiently access the local copies. I haven't found a simple and effective way of doing the same thing in Spark. Is there a preferred way to do so?

--
Best regards,
Konstantin Abakumov
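For reference, the Hadoop MapReduce pattern I have in mind looks roughly like this. It is only a sketch of the API, not a runnable program (it needs a Hadoop cluster), and the HDFS path and job name are made up for illustration:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class GeoIpJobSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "geoip-lookup");

        // Register the file with the distributed cache: Hadoop copies it
        // to every worker node before any task of this job starts.
        // The path is hypothetical.
        job.addCacheFile(new URI("hdfs:///data/GeoIP.dat"));

        // Inside a task, the cached file is then available on the local
        // filesystem (linked into the task's working directory), e.g.:
        //   File local = new File("GeoIP.dat");
        // so a library doing random-access reads pays only local I/O cost.
    }
}
```

I am looking for an equivalently simple mechanism in Spark to ship such a file to every executor once and open the local copy from tasks.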
