Hi Theo,

We had a very similar problem with one of our Spark Streaming jobs. The best
solution was to create a custom source that holds all external records in a
cache, periodically re-reads the external data, and compares it to the cache.
All changed records were then broadcast to the task managers. We also tried
loading the data in the background in a separate thread, but that was more
complicated: we needed to build a shadow copy of the cache and then quickly
switch the two. With Spark Streaming there were additional problems as well.
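
As a rough illustration of the compare-against-cache step, here is a minimal
Python sketch. It is not our actual job code: the class name CachedSource,
the fetch_external callback, and the dict-of-records shape are all made up
for the example, and it leaves out the broadcast itself and any handling of
deleted records.

```python
from typing import Any, Callable, Dict, Hashable


class CachedSource:
    """Holds a snapshot of the external records and reports what changed.

    Hypothetical helper for illustration only; it is not part of Spark
    or Flink.
    """

    def __init__(self, fetch_external: Callable[[], Dict[Hashable, Any]]):
        # fetch_external is an assumed callback that returns the full
        # external data set as {key: record}.
        self._fetch_external = fetch_external
        self._cache: Dict[Hashable, Any] = {}

    def poll(self) -> Dict[Hashable, Any]:
        """Re-read the external data and return only new or modified records."""
        fresh = self._fetch_external()
        changed = {
            key: record
            for key, record in fresh.items()
            if key not in self._cache or self._cache[key] != record
        }
        # Replace the old snapshot with a single reference assignment.
        # Records that disappeared upstream are dropped from the cache but
        # are not reported in this sketch.
        self._cache = fresh
        return changed


# Usage: poll on a timer and hand the result to whatever does the broadcast.
external = {"a": 1, "b": 2}
source = CachedSource(lambda: dict(external))
first = source.poll()    # first poll: every record counts as changed
external["b"] = 3
second = source.poll()   # later polls: only the modified records
```

The caller would run poll() on whatever schedule suits the job and forward
the returned records; how they get distributed depends on the framework.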

Hope this helps,
Maxim.
