Github user sameeragarwal commented on the issue:
https://github.com/apache/spark/pull/13419
I ended up creating a small design doc describing the problem and
presenting 2 possible solutions at
https://docs.google.com/document/d/1h5SzfC5UsvIrRpeLNDKSMKrKJvohkkccFlXo-GBAwQQ/edit?ts=574
Github user sameeragarwal commented on the issue:
https://github.com/apache/spark/pull/13419
@tejasapatil if the nodes where the data was cached go down, the
CacheManager should still consider that data as cached. In that case, the next
time the data is accessed, the underlying RDD will be recomputed.
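To illustrate the behavior described above, here is a minimal sketch (not part of this PR; it assumes Spark 2.x with a running cluster and the standard Dataset API) showing that a cached Dataset whose blocks are lost is simply recomputed from its lineage on the next access:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("cache-recompute-sketch").getOrCreate()

// Cache a Dataset and materialize the cache across the cluster.
val ds = spark.range(0L, 1000000L).persist(StorageLevel.MEMORY_ONLY)
ds.count()

// If executors holding cached blocks die at this point, the plan is still
// registered with the CacheManager. The next action recomputes the lost
// partitions from the underlying logical plan instead of failing.
ds.count()
```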
Github user tejasapatil commented on the issue:
https://github.com/apache/spark/pull/13419
I guess that the caching is done over multiple nodes. If the data for a
dataset is physically updated and some of the nodes where the data was cached
go down, would the existing `cached` dataset still be valid?