Each time you run a Spark SQL query, Spark will create new RDDs that load the
data, so you should see the newest results. There is one caveat:
formats that use the native Data Source API (Parquet, ORC (in Spark 1.4),
JSON (in Spark 1.5)) cache file metadata to speed up interactive querying.
To
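One way to force fresh metadata for such a table is a sketch like the following, assuming a HiveContext in Spark 1.x (the table name `logs` is a placeholder):

```scala
// Sketch: invalidate cached file metadata for a Data Source table
// so the next query re-scans the underlying files.
// `logs` is a placeholder table name.
hiveContext.refreshTable("logs")
```

After the refresh, the next query against the table will pick up files added or removed since the metadata was first cached.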
Hello,
Since the RDDs are created from data in Hive tables or HDFS, how do we ensure
they are invalidated when the source data is updated?
Regards,
Ashish
There is no mechanism for keeping an RDD up to date with a changing source.
However, you could set up a stream that watches for changes to the directory and
processes the new files, or use the Hive integration in Spark SQL to run Hive
queries directly. (However, old query results will still grow
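As a rough sketch of the directory-watching approach, using Spark Streaming's file stream (the path and batch interval below are illustrative, not from the thread):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Minimal sketch: watch an HDFS directory and process files as they appear.
// The directory path and the 30-second batch interval are placeholders.
val conf = new SparkConf().setAppName("DirectoryWatcher")
val ssc  = new StreamingContext(conf, Seconds(30))

// textFileStream picks up files newly moved or written into the directory;
// files must be written atomically (e.g. moved in) to be seen exactly once.
val newData = ssc.textFileStream("hdfs:///data/incoming")

newData.foreachRDD { rdd =>
  // Re-run whatever computation depends on the fresh data here.
  println(s"Saw ${rdd.count()} new lines")
}

ssc.start()
ssc.awaitTermination()
```

This keeps processing incremental: each batch only contains files that arrived during that interval, rather than re-reading the whole directory.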