Re: RDD staleness

2015-05-31 Thread Michael Armbrust
Each time you run a Spark SQL query, we create new RDDs that load the data, so you should see the newest results. There is one caveat: formats that use the native Data Source API (Parquet, ORC (in Spark 1.4), JSON (in Spark 1.5)) cache file metadata to speed up interactive querying. To …
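A minimal sketch of working around the metadata cache, using the `refreshTable` call available on `HiveContext` in Spark 1.x. The table name `events` is illustrative, and this assumes an existing `SparkContext` (`sc`) and a table backed by one of the cached formats:

```scala
// Sketch: forcing Spark SQL to drop cached file metadata for a table.
// Assumes a running Spark 1.x application with a SparkContext `sc`
// and a Parquet-backed table named "events" (name is hypothetical).
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// Re-reads the table's file listing and footer metadata, so the next
// query sees files added or replaced since the metadata was cached.
hiveContext.refreshTable("events")

// Subsequent queries build fresh RDDs over the refreshed file set.
val latest = hiveContext.sql("SELECT COUNT(*) FROM events")
```

Without the refresh, a long-lived context can keep serving queries against a stale file listing even though each query builds new RDDs.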

RDD staleness

2015-05-31 Thread Ashish Mukherjee
Hello, since RDDs are created from data in Hive tables or HDFS, how do we ensure they are invalidated when the source data is updated? Regards, Ashish

Re: RDD staleness

2015-05-31 Thread DW @ Gmail
There is no mechanism for keeping an RDD up to date with a changing source. However, you could set up a stream that watches for changes to the directory and processes the new files, or use the Hive integration in Spark SQL to run Hive queries directly. (However, old query results will still grow …
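The directory-watching approach above can be sketched with Spark Streaming's `textFileStream`, which picks up files newly written into a directory each batch. The path and batch interval here are illustrative, and a `SparkContext` (`sc`) is assumed:

```scala
// Sketch: watching an HDFS directory for new files with Spark Streaming,
// as suggested above. Assumes an existing SparkContext `sc`; the path
// and 30-second batch interval are hypothetical choices.
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(30))

// Each batch, this DStream contains only files that appeared in the
// directory since the last batch -- existing files are not re-read.
val newData = ssc.textFileStream("hdfs:///data/incoming")

newData.foreachRDD { rdd =>
  // Process the newly arrived records here; this sketch just counts them.
  println(s"new records this batch: ${rdd.count()}")
}

ssc.start()
ssc.awaitTermination()
```

Note that this handles newly added files only; files modified or deleted in place are not detected, which is part of why previously computed results can still go stale.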