I've been under the impression that creating and registering a parquet table means queries against it will pick up updates to the underlying data, such as inserts. I have a program running that does the following:
// Create context
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

// Register table
sqlContext
  .parquetFile("hdfs://somewhere/users/sql/")
  .registerAsTable("mytable")

This program runs continuously. Over time, queries get fired off to that sqlContext:

// Query the registered table, collect and return
sqlContext.sql(query)
  .collect()

Then, elsewhere, I have processes which insert data into that same table, like so:

// Create context
val ssc = new StreamingContext(conf, Seconds(3600))
val sqlContext = new SQLContext(ssc.sparkContext)

// Register table
sqlContext
  .createParquetFile[Row]("hdfs://somewhere/users/sql/")
  .registerAsTable("mytable")

// Insert into (rdd exists and is filled with type Row)
sqlContext
  .createSchemaRDD[Row](rdd)
  .coalesce(1)
  .insertInto("mytable")

In a local test, the first program does see the changes the second program makes. But when deployed against real data, outside of that local test case, the registered table "mytable" doesn't get updated. If I kill the query program and restart it, it picks up the current state of "mytable".

Thoughts?
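One idea I've been toying with, given the restart behaviour: it looks as though the SchemaRDD captures the parquet file listing at registration time, so re-registering the table might rebind the name to a fresh read of the directory. A minimal, untested sketch of that workaround, assuming that calling registerAsTable again under the same name simply replaces the previous catalog entry:

// Untested sketch: reload the parquet directory and re-register it
// under the same name, assuming the new registration replaces the
// old catalog entry rather than erroring out.
def refreshMyTable(sqlContext: SQLContext): Unit = {
  sqlContext
    .parquetFile("hdfs://somewhere/users/sql/")
    .registerAsTable("mytable")
}

// Before serving a query:
refreshMyTable(sqlContext)
val result = sqlContext.sql(query).collect()

That presumably re-reads the parquet metadata on every refresh, trading some query latency for freshness, but I haven't verified it outside the local test.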