Actually, we have a POC project which shows the power of combining Spark Streaming and Catalyst: it lets you run SQL on top of Spark Streaming and get a SchemaDStream back. You can take a look at it here: https://github.com/thunderain-project/StreamSQL
Thanks
Jerry

From: Tathagata Das [mailto:tathagata.das1...@gmail.com]
Sent: Friday, July 11, 2014 10:17 AM
To: user@spark.apache.org
Subject: Re: Some question about SQL and streaming

Yeah, the right solution is to have something like SchemaDStream, where the schema of all the SchemaRDDs generated by it can be stored. Something I really would like to see happen in the future :)

TD

On Thu, Jul 10, 2014 at 6:37 PM, Tobias Pfeiffer <t...@preferred.jp> wrote:

Hi,

I think it would be great if we could do the string parsing only once and then just apply the transformation for each interval (reducing the processing overhead for short intervals).

Also, one issue with the approach above is that transform() has the following signature:

    def transform(transformFunc: RDD[T] => RDD[U]): DStream[U]

and therefore, in my example

    val result = lines.transform((rdd, time) => {
      // execute statement
      rdd.registerAsTable("data")
      sqlc.sql(query)
    })

the variable `result` is of type DStream[Row]. That is, the meta-information from the SchemaRDD is lost and, from what I understand, there is then no way to learn about the column names of the returned data, as this information is only encoded in the SchemaRDD. I would love to see a fix for this.

Thanks
Tobias
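
For concreteness, below is a minimal, self-contained sketch of the pattern discussed above, written against the Spark 1.0-era API referenced in the thread (SQLContext, SchemaRDD, registerAsTable). The Record case class, the socket source, and the query string are assumptions chosen only for illustration; the key point is that sqlc.sql(query) is invoked (and the SQL string re-parsed) on every interval, and the result comes back through transform() as a plain DStream[Row], so the column names carried by the SchemaRDD are no longer available.

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Hypothetical record type; the real schema depends on the input data.
    case class Record(word: String, frequency: Int)

    object StreamingSqlSketch {
      def main(args: Array[String]) {
        val conf = new SparkConf().setAppName("StreamingSqlSketch").setMaster("local[2]")
        val ssc = new StreamingContext(conf, Seconds(5))
        val sqlc = new SQLContext(ssc.sparkContext)
        import sqlc.createSchemaRDD  // implicit conversion RDD[Record] -> SchemaRDD

        // The statement is fixed, but it is parsed again on every interval below.
        val query = "SELECT word, frequency FROM data WHERE frequency > 1"

        // Assumed input format: "word,frequency" lines arriving over a socket.
        val lines = ssc.socketTextStream("localhost", 9999)
          .map(_.split(","))
          .map(a => Record(a(0), a(1).toInt))

        // The pattern from the thread: register each interval's RDD as a table
        // and run the SQL statement on it.  sqlc.sql(...) returns a SchemaRDD,
        // but transform() only sees an RDD[Row], so `result` is a plain
        // DStream[Row] and the column names are no longer visible downstream.
        val result = lines.transform { rdd =>
          rdd.registerAsTable("data")
          sqlc.sql(query)
        }

        result.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }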
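
And a very small sketch of the SchemaDStream idea mentioned above: keep the schema next to the stream so that the column names survive transform(). This is not the StreamSQL project's actual API, just one possible shape, with the column names supplied by the caller:

    import org.apache.spark.sql.Row
    import org.apache.spark.streaming.dstream.DStream

    // Illustration only (not the StreamSQL project's actual API): pair the
    // per-interval rows with the column names that every interval's SchemaRDD
    // shares, so downstream code can still look fields up by name.
    case class SchemaStream(columnNames: Seq[String], rows: DStream[Row]) {
      def columnIndex(name: String): Int = columnNames.indexOf(name)
    }

With the sketch above one would build SchemaStream(Seq("word", "frequency"), result) and read a field per row as row(stream.columnIndex("word")); a real SchemaDStream would of course derive the names from the query itself rather than take them from the caller.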