Actually, we have a POC project which shows the power of combining Spark 
Streaming and Catalyst: it can run SQL on top of Spark Streaming and produce a 
SchemaDStream. You can take a look at it here: 
https://github.com/thunderain-project/StreamSQL

Thanks
Jerry

From: Tathagata Das [mailto:tathagata.das1...@gmail.com]
Sent: Friday, July 11, 2014 10:17 AM
To: user@spark.apache.org
Subject: Re: Some question about SQL and streaming

Yeah, the right solution is to have something like SchemaDStream, where the 
schema of all the SchemaRDDs generated by it can be stored. Something I would 
really like to see happen in the future :)
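(To make that concrete, here is a purely hypothetical sketch of what such a 
wrapper could look like. SchemaDStream does not exist in Spark today, and all 
names below are made up.)

import org.apache.spark.sql.Row
import org.apache.spark.streaming.dstream.DStream

// Hypothetical sketch only, not an existing Spark or StreamSQL API:
// keep the column names, captured once, next to the per-batch rows so that
// downstream code can still interpret each Row by name.
case class SchemaDStream(columnNames: Seq[String], rows: DStream[Row])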

TD

On Thu, Jul 10, 2014 at 6:37 PM, Tobias Pfeiffer <t...@preferred.jp> wrote:
Hi,

I think it would be great if we could parse the SQL query string only once and 
then just apply the resulting transformation for each interval (reducing the 
processing overhead for short intervals).

Also, one issue with the approach above is that transform() has the following 
signature:

  def transform[U](transformFunc: (RDD[T], Time) => RDD[U]): DStream[U]

and therefore, in my example

val result = lines.transform((rdd, time) => {
  // assumes an implicit conversion (e.g., sqlc.createSchemaRDD for an RDD of
  // case classes) is in scope, so that registerAsTable is available on rdd
  rdd.registerAsTable("data")
  // execute statement against the table registered for this batch
  sqlc.sql(query)
})

the variable `result` is of type DStream[Row]. That is, the meta-information 
from the SchemaRDD is lost and, from what I understand, there is then no way to 
learn about the column names of the returned data, as this information is only 
encoded in the SchemaRDD. I would love to see a fix for this.
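
For what it's worth, one partial workaround might be to use foreachRDD instead 
of transform, since inside that closure the result of sqlc.sql() is still 
statically typed as SchemaRDD, so the schema can still be inspected there. A 
rough sketch (assuming the Spark 1.0-era API and the same sqlc, lines and query 
as above; method names may differ between versions):

lines.foreachRDD((rdd, time) => {
  // same assumption as above: an implicit conversion makes registerAsTable available
  rdd.registerAsTable("data")
  // inside foreachRDD the static type is still SchemaRDD
  val batchResult = sqlc.sql(query)
  batchResult.printSchema()  // column names/types can still be inspected here
  // ... process batchResult for this interval ...
})

The obvious drawback is that this does not give a DStream back, so it only 
helps when the per-batch processing can happen inside that closure.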

Thanks
Tobias


