Re: Some question about SQL and streaming

2014-07-11 Thread Tathagata Das
Yes, even though we don't have immediate plans, I would definitely like to
see it happen some time in the not-so-distant future.

TD


On Thu, Jul 10, 2014 at 7:55 PM, Shao, Saisai saisai.s...@intel.com wrote:

  No specific plans to do so, since there is some functional loss, such as
 the time-based windowing functionality that is important for streaming SQL.
 Also, staying compatible with the fast-growing Spark SQL is quite hard. So
 there are no clear plans to submit it upstream.



 -Jerry



 From: Tobias Pfeiffer [mailto:t...@preferred.jp]
 Sent: Friday, July 11, 2014 10:47 AM
 To: user@spark.apache.org
 Subject: Re: Some question about SQL and streaming



 Hi,



 On Fri, Jul 11, 2014 at 11:38 AM, Shao, Saisai saisai.s...@intel.com
 wrote:

 Actually, we have a POC project that shows the power of combining Spark
 Streaming and Catalyst: it can run SQL on top of Spark Streaming and
 produce a SchemaDStream. You can take a look at it:
 https://github.com/thunderain-project/StreamSQL



 Wow, that looks great! Any plans to get this code (or functionality)
 merged into Spark?



 Tobias





Re: Some question about SQL and streaming

2014-07-10 Thread hsy...@gmail.com
Yes, this is what I tried, but thanks!


On Wed, Jul 9, 2014 at 6:02 PM, Tobias Pfeiffer t...@preferred.jp wrote:

 Siyuan,

 I do it like this:

 // get data from Kafka
 val ssc = new StreamingContext(...)
 val kvPairs = KafkaUtils.createStream(...)
 // we need to wrap the data in a case class for registerAsTable() to
 // succeed
 val lines = kvPairs.map(_._2).map(s => StringWrapper(s))
 val result = lines.transform((rdd, time) => {
   // execute statement
   rdd.registerAsTable("data")
   sqlc.sql(query)
 })
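
 For completeness, here is roughly the setup the snippet above assumes
 (a sketch against the Spark 1.0-era API; the Kafka parameters, batch
 interval, query, and the StringWrapper case class are just examples):

 import org.apache.spark.{SparkConf, SparkContext}
 import org.apache.spark.sql.SQLContext
 import org.apache.spark.streaming.{Seconds, StreamingContext}
 import org.apache.spark.streaming.kafka.KafkaUtils

 // a simple wrapper so the RDD elements are a Product type that the
 // SQLContext implicits can turn into a SchemaRDD
 case class StringWrapper(s: String)

 val sc = new SparkContext(new SparkConf().setAppName("StreamingSQL"))
 val sqlc = new SQLContext(sc)
 import sqlc.createSchemaRDD  // implicit RDD[StringWrapper] -> SchemaRDD

 val ssc = new StreamingContext(sc, Seconds(10))
 // Kafka receiver: (zookeeper quorum, consumer group, topic -> partitions)
 val kvPairs = KafkaUtils.createStream(
   ssc, "zkhost:2181", "example-group", Map("example-topic" -> 1))
 val query = "SELECT s FROM data"  // example query against the temp table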

 Don't know if it is the best way, but it works.

 Tobias


 On Thu, Jul 10, 2014 at 4:21 AM, hsy...@gmail.com hsy...@gmail.com
 wrote:

 Hi guys,

 I'm a new Spark user. I would like to know: is there an example of how
 to use Spark SQL and Spark Streaming together? My use case is that I want
 to run some SQL on the input stream from Kafka.
 Thanks!

 Best,
 Siyuan





Re: Some question about SQL and streaming

2014-07-10 Thread Tobias Pfeiffer
Hi,

I think it would be great if we could do the string parsing only once and
then just apply the transformation for each interval (reducing the
processing overhead for short intervals).

Also, one issue with the approach above is that transform() has the
following signature:

  def transform(transformFunc: RDD[T] => RDD[U]): DStream[U]

and therefore, in my example

val result = lines.transform((rdd, time) => {
  // execute statement
  rdd.registerAsTable("data")
  sqlc.sql(query)
})


the variable `result` is of type DStream[Row]. That is, the
meta-information from the SchemaRDD is lost and, from what I understand,
there is then no way to learn about the column names of the returned data,
as this information is only encoded in the SchemaRDD. I would love to see a
fix for this.
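
One conceivable (untested) workaround, as long as the output columns of the
query are known in advance, would be to zip the column names with each Row
(Row is a Seq[Any] in this version of Spark SQL):

val columns = Seq("s")  // example: the columns that `query` selects
val tagged = lines.transform((rdd, time) => {
  rdd.registerAsTable("data")
  // yields DStream[Map[String, Any]]: values keyed by column name
  sqlc.sql(query).map(row => columns.zip(row).toMap)
})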

Thanks
Tobias


Re: Some question about SQL and streaming

2014-07-10 Thread Tathagata Das
Yeah, the right solution is to have something like SchemaDStream, where the
schema of all the SchemaRDDs generated by it can be stored. Something I would
really like to see happen in the future :)
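
Roughly, I am imagining something like this (purely hypothetical sketch;
none of these names exist in Spark today):

// hypothetical schema-aware DStream; StreamSQLContext, SchemaDStream and
// registerDStreamAsTable are made-up names for illustration only
val streamSqlc = new StreamSQLContext(ssc)
streamSqlc.registerDStreamAsTable(lines, "lines")
val result: SchemaDStream =
  streamSqlc.sql("SELECT s, COUNT(*) FROM lines GROUP BY s")
result.schema  // column names/types, stable across all intervals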

TD


On Thu, Jul 10, 2014 at 6:37 PM, Tobias Pfeiffer t...@preferred.jp wrote:

 Hi,

 I think it would be great if we could do the string parsing only once and
 then just apply the transformation for each interval (reducing the
 processing overhead for short intervals).

 Also, one issue with the approach above is that transform() has the
 following signature:

   def transform(transformFunc: RDD[T] => RDD[U]): DStream[U]

 and therefore, in my example

 val result = lines.transform((rdd, time) => {
   // execute statement
   rdd.registerAsTable("data")
   sqlc.sql(query)
 })


 the variable `result` is of type DStream[Row]. That is, the
 meta-information from the SchemaRDD is lost and, from what I understand,
 there is then no way to learn about the column names of the returned data,
 as this information is only encoded in the SchemaRDD. I would love to see a
 fix for this.

 Thanks
 Tobias





RE: Some question about SQL and streaming

2014-07-10 Thread Shao, Saisai
Actually, we have a POC project that shows the power of combining Spark 
Streaming and Catalyst: it can run SQL on top of Spark Streaming and produce 
a SchemaDStream. You can take a look at it: 
https://github.com/thunderain-project/StreamSQL

Thanks
Jerry

From: Tathagata Das [mailto:tathagata.das1...@gmail.com]
Sent: Friday, July 11, 2014 10:17 AM
To: user@spark.apache.org
Subject: Re: Some question about SQL and streaming

Yeah, the right solution is to have something like SchemaDStream, where the 
schema of all the SchemaRDDs generated by it can be stored. Something I would 
really like to see happen in the future :)

TD

On Thu, Jul 10, 2014 at 6:37 PM, Tobias Pfeiffer 
t...@preferred.jp wrote:
Hi,

I think it would be great if we could do the string parsing only once and then 
just apply the transformation for each interval (reducing the processing 
overhead for short intervals).

Also, one issue with the approach above is that transform() has the following 
signature:

  def transform(transformFunc: RDD[T] => RDD[U]): DStream[U]

and therefore, in my example

val result = lines.transform((rdd, time) => {
  // execute statement
  rdd.registerAsTable("data")
  sqlc.sql(query)
})

the variable `result` is of type DStream[Row]. That is, the meta-information 
from the SchemaRDD is lost and, from what I understand, there is then no way to 
learn about the column names of the returned data, as this information is only 
encoded in the SchemaRDD. I would love to see a fix for this.

Thanks
Tobias





RE: Some question about SQL and streaming

2014-07-10 Thread Shao, Saisai
No specific plans to do so, since there is some functional loss, such as the 
time-based windowing functionality that is important for streaming SQL. Also, 
staying compatible with the fast-growing Spark SQL is quite hard. So there are 
no clear plans to submit it upstream.
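
(By time-based windowing I mean something like what plain Spark Streaming
offers natively via window(), e.g. on the `lines` DStream from the example
earlier in this thread, for which there is no SQL-level equivalent here:)

// a 30-second window over the stream, recomputed every 10 seconds
val windowed = lines.window(Seconds(30), Seconds(10))
windowed.count().print()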

-Jerry

From: Tobias Pfeiffer [mailto:t...@preferred.jp]
Sent: Friday, July 11, 2014 10:47 AM
To: user@spark.apache.org
Subject: Re: Some question about SQL and streaming

Hi,

On Fri, Jul 11, 2014 at 11:38 AM, Shao, Saisai 
saisai.s...@intel.com wrote:
Actually, we have a POC project that shows the power of combining Spark 
Streaming and Catalyst: it can run SQL on top of Spark Streaming and produce 
a SchemaDStream. You can take a look at it: 
https://github.com/thunderain-project/StreamSQL

Wow, that looks great! Any plans to get this code (or functionality) merged 
into Spark?

Tobias



Re: Some question about SQL and streaming

2014-07-09 Thread Tobias Pfeiffer
Siyuan,

I do it like this:

// get data from Kafka
val ssc = new StreamingContext(...)
val kvPairs = KafkaUtils.createStream(...)
// we need to wrap the data in a case class for registerAsTable() to succeed
val lines = kvPairs.map(_._2).map(s => StringWrapper(s))
val result = lines.transform((rdd, time) => {
  // execute statement
  rdd.registerAsTable("data")
  sqlc.sql(query)
})

Don't know if it is the best way, but it works.

Tobias


On Thu, Jul 10, 2014 at 4:21 AM, hsy...@gmail.com hsy...@gmail.com wrote:

 Hi guys,

 I'm a new Spark user. I would like to know: is there an example of how
 to use Spark SQL and Spark Streaming together? My use case is that I want
 to run some SQL on the input stream from Kafka.
 Thanks!

 Best,
 Siyuan