Re: Trying to run SparkSQL over Spark Streaming
Hi,

I'm also trying to use the insertInto method, but end up getting the same assertion error. Is there any workaround to this?

rgds

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Trying-to-run-SparkSQL-over-Spark-Streaming-tp12530p21316.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
Re: Trying to run SparkSQL over Spark Streaming
Thanks for the reply. Sorry I could not ask earlier. Trying to use a parquet file is not working at all:

    case class Rec(name: String, pv: Int)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.createSchemaRDD
    val d1 = sc.parallelize(Array(("a", 10), ("b", 3))).map(e => Rec(e._1, e._2))
    d1.saveAsParquetFile("p1.parquet")
    val d2 = sc.parallelize(Array(("a", 10), ("b", 3), ("c", 5))).map(e => Rec(e._1, e._2))
    d2.saveAsParquetFile("p2.parquet")
    val f1 = sqlContext.parquetFile("p1.parquet")
    val f2 = sqlContext.parquetFile("p2.parquet")
    f1.registerAsTable("logs")
    f2.insertInto("logs")

is giving the error:

    java.lang.AssertionError: assertion failed: No plan for InsertIntoTable Map(), false

the same one that occurred while trying to insert from an RDD into a table. So I guess inserting into parquet tables is also not supported?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Trying-to-run-SparkSQL-over-Spark-Streaming-tp12530p13004.html
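[A possible workaround, sketched against the Spark 1.x SchemaRDD API: instead of insertInto, combine the two SchemaRDDs with unionAll (which keeps the schema, unlike plain RDD.union on rows) and register the union as the table. This is a sketch for spark-shell, assuming the same Rec case class and parquet files as above; it is not a confirmed fix for the InsertIntoTable assertion itself.]

```scala
// Sketch of a unionAll-based workaround (Spark 1.x SchemaRDD API assumed)
case class Rec(name: String, pv: Int)

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD

// Load both parquet files as SchemaRDDs, as in the snippet above
val f1 = sqlContext.parquetFile("p1.parquet")
val f2 = sqlContext.parquetFile("p2.parquet")

// unionAll preserves the schema, so the combined result can be
// registered as one table instead of inserting f2 into f1's table
f1.unionAll(f2).registerAsTable("logs")

// Query the combined table
sqlContext.sql("SELECT name, SUM(pv) FROM logs GROUP BY name").collect().foreach(println)
```

The same trick should apply to the non-parquet case: union the accumulated SchemaRDD with each new batch and re-register under the same table name.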
Re: Trying to run SparkSQL over Spark Streaming
Thanks for the reply. Ya, it doesn't seem doable straight away. Someone suggested this:

/For each of your streams, first create an empty RDD that you register as a table, obtaining an empty table. For your example, let's say you call it "allTeenagers". Then, for each of your queries, use SchemaRDD's insertInto method to add the result to that table: teenagers.insertInto("allTeenagers"). If you do this with both your streams, creating two separate accumulation tables, you can then join them using a plain old SQL query./

So I was trying it but can't seem to use the insertInto method in the correct way. Something like:

    case class Person(name: String, age: Int)
    var p1 = Person("Hari", 22)
    val rdd1 = sc.parallelize(Array(p1))
    rdd1.registerAsTable("data")
    var p2 = Person("sagar", 22)
    var rdd2 = sc.parallelize(Array(p2))
    rdd2.insertInto("data")

is giving the error:

    java.lang.AssertionError: assertion failed: No plan for InsertIntoTable Map(), false

Any thoughts?

Thanks

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Trying-to-run-SparkSQL-over-Spark-Streaming-tp12530p12812.html
Re: Trying to run SparkSQL over Spark Streaming
Hi,

Thanks for your help the other day. I had one more question regarding the same.

> If you want to issue an SQL statement on streaming data, you must have both the registerAsTable() and the sql() call *within* the foreachRDD(...) block, or -- as you experienced -- the table name will be unknown

Since this is the case, is there any way to run a join over data received from two different streams?

Thanks

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Trying-to-run-SparkSQL-over-Spark-Streaming-tp12530p12739.html
Re: Trying to run SparkSQL over Spark Streaming
Hi,

On Mon, Aug 25, 2014 at 7:11 PM, praveshjain1991 <praveshjain1...@gmail.com> wrote:
> > If you want to issue an SQL statement on streaming data, you must have both the registerAsTable() and the sql() call *within* the foreachRDD(...) block, or -- as you experienced -- the table name will be unknown
> Since this is the case then is there any way to run join over data received from two different streams?

Couldn't you do dstream1.join(dstream2).foreachRDD(...)?

Tobias
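[To make the suggestion concrete, here is a sketch of what dstream1.join(dstream2).foreachRDD(...) could look like on Spark 1.x. All sources, host names, and field names here are hypothetical; the point is that join() works per batch on pair DStreams keyed by the same type, and SQL can then be issued on the joined batch inside foreachRDD.]

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._ // pair-DStream implicits on older 1.x

// Hypothetical schema for the joined result
case class Activity(user: String, pageviews: Int, clicks: Int)

val conf = new SparkConf().setAppName("stream-join-sketch")
val ssc = new StreamingContext(conf, Seconds(10))

// Two hypothetical text sources, each emitting "user,count" lines,
// parsed into (key, value) pairs so that join() applies
val pageviews = ssc.socketTextStream("host1", 9999)
  .map { line => val a = line.split(","); (a(0), a(1).toInt) }
val clicks = ssc.socketTextStream("host2", 9999)
  .map { line => val a = line.split(","); (a(0), a(1).toInt) }

// join() matches keys within each batch; SQL then runs per batch
pageviews.join(clicks).foreachRDD { rdd =>
  val sqlContext = new SQLContext(rdd.sparkContext)
  import sqlContext.createSchemaRDD
  val joined = rdd.map { case (user, (pv, cl)) => Activity(user, pv, cl) }
  joined.registerAsTable("activity")
  sqlContext.sql("SELECT user, pageviews + clicks FROM activity")
    .collect().foreach(println)
}

ssc.start()
ssc.awaitTermination()
```

This only joins records that arrive in the same batch interval; a windowed join would be needed if the two streams are not aligned.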
Re: Trying to run SparkSQL over Spark Streaming
Hi again,

On Tue, Aug 26, 2014 at 10:13 AM, Tobias Pfeiffer <t...@preferred.jp> wrote:
> Couldn't you do dstream1.join(dstream2).foreachRDD(...)?

Ah, I guess you meant something like SELECT * FROM dstream1 JOIN dstream2 WHERE ...? I don't know if that is possible. Doesn't seem easy to me; I don't think that's doable with the current codebase...

Tobias
Re: Trying to run SparkSQL over Spark Streaming
Hi,

Thanks for the reply and the link. It's working now. From the discussion at the link, I understand that there are some shortcomings when using SQL over streaming. The part that you mentioned:

/the variable `result` is of type DStream[Row]. That is, the meta-information from the SchemaRDD is lost and, from what I understand, there is then no way to learn about the column names of the returned data, as this information is only encoded in the SchemaRDD/

Why is this bad?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Trying-to-run-SparkSQL-over-Spark-Streaming-tp12530p12535.html
Re: Trying to run SparkSQL over Spark Streaming
Hi,

On Thu, Aug 21, 2014 at 3:11 PM, praveshjain1991 <praveshjain1...@gmail.com> wrote:
> The part that you mentioned: /the variable `result` is of type DStream[Row]. That is, the meta-information from the SchemaRDD is lost and, from what I understand, there is then no way to learn about the column names of the returned data, as this information is only encoded in the SchemaRDD/ Why is this bad?

Because a DStream[Row] is basically like a DStream[Array[Object]]. You lose all the information about the data types in your result, and there is no way to recover it once the schema is inaccessible. If you want to process the data later on, you will have to check types and make assertions about the statements that were issued before, etc.

Tobias
Re: Trying to run SparkSQL over Spark Streaming
Oh right. Got it. Thanks.

Also found this link in that discussion: https://github.com/thunderain-project/StreamSQL

Does this provide more features than Spark?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Trying-to-run-SparkSQL-over-Spark-Streaming-tp12530p12538.html
RE: Trying to run SparkSQL over Spark Streaming
Hi,

StreamSQL (https://github.com/thunderain-project/StreamSQL) is a POC project based on Spark that combines the power of Catalyst and Spark Streaming, to offer the ability to manipulate SQL on top of DStream as you wanted. It keeps the same semantics as Spark SQL and offers a SchemaDStream on top of DStream, so you don't need to do tricky things like extracting an RDD to register it as a table. Other parts are the same as Spark.

Thanks
Jerry

-----Original Message-----
From: praveshjain1991 [mailto:praveshjain1...@gmail.com]
Sent: Thursday, August 21, 2014 2:25 PM
To: u...@spark.incubator.apache.org
Subject: Re: Trying to run SparkSQL over Spark Streaming

Oh right. Got it. Thanks. Also found this link in that discussion: https://github.com/thunderain-project/StreamSQL Does this provide more features than Spark?
Re: Trying to run SparkSQL over Spark Streaming
Hi,

On Thu, Aug 21, 2014 at 2:19 PM, praveshjain1991 <praveshjain1...@gmail.com> wrote:
> Using Spark SQL with batch data works fine so I'm thinking it has to do with how I'm calling streamingcontext.start(). Any ideas what is the issue? Here is the code:

Please have a look at http://apache-spark-user-list.1001560.n3.nabble.com/Some-question-about-SQL-and-streaming-tp9229p9254.html

If you want to issue an SQL statement on streaming data, you must have both the registerAsTable() and the sql() call *within* the foreachRDD(...) block, or -- as you experienced -- the table name will be unknown.

Tobias
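[For reference, a minimal sketch of the pattern described above, written against the Spark 1.x streaming and SQL APIs. The input source, host/port, line format, and the Person schema are all assumptions for illustration; the essential point is only that both registerAsTable() and sql() sit inside foreachRDD.]

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical schema for "name,age" text lines
case class Person(name: String, age: Int)

val conf = new SparkConf().setAppName("streaming-sql-sketch")
val ssc = new StreamingContext(conf, Seconds(10))

// Hypothetical source emitting one "name,age" record per line
val lines = ssc.socketTextStream("localhost", 9999)

lines.foreachRDD { rdd =>
  // Both the table registration and the query live inside foreachRDD,
  // so the table name is (re)defined for every batch
  val sqlContext = new SQLContext(rdd.sparkContext)
  import sqlContext.createSchemaRDD
  val people = rdd.map(_.split(",")).map(a => Person(a(0), a(1).toInt))
  people.registerAsTable("people")
  val teenagers = sqlContext.sql(
    "SELECT name FROM people WHERE age >= 13 AND age <= 19")
  teenagers.collect().foreach(println)
}

ssc.start()
ssc.awaitTermination()
```

If registerAsTable() were called outside foreachRDD, the sql() call inside it would fail with an unknown-table error, as described in the linked thread.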