Re: Streaming with broadcast joins

2016-02-20 Thread Srikanth
Sebastian, *Update:* This is not possible, and it will probably remain this way for the foreseeable future. https://issues.apache.org/jira/browse/SPARK-3863 Srikanth On Fri, Feb 19, 2016 at 10:20 AM, Sebastian Piu wrote: > I don't have the code with me now, and I ended

Re: Streaming with broadcast joins

2016-02-19 Thread Srikanth
Sure. These may be unrelated. On Fri, Feb 19, 2016 at 10:39 AM, Jerry Lam wrote: > Hi guys, > > I also encountered a broadcast DataFrame issue, not for streaming jobs but > for a regular DataFrame join. In my case, the executors died, probably due to OOM, > although I don't think it should

Re: Streaming with broadcast joins

2016-02-19 Thread Srikanth
Hmmm... OK. Srikanth On Fri, Feb 19, 2016 at 10:20 AM, Sebastian Piu wrote: > I don't have the code with me now, and I ended up moving everything to RDDs in > the end and using map operations to do some lookups, i.e. instead of > broadcasting a DataFrame I ended up broadcasting

Re: Streaming with broadcast joins

2016-02-19 Thread Jerry Lam
Hi guys, I also encountered a broadcast DataFrame issue, not for streaming jobs but for a regular DataFrame join. In my case, the executors died, probably due to OOM, although I don't think the join should use that much memory. Anyway, I'm going to craft an example and send it here to see if it is a bug or something

Re: Streaming with broadcast joins

2016-02-19 Thread Sebastian Piu
I don't have the code with me now, and I ended up moving everything to RDDs in the end and using map operations to do some lookups, i.e. instead of broadcasting a DataFrame I ended up broadcasting a Map On Fri, Feb 19, 2016 at 11:39 AM Srikanth wrote: > It didn't fail. It

Re: Streaming with broadcast joins

2016-02-19 Thread Srikanth
It didn't fail. It wasn't broadcasting. I just ran the test again and here are the logs. Every batch is reading the metadata file. 16/02/19 06:27:02 INFO HadoopRDD: Input split: file:/shared/data/test-data.txt:0+27 16/02/19 06:27:02 INFO HadoopRDD: Input split:

Re: Streaming with broadcast joins

2016-02-19 Thread Sebastian Piu
I don't see anything obviously wrong with your second approach; I've done it like that before and it worked. When you say that it didn't work, what do you mean? Did it fail? Did it not broadcast? On Thu, Feb 18, 2016 at 11:43 PM Srikanth wrote: > Code with SQL broadcast hint.

Re: Streaming with broadcast joins

2016-02-18 Thread Srikanth
Code with SQL broadcast hint. This worked and I was able to see that broadcastjoin was performed. val testDF = sqlContext.read.format("com.databricks.spark.csv") .schema(schema).load("file:///shared/data/test-data.txt") val lines = ssc.socketTextStream("DevNode", )

Re: Streaming with broadcast joins

2016-02-18 Thread Sebastian Piu
Can you paste the code where you use sc.broadcast? On Thu, Feb 18, 2016 at 5:32 PM Srikanth wrote: > Sebastian, > > I was able to broadcast using the SQL broadcast hint. The question is how to > prevent this broadcast from happening for each RDD. > Is there a way for it to be broadcast once

Re: Streaming with broadcast joins

2016-02-18 Thread Srikanth
Sebastian, I was able to broadcast using the SQL broadcast hint. The question is how to prevent this broadcast from happening for each RDD. Is there a way for it to be broadcast once and used locally for each RDD? Right now, in every batch the metadata file is read and the DF is broadcast. I tried sc.broadcast and

Re: Streaming with broadcast joins

2016-02-17 Thread Sebastian Piu
You should be able to broadcast that DataFrame using sc.broadcast and join against it. On Wed, 17 Feb 2016, 21:13 Srikanth wrote: > Hello, > > I have a streaming use case where I plan to keep a dataset broadcast and > cached on each executor. > Every micro-batch in

Streaming with broadcast joins

2016-02-17 Thread Srikanth
Hello, I have a streaming use case where I plan to keep a dataset broadcast and cached on each executor. Every micro-batch in streaming will create a DF out of the RDD and join the batch. The code below will perform the broadcast operation for each RDD. Is there a way to broadcast it just once?