Re: Two Nodes :SparkContext Null Pointer

2017-04-10 Thread Sriram
Fixed it by submitting the second job as a child process. Thanks, Sriram. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Two-Nodes-SparkContext-Null-Pointer-tp28582p28585.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Dataframes na fill with empty list

2017-04-10 Thread Sumona Routh
Hi there, I have two dataframes that each have some columns which are of list type (array generated by the collect_list function actually). I need to outer join these two dfs, however by nature of an outer join I am sometimes left with null values. Normally I would use df.na.fill(...), however it
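Since df.na.fill only covers primitive column types, one common workaround (a sketch only; the dataframe name `joined`, the column name `items`, and the string element type are assumptions, not from the original thread) is to replace the post-join nulls explicitly with an empty array:

```scala
import org.apache.spark.sql.functions.{array, coalesce, col}

// After an outer join, array columns from the non-matching side come back
// null; na.fill skips them, so coalesce each such column with an empty
// array cast to the column's element type.
val filled = joined.withColumn(
  "items",
  coalesce(col("items"), array().cast("array<string>"))
)
```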

Re: Is checkpointing in Spark Streaming Synchronous or Asynchronous ?

2017-04-10 Thread Tathagata Das
As of now (Spark 2.2), Structured Streaming checkpoints the state data synchronously in every trigger. But the checkpointing is incremental, so it won't be writing all your state every time. And we will be making this asynchronous soon. On Fri, Apr 7, 2017 at 3:19 AM, kant kodali

Re: Is the trigger interval the same as batch interval in structured streaming?

2017-04-10 Thread kant kodali
Perfect! Thanks a lot. On Mon, Apr 10, 2017 at 1:39 PM, Tathagata Das wrote: > The trigger interval is optionally specified in the writeStream option > before start. > > val windowedCounts = words.groupBy( > window($"timestamp", "24 hours", "24 hours"), >

Re: Is the trigger interval the same as batch interval in structured streaming?

2017-04-10 Thread Tathagata Das
The trigger interval is optionally specified in the writeStream option before start.

    val windowedCounts = words.groupBy(
        window($"timestamp", "24 hours", "24 hours"),
        $"word"
      ).count()
      .writeStream
      .trigger(ProcessingTime("10 seconds"))  // optional
      .format("memory")
      .queryName("tableName")

Re: Is the trigger interval the same as batch interval in structured streaming?

2017-04-10 Thread kant kodali
Thanks again! Looks like the update mode is not available in 2.1 (which seems to be the latest version as of today) and I am assuming there will be a way to specify trigger interval with the next release because with the following code I don't see a way to specify trigger interval. val

Re: Is the trigger interval the same as batch interval in structured streaming?

2017-04-10 Thread Michael Armbrust
It sounds like you want a tumbling window (where the slide and duration are the same). This is the default if you give only one interval. You should set the output mode to "update" (i.e. output only the rows that have been updated since the last trigger) and the trigger to "1 second". Try
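Under the assumptions in this thread (a streaming Dataset `words` with `timestamp` and `word` columns — names assumed for illustration), a sketch of the suggested tumbling window with update output mode and a 1-second trigger might look like:

```scala
import org.apache.spark.sql.functions.window
import org.apache.spark.sql.streaming.ProcessingTime

// Tumbling 24-hour window: only one interval is given, so slide == duration.
val counts = words
  .groupBy(window($"timestamp", "24 hours"), $"word")
  .count()

// "update" emits only the rows changed since the last trigger; the trigger
// controls how often results are produced, here every second.
val query = counts.writeStream
  .outputMode("update")
  .trigger(ProcessingTime("1 second"))
  .format("console")
  .start()
```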

Re: Is the trigger interval the same as batch interval in structured streaming?

2017-04-10 Thread kant kodali
Hi Michael, Thanks for the response. I guess I was thinking more in terms of the regular streaming model, so in this case I am a little confused about what my window interval and slide interval should be for the following case: I need to hold a state (say a count) for 24 hours while capturing all its updates

Re: Cant convert Dataset to case class with Option fields

2017-04-10 Thread Michael Armbrust
Options should work. Can you give a full example that is freezing? Which version of Spark are you using? On Fri, Apr 7, 2017 at 6:59 AM, Dirceu Semighini Filho < dirceu.semigh...@gmail.com> wrote: > Hi Devs, > I have some case classes here, and its fields are all optional > case class
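A minimal sketch of the kind of round trip being discussed (the `Person` class and its field names are hypothetical, not from the original thread):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical case class with all-optional fields, as in the question.
case class Person(name: Option[String], age: Option[Int])

val spark = SparkSession.builder().master("local[*]").appName("options").getOrCreate()
import spark.implicits._

// Nullable columns surface as None in the typed view rather than null.
val ds = Seq(Person(Some("ana"), None), Person(None, Some(30))).toDS()
val people: Array[Person] = ds.as[Person].collect()
```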

Re: Is the trigger interval the same as batch interval in structured streaming?

2017-04-10 Thread Michael Armbrust
Nope, structured streaming eliminates the limitation that micro-batching affects the results of your streaming query. The trigger is just an indication of how often you want to produce results (and if you leave it blank we just run as quickly as possible). To control how tuples are grouped

Re: pandas DF Dstream to Spark DF

2017-04-10 Thread Bryan Cutler
Hi Yogesh, It would be easier to help if you included your code and the exact error messages that occur. If you are creating a Spark DataFrame from a Pandas DataFrame, then Spark does not read the Pandas schema; it infers one from the data. This might be the cause of your issue if the schema
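The thread itself is about PySpark, but the same remedy expressed in Scala (the language of the other code in this digest) is to pass an explicit schema instead of relying on inference; the column names and types below are assumptions for illustration:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, LongType, StructField, StructType}

// Declaring the schema up front avoids inference surprises when some
// values are null or ambiguously typed.
val schema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("value", DoubleType, nullable = true)
))

val rows = spark.sparkContext.parallelize(Seq(Row(1L, 1.5), Row(2L, null)))
val df = spark.createDataFrame(rows, schema)
```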

Re: Assigning a unique row ID

2017-04-10 Thread Everett Anderson
Indeed, I tried persist with MEMORY_AND_DISK and it works! (I'm wary of MEMORY_ONLY for this as it could potentially recompute shards if it couldn't entirely cache in memory.) Thanks for the help, everybody!! On Sat, Apr 8, 2017 at 11:54 AM, Everett Anderson wrote: > > > On
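The approach described above can be sketched as follows (the input DataFrame `df` and column name `row_id` are assumptions):

```scala
import org.apache.spark.sql.functions.monotonically_increasing_id
import org.apache.spark.storage.StorageLevel

// monotonically_increasing_id values are only stable if the plan is not
// recomputed, so persist (memory and disk, to avoid recompute when a
// partition cannot stay cached in memory) and materialize before relying
// on the IDs downstream.
val withId = df.withColumn("row_id", monotonically_increasing_id())
withId.persist(StorageLevel.MEMORY_AND_DISK)
withId.count()
```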

Re: unit testing in spark

2017-04-10 Thread Jörn Franke
I think in the end you need to check the coverage of your application. If your application is well covered at the job or pipeline level (this depends, however, on how you implement these tests), then it can be fine. In the end it really depends on the data and what kind of transformation you

Re: unit testing in spark

2017-04-10 Thread Gokula Krishnan D
Hello Shiv, Unit testing really helps when you follow the TDD approach. It's a safe way to code a program locally, and you can also make use of those test cases during the build process with any of the continuous integration tools (Bamboo, Jenkins). If so you can ensure that artifacts are

Re: How to convert Spark MLlib vector to ML Vector?

2017-04-10 Thread Md. Rezaul Karim
Hi Yan, Ryan, and Nick, Actually, for a special use case, I had to use the RDD-based Spark MLlib API, which eventually did not work. Therefore, I had to switch to Spark ML later on. Thanks for your support guys. Regards, *Md. Rezaul Karim*, BSc, MSc PhD Researcher,

Re: Does spark 2.1.0 structured streaming support jdbc sink?

2017-04-10 Thread Ofir Manor
Also check SPARK-19478 - JDBC sink (seems to be waiting for a review) Ofir Manor Co-Founder & CTO | Equalum Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io On Mon, Apr 10, 2017 at 10:10 AM, Hemanth Gudela

Two Nodes :SparkContext Null Pointer

2017-04-10 Thread Sriram
Hello Everyone, Need support on this scenario: we have two masters and three worker nodes, all configured in a standalone cluster. There are two jobs deployed on all the worker nodes: one is a Quartz scheduler job and the other is an exporting application. The Quartz scheduler job is submitted from

Re: Does spark 2.1.0 structured streaming support jdbc sink?

2017-04-10 Thread Hemanth Gudela
Many thanks Silvio for the link. That’s exactly what I’m looking for. ☺ However, there is no mention of checkpoint support for a custom “ForeachWriter” in structured streaming. I’m going to test that now. Good question Gary, this is the mention in the
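A minimal sketch of such a custom JDBC sink (the URL, table, and column layout are assumptions). Note that the engine checkpoints offsets, but this writer itself is not transactional, so partitions may be replayed after a failure and writes should be made idempotent (for example, upserts keyed on a unique id):

```scala
import java.sql.{Connection, DriverManager, PreparedStatement}
import org.apache.spark.sql.{ForeachWriter, Row}

// One JdbcSink instance is serialized to each task; open/process/close are
// called per partition per epoch.
class JdbcSink(url: String) extends ForeachWriter[Row] {
  var conn: Connection = _
  var stmt: PreparedStatement = _

  override def open(partitionId: Long, version: Long): Boolean = {
    conn = DriverManager.getConnection(url)
    stmt = conn.prepareStatement("INSERT INTO events (word, count) VALUES (?, ?)")
    true  // returning false would skip this partition for this epoch
  }

  override def process(row: Row): Unit = {
    stmt.setString(1, row.getString(0))
    stmt.setLong(2, row.getLong(1))
    stmt.executeUpdate()
  }

  override def close(errorOrNull: Throwable): Unit = {
    if (stmt != null) stmt.close()
    if (conn != null) conn.close()
  }
}
```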