PySpark Structured Streaming - using previous iteration computed results in current iteration

2018-05-16 Thread Ofer Eliassaf
We would like to maintain arbitrary state between invocations of the Structured Streaming iterations in Python. How can we maintain a static DataFrame that acts as state between the iterations? Several options that may be relevant: 1. in Spark memory (distributed across the
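
[Editor's sketch] One of the options hinted at above can be illustrated as a stream-static join, where a static DataFrame holds the state and each micro-batch is joined against it. This is a minimal sketch, not the thread's confirmed approach; the Kafka broker, topic, snapshot path, and "key" column are all assumptions:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("state-join-sketch").getOrCreate()

    # Static DataFrame acting as state, e.g. a snapshot on disk
    # (hypothetical path; assumed to contain a "key" column).
    state_df = spark.read.parquet("/tmp/state_snapshot")

    # Streaming source (broker and topic are assumptions).
    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "host:9092")
              .option("subscribe", "events")
              .load()
              .selectExpr("CAST(key AS STRING) AS key",
                          "CAST(value AS STRING) AS value"))

    # Each micro-batch is joined against the static state DataFrame.
    enriched = events.join(state_df, "key", "left")

    query = enriched.writeStream.format("console").outputMode("append").start()
    query.awaitTermination()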

Re: Submit many spark applications

2018-05-16 Thread ayan guha
How about using Livy to submit jobs? On Thu, 17 May 2018 at 7:24 am, Marcelo Vanzin wrote: > You can either: > > - set spark.yarn.submit.waitAppCompletion=false, which will make > spark-submit go away once the app starts in cluster mode. > - use the (new in 2.3)
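
[Editor's sketch] For illustration, a minimal sketch of the Livy suggestion: submit a batch application through Livy's REST /batches endpoint (host, port, and application path are assumptions):

    import json
    import urllib.request

    # POST a batch job definition to the Livy server (hypothetical host/path).
    payload = {"file": "hdfs:///apps/my_app.py", "name": "my-app"}
    req = urllib.request.Request(
        "http://livy-host:8998/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(resp.read().decode("utf-8"))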

Re: Submit many spark applications

2018-05-16 Thread Marcelo Vanzin
You can either: - set spark.yarn.submit.waitAppCompletion=false, which will make spark-submit go away once the app starts in cluster mode. - use the (new in 2.3) InProcessLauncher class + some custom Java code to submit all the apps from the same "launcher" process. On Wed, May 16, 2018 at 1:45
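
[Editor's sketch] A sketch of the first option only, launching fire-and-forget submissions from Python (the application name is hypothetical; the InProcessLauncher route additionally requires custom Java code, not shown here):

    import subprocess

    # With waitAppCompletion=false in cluster mode, spark-submit returns
    # as soon as the application starts on YARN.
    subprocess.Popen([
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--conf", "spark.yarn.submit.waitAppCompletion=false",
        "my_app.py",
    ])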

Submit many spark applications

2018-05-16 Thread Shiyuan
Hi Spark-users, I want to submit as many Spark applications as the resources permit. I am using cluster mode on a YARN cluster. YARN can queue and launch these applications without problems. The problem lies with spark-submit itself. Spark-submit starts a JVM which could fail due to insufficient

Unsubscribe

2018-05-16 Thread varma dantuluri
-- Regards, Varma Dantuluri

Re:

2018-05-16 Thread Nicolas Paris
Hi, I would go for a regular MySQL bulk load; I'm saying writing an output that MySQL is able to load in one process. I'd say Spark JDBC is OK for a small fetch/load. When it comes to a large RDBMS load, it turns out that using the database's regular optimized API is better than JDBC. 2018-05-16 16:18 GMT+02:00 Vadim
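
[Editor's sketch] A minimal sketch of this two-step approach (output path, separator, and table name are assumptions): write plain files from Spark, then bulk-load them with MySQL's native loader outside Spark:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1000).withColumnRenamed("id", "value")  # stand-in data

    # Write plain TSV part files that MySQL's loader can ingest directly.
    (df.coalesce(20)
       .write
       .mode("overwrite")
       .option("sep", "\t")
       .csv("/data/export/mytable"))

    # Then, on the MySQL side, bulk-load each part file, e.g.:
    #   LOAD DATA INFILE '/data/export/mytable/part-00000-xxxx.csv'
    #   INTO TABLE mytable FIELDS TERMINATED BY '\t';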

Re: How to use StringIndexer for multiple input /output columns in Spark Java

2018-05-16 Thread Bryan Cutler
Yes, the workaround is to create multiple StringIndexers as you described. OneHotEncoderEstimator is only in Spark 2.3.0, you will have to use just OneHotEncoder. On Tue, May 15, 2018, 8:40 AM Mina Aslani wrote: > Hi, > > So, what is the workaround? Should I create
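
[Editor's sketch] The described workaround, sketched in PySpark rather than Java (column names are hypothetical): build one StringIndexer per input column and chain them in a Pipeline:

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("red", "s"), ("blue", "m"), ("red", "l")],
        ["color", "size"])  # stand-in data

    # One StringIndexer per column, all fitted in a single Pipeline.
    indexers = [StringIndexer(inputCol=c, outputCol=c + "_idx")
                for c in ["color", "size"]]
    indexed = Pipeline(stages=indexers).fit(df).transform(df)
    indexed.show()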

Structured Streaming Job stops abruptly and No errors logged

2018-05-16 Thread prudhviraj202m
Hello, I have a structured streaming job that consumes messages from Kafka and does some stateful associations using flatMapGroupsWithState. Every time I submit the job, it runs fine for around 2 hours and then stops abruptly without any error messages. All I can see in the debug logs is the below

Re:

2018-05-16 Thread Vadim Semenov
Upon downsizing to 20 partitions, some of your partitions become too big. I see that you're doing caching, and executors try to write big partitions to disk but fail because they exceed 2 GiB. > Caused by: java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE at
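
[Editor's sketch] A sketch of the usual remedy (the partition count is an assumption; size it so each cached block stays well below 2 GiB):

    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(10**6)  # stand-in for the cached dataframe

    # Keep more partitions so no single cached block approaches 2 GiB.
    df = df.repartition(200)
    df.persist(StorageLevel.MEMORY_AND_DISK)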

Dataframe save-as-table operation fails when child column names contain special characters

2018-05-16 Thread abhijeet bedagkar
Hi, I am using Spark to read XML/JSON files to create a dataframe and save it as a Hive table. Sample XML file (markup stripped in this archive): 101 45 COMMAND. Note the field 'validation-timeout' under testexecutioncontroller. Below is the schema populated by the DF after reading the XML file: |-- id:

Re: Structured Streaming, Reading and Updating a variable

2018-05-16 Thread Martin Engen
I have been testing some with aggregations, but I seem to hit a wall on two issues. Example: val avg = areaStateDf.groupBy($"plantKey").avg("sensor") 1) How can I use the result from an aggregation within the same stream to do further calculations? 2) It seems to be very slow. If I want a moving
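
[Editor's sketch] For the moving-average part, a sliding window over event time is the usual shape. A sketch with a rate-source stand-in for areaStateDf (all column names and window sizes are assumptions):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Stand-in for the thread's areaStateDf, built from the rate source.
    area_state_df = (spark.readStream
                     .format("rate").option("rowsPerSecond", 10).load()
                     .withColumn("plantKey", F.col("value") % 5)
                     .withColumn("sensor", F.rand())
                     .withColumnRenamed("timestamp", "eventTime"))

    # Overlapping 10-minute windows that advance every minute give a
    # moving average per plantKey; the watermark bounds the kept state.
    moving_avg = (area_state_df
                  .withWatermark("eventTime", "10 minutes")
                  .groupBy(F.window("eventTime", "10 minutes", "1 minute"),
                           "plantKey")
                  .avg("sensor"))

    query = (moving_avg.writeStream
             .outputMode("update").format("console").start())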

Re:

2018-05-16 Thread Davide Brambilla
Hi, we have 2 million rows, on an EMR cluster with 8 m4.4xlarge machines with 100GB EBS storage. Davide Brambilla, ContentWise R&D davide.brambi...@contentwise.tv

OOM: Structured Streaming aggregation state not cleaned up properly

2018-05-16 Thread weand
We implemented a streaming query with aggregation on event-time with a watermark. I'm wondering why aggregation state is not cleaned up. According to the documentation, old aggregation state should be cleared when using watermarks. We also don't see any condition [1] for why state should not be cleaned
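
[Editor's sketch] For reference, a minimal shape under which cleanup is expected (a sketch, not the poster's query): the watermark is set on the same event-time column used in the windowed groupBy, and the query runs in update or append mode:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    events = spark.readStream.format("rate").load()  # 'timestamp' = event time

    counts = (events
              .withWatermark("timestamp", "1 hour")          # watermark on the
              .groupBy(F.window("timestamp", "15 minutes"))  # same column as window
              .count())

    query = counts.writeStream.outputMode("update").format("console").start()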

Re:

2018-05-16 Thread Jörn Franke
How many rows do you have in total? > On 16. May 2018, at 11:36, Davide Brambilla > wrote: > > Hi all, > we have a dataframe with 1000 partitions and we need to write the > dataframe into a MySQL database using this command: > > df.coalesce(20) >

[no subject]

2018-05-16 Thread Davide Brambilla
Hi all, we have a dataframe with 1000 partitions and we need to write the dataframe into a MySQL database using this command: df.coalesce(20) df.write.jdbc(url=url, table=table, mode=mode, properties=properties) and we get these errors randomly
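
[Editor's note] As an aside, coalesce returns a new DataFrame, so as quoted the write above still uses all 1000 partitions. A sketch of the write as presumably intended (connection settings are hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1000)  # stand-in for the 1000-partition dataframe

    url = "jdbc:mysql://host:3306/db"  # hypothetical connection settings
    props = {"user": "user", "password": "pw",
             "driver": "com.mysql.jdbc.Driver"}

    # Chain coalesce into the write so the reduced partition count is used.
    (df.coalesce(20)
       .write
       .jdbc(url=url, table="mytable", mode="append", properties=props))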

Re: [Java] impact of java 10 on spark dev

2018-05-16 Thread Jörn Franke
The first thing would be for Scala to support them. Then, for other things, someone might need to redesign the Spark source code to leverage modules; this could be a rather handy feature to have: a small but very well designed core (core, ml, graph, etc.) around which others write useful modules. > On

Re: [structured-streaming][kafka] Will the Kafka readstream timeout after connections.max.idle.ms 540000 ms ?

2018-05-16 Thread Shixiong(Ryan) Zhu
The streaming query should keep polling data from Kafka. When the query was stopped, did you see any exception? Best Regards, Shixiong Zhu Databricks Inc. shixi...@databricks.com databricks.com
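
[Editor's sketch] One way to surface such an exception (a sketch with a stand-in source): awaitTermination re-raises the failure that stopped a query, if any:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    stream = spark.readStream.format("rate").load()  # stand-in source

    query = stream.writeStream.format("console").start()
    try:
        query.awaitTermination()
    except Exception as e:
        # Prints the failure that terminated the query, if there was one.
        print("query terminated with:", e)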

Re: Continuous Processing mode behaves differently from Batch mode

2018-05-16 Thread Shixiong(Ryan) Zhu
One possible case is you don't have enough resources to launch all tasks for your continuous processing query. Could you check the Spark UI and see if all tasks are running rather than waiting for resources? Best Regards, Shixiong Zhu Databricks Inc. shixi...@databricks.com databricks.com
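
[Editor's sketch] For context, continuous mode pins one long-running task per input partition for the life of the query, so all of those tasks must get a core at once. A minimal sketch of a continuous trigger (stand-in source):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    stream = spark.readStream.format("rate").load()  # stand-in source

    # Continuous processing: tasks run indefinitely, one per partition,
    # so the cluster needs at least that many free cores.
    query = (stream.writeStream
             .format("console")
             .trigger(continuous="1 second")
             .start())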

[Java] impact of java 10 on spark dev

2018-05-16 Thread xmehaut
Hello, I would like to know what could be the impacts of Java 10+ on Spark. I know that Spark is written in Scala, but the latest versions of Java include many improvements, especially in the JVM and in the delivery process (modules, JIT, memory management, ...), which could benefit Spark. Regards