Re: Use Arrow instead of Pickle without pandas_udf

2018-07-30 Thread Bryan Cutler
Here is a link to the JIRA for adding StructType support for scalar pandas_udf: https://issues.apache.org/jira/browse/SPARK-24579 On Wed, Jul 25, 2018 at 3:36 PM, Hichame El Khalfi wrote: > Hey Holden, > Thanks for your reply, > > We are currently using a python function that produces a
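For context, here is a minimal sketch of what a scalar pandas_udf looks like in Spark 2.3, where data moves between the JVM and the Python worker over Arrow rather than Pickle; returning a StructType from such a UDF is what SPARK-24579 tracks. The app name and column usage below are illustrative only, and pyarrow must be installed:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    spark = SparkSession.builder.appName("arrow-example").getOrCreate()
    # Arrow can also be used without pandas_udf for toPandas()/createDataFrame(pandas_df)
    spark.conf.set("spark.sql.execution.arrow.enabled", "true")

    @pandas_udf("long", PandasUDFType.SCALAR)
    def plus_one(v):
        # v is a pandas.Series; the batch arrives and returns via Arrow
        return v + 1

    spark.range(10).withColumn("x", plus_one("id")).show()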

Executor lost for unknown reasons error Spark 2.3 on kubernetes

2018-07-30 Thread purna pradeep
Hello, I’m getting the below error in the Spark driver pod logs, and executor pods are getting killed midway while the job is running; even the driver pod is Terminated with the below intermittent error. This happens if I run multiple jobs in parallel. Not able to see executor logs as executor pods

Executor lost for unknown reasons error Spark 2.3 on kubernetes

2018-07-30 Thread Mamillapalli, Purna Pradeep
Hello, I’m getting the below error in the Spark driver pod logs, and executor pods are getting killed midway while the job is running; even the driver pod is Terminated with the below intermittent error. This happens if I run multiple jobs in parallel. Not able to see executor logs as executor pods

Re: How to Create one DB connection per executor and close it after the job is done?

2018-07-30 Thread Vadim Semenov
object MyDatabseSingleton {
  @transient lazy val dbConn = DB.connect(…)
}
`transient` marks the variable to be excluded from serialization, and `lazy` opens the connection only when it is first needed; it also makes sure the val is initialized in a thread-safe way.

Re: How to Create one DB connection per executor and close it after the job is done?

2018-07-30 Thread kant kodali
Hi Patrick, This object must be serializable, right? I wonder if I will have access to this object in my driver (since it is getting created on the executor side) so I can close it when I am done with my batch? Thanks! On Mon, Jul 30, 2018 at 7:37 AM, Patrick McGloin wrote: > You could use an object in

sorting on dataframe causes out of memory (java heap space)

2018-07-30 Thread msbreuer
While working with larger datasets I run into out of memory issues. Basically a Hadoop sequence file is read, its contents are sorted, and a Hadoop map file is written back. The code works fine for workloads greater than 20 GB. Then I changed one column in my dataset to store a large object and size of
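For reference, a minimal PySpark sketch of the pipeline described above, under assumptions: the paths are placeholders, and the output is written as a SequenceFile rather than a MapFile for simplicity. Increasing the number of sort partitions is one way to reduce per-task heap pressure during the sort:

    from pyspark import SparkConf, SparkContext

    sc = SparkContext(conf=SparkConf().setAppName("sort-sequence-file"))

    # Read a Hadoop SequenceFile of key/value pairs (placeholder path)
    pairs = sc.sequenceFile("/data/input.seq")

    # Sort by key; more partitions mean smaller per-task sort buffers
    sorted_pairs = pairs.sortByKey(numPartitions=400)

    # Write the sorted pairs back out (placeholder path)
    sorted_pairs.saveAsSequenceFile("/data/output.seq")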

Re: Kafka backlog - spark structured streaming

2018-07-30 Thread Arun Mahadevan
Here's a proposal to add this: https://github.com/apache/spark/pull/21819 It's always good to set "maxOffsetsPerTrigger" unless you want Spark to process till the end of the stream in each micro-batch. Even without "maxOffsetsPerTrigger", the lag can be non-zero by the time the micro-batch completes.
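To illustrate the option being discussed, a minimal sketch of a Kafka source with "maxOffsetsPerTrigger" set; the broker, topic, and the limit of 10000 are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka-rate-limit").getOrCreate()

    stream = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
        .option("subscribe", "events")                      # placeholder topic
        .option("maxOffsetsPerTrigger", 10000)              # cap on records read per micro-batch
        .load())

    query = (stream.selectExpr("CAST(value AS STRING) AS value")
        .writeStream
        .format("console")
        .start())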

Re: Kafka backlog - spark structured streaming

2018-07-30 Thread Burak Yavuz
If you don't set rate limiting through `maxOffsetsPerTrigger`, Structured Streaming will always process until the end of the stream. So the number of records waiting to be processed should be 0 at the start of each trigger. On Mon, Jul 30, 2018 at 8:03 AM, Kailash Kalahasti <

Kafka backlog - spark structured streaming

2018-07-30 Thread Kailash Kalahasti
Is there any way to find out the backlog on a Kafka topic while using Spark structured streaming? I checked a few consumer APIs, but those require enabling a group id for the stream, and it seems that is not allowed. Basically I want to know the number of records waiting to be processed. Any suggestions?
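Not the consumer-lag number asked about, but the per-trigger progress of a running query is available today and shows how many records each micro-batch pulled and the offset ranges it covered. A small sketch, assuming `query` is the StreamingQuery returned by writeStream.start():

    progress = query.lastProgress          # dict with the most recent trigger's metrics
    if progress is not None:
        print(progress["numInputRows"])    # rows ingested in the last micro-batch
        for source in progress["sources"]:
            print(source["startOffset"], source["endOffset"])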

Re: How to Create one DB connection per executor and close it after the job is done?

2018-07-30 Thread Patrick McGloin
You could use an object in Scala, of which only one instance will be created on each JVM / Executor. E.g. object MyDatabseSingleton { var dbConn = ??? } On Sat, 28 Jul 2018, 08:34 kant kodali, wrote: > Hi All, > > I understand creating a connection forEachPartition but I am wondering can >

Re: How to reduceByKeyAndWindow in Structured Streaming?

2018-07-30 Thread oripwk
Thanks guys, it really helps. -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

Using Spark Streaming for analyzing changing data

2018-07-30 Thread oripwk
We have a use case where there's a stream of events, where every event has an ID and its current state with a timestamp:
…
111,ready,1532949947
111,offline,1532949955
111,ongoing,1532949955
111,offline,1532949973
333,offline,1532949981
333,ongoing,1532949987
…
We want to ask questions about the
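As a starting point, a hedged sketch of parsing such an event stream into typed columns with structured streaming; the socket source, host, and port are placeholders (in practice this could be Kafka or files), and the actual questions to ask of the data are cut off above:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import split, col

    spark = SparkSession.builder.appName("changing-data").getOrCreate()

    # Placeholder source producing one CSV-like line per event
    lines = (spark.readStream
        .format("socket")
        .option("host", "localhost")
        .option("port", 9999)
        .load())

    # Each line looks like: 111,ready,1532949947
    events = (lines
        .withColumn("parts", split(col("value"), ","))
        .select(
            col("parts").getItem(0).alias("id"),
            col("parts").getItem(1).alias("state"),
            col("parts").getItem(2).cast("long").alias("timestamp")))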

How to add a new source to an existing structured streaming application, like a Kafka source

2018-07-30 Thread 杨浩
How to add a new source to an existing structured streaming application, like a Kafka source?
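The question is terse, but as a sketch under assumptions: one way to bring an additional Kafka source into an application is to define another readStream and union it with the existing streaming DataFrame before the query is started. Schemas must match, the broker and topic names are placeholders, and `existing_df` stands in for the stream the application already has:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("add-source").getOrCreate()

    new_source = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
        .option("subscribe", "new_topic")                   # placeholder topic
        .load()
        .selectExpr("CAST(value AS STRING) AS value"))

    # `existing_df` is the application's current streaming DataFrame with a matching schema
    combined = existing_df.union(new_source)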

How to read csv in dataframe

2018-07-30 Thread Lehak Dharmani
I am trying to read a csv into a spark dataframe. My OS = Ubuntu 18.04, spark version 2.3.1, python version 2.7.15. My code:
from pyspark import SparkConf
from pyspark import SparkContext
from pyspark import SQLContext
from pyspark.sql import SparkSession
conf = SparkConf()
sc = SparkContext(conf =
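The snippet above is cut off; for what it's worth, a minimal sketch of reading a CSV into a DataFrame in Spark 2.3 through SparkSession (the file path and options are placeholders; SQLContext, if it is needed at all, is imported from pyspark.sql):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("read-csv").getOrCreate()

    # Placeholder path and options
    df = (spark.read
        .option("header", "true")        # first line holds column names
        .option("inferSchema", "true")   # let Spark guess column types
        .csv("/path/to/file.csv"))

    df.show()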