Re: partitionBy causing OOM

2017-09-25 Thread ayan guha
Another possible option would be creating a partitioned table in Hive and using dynamic partitioning while inserting. This will not require Spark to do an explicit partitionBy. On Tue, 26 Sep 2017 at 12:39 pm, Ankur Srivastava < ankur.srivast...@gmail.com> wrote: > Hi Amit, > > Spark keeps the
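A minimal sketch of this suggestion, assuming a hypothetical Hive table named events partitioned by a dt column (table, column, and path names are illustrative, not from the thread):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("dynamic-partition-insert")
             .enableHiveSupport()
             .getOrCreate())

    # Hive-side switches that let partitions be created on the fly
    spark.sql("SET hive.exec.dynamic.partition=true")
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

    spark.sql("""
        CREATE TABLE IF NOT EXISTS events (id BIGINT, payload STRING)
        PARTITIONED BY (dt STRING)
        STORED AS PARQUET
    """)

    # Hive routes each row to its partition; the partition column goes last
    spark.read.json("/path/to/input").createOrReplaceTempView("staging")
    spark.sql("INSERT INTO TABLE events PARTITION (dt) "
              "SELECT id, payload, dt FROM staging")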

Re: partitionBy causing OOM

2017-09-25 Thread Ankur Srivastava
Hi Amit, Spark keeps the partition that it is working on in memory (and does not spill to disk even if it is running OOM). Also, since you are getting OOM when using partitionBy (and not when you just use flatMap), there should be one (or a few) dates on which your partition size is bigger than the
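One common mitigation consistent with this explanation (my sketch, not part of the truncated message; the date column name and paths are hypothetical) is to repartition by the output partition column first, so each task holds rows for only one date at a time:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partition-by-mitigation").getOrCreate()
    df = spark.read.json("/path/to/input")

    # Shuffle rows so each task holds (roughly) one date's data, instead of
    # every task buffering open writers for every date it happens to see.
    (df.repartition("date")
       .write
       .partitionBy("date")
       .json("/path/to/output"))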

Re: partitionBy causing OOM

2017-09-25 Thread 孫澤恩
Hi, Amit, Maybe you can change the configuration spark.sql.shuffle.partitions. The default is 200; changing this property changes the task number when you are using the DataFrame API. > On 26 Sep 2017, at 1:25 AM, Amit Sela wrote: > > I'm trying to run a simple pyspark
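For reference, in PySpark this setting can be changed at runtime (the value 400 below is purely illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Number of partitions used when shuffling data for joins/aggregations
    # in the DataFrame/SQL API; the default in Spark 2.x is 200.
    spark.conf.set("spark.sql.shuffle.partitions", "400")
    print(spark.conf.get("spark.sql.shuffle.partitions"))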

Unpersist all from memory in spark 2.2

2017-09-25 Thread Cesar
Is there a way to unpersist all data frames, data sets, and/or RDDs in Spark 2.2 in a single call? Thanks -- Cesar Flores
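One partial answer, as a sketch: for anything cached through the SQL/DataFrame API, spark.catalog.clearCache() drops the whole in-memory cache in a single call; RDDs persisted directly still need individual unpersist() calls, as far as I know.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Removes all cached tables/DataFrames/Datasets from the in-memory cache
    spark.catalog.clearCache()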

Announcing Spark on Kubernetes release 0.4.0

2017-09-25 Thread Erik Erlandson
The Spark on Kubernetes development community is pleased to announce release 0.4.0 of Apache Spark with a native Kubernetes scheduler back-end! The dev community is planning to use this release as the reference for upstreaming native Kubernetes capability over the Spark 2.3 release cycle. This

Re: What are factors need to Be considered when upgrading to Spark 2.1.0 from Spark 1.6.0

2017-09-25 Thread Gokula Krishnan D
Thanks for the reply. Forgot to mention that our batch ETL jobs are in core Spark. On Sep 22, 2017, at 3:13 PM, Vadim Semenov wrote: 1. 40s is pretty negligible unless you run your job very frequently; there can be many factors that influence that. 2. Try to

partitionBy causing OOM

2017-09-25 Thread Amit Sela
I'm trying to run a simple pyspark application that reads from a file (JSON), flattens it (explode), and writes back to a file (JSON) partitioned by date using DataFrameWriter.partitionBy(*cols). I keep getting OOMEs like: java.lang.OutOfMemoryError: Java heap space at
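A minimal reconstruction of the pipeline as described (the records and date column names are my guesses; paths are placeholders):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, explode

    spark = SparkSession.builder.appName("flatten-and-partition").getOrCreate()

    df = spark.read.json("/path/to/input")

    # Flatten the nested array, then write one output directory per date value
    flat = df.select(col("date"), explode(col("records")).alias("record"))
    flat.write.partitionBy("date").json("/path/to/output")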

How to write dataframe to kafka topic in spark streaming application using pyspark?

2017-09-25 Thread umargeek
Can anyone provide me with a code snippet/steps to write a data frame to a Kafka topic in a Spark Streaming application using pyspark with Spark 2.1.1 and Kafka 0.8 (direct stream approach)? Thanks, Umar -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
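Spark 2.1 ships no Kafka sink usable from PySpark DStreams, so one common pattern (a sketch under that assumption, using the third-party kafka-python package; broker address and topic name are placeholders) is to open a producer per partition:

    from kafka import KafkaProducer  # third-party kafka-python package

    def send_partition(rows):
        # One producer per partition, since producers cannot be serialized
        # and shipped from the driver to the executors.
        producer = KafkaProducer(bootstrap_servers="broker:9092")
        for row in rows:
            producer.send("output-topic", str(row).encode("utf-8"))
        producer.flush()
        producer.close()

    def publish(rdd):
        rdd.foreachPartition(send_partition)

    # stream is a DStream from KafkaUtils.createDirectStream(...); a DataFrame
    # could go through df.toJSON().foreachPartition(send_partition) instead.
    # stream.foreachRDD(publish)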

hive2 query using SparkSQL seems wrong

2017-09-25 Thread Cinyoung Hur
Hi, I'm using Hive 2.3.0, Spark 2.1.1, and Zeppelin 0.7.2. When I submit a query in the Hive interpreter, it works fine. I could see exactly the same query in the Zeppelin notebook and the HiveServer2 web UI. However, when I submitted a query using Spark SQL, the query seemed wrong. For example, every column is with

Re: Offline environment

2017-09-25 Thread Georg Heiler
Just build a fat jar and do not apply --packages. serkan taş wrote on Mon, 25 Sep 2017 at 09:24: > Hi, > > Every time I submit a Spark job, it checks the dependent jars from the remote Maven > repo. > > Is it possible to set Spark to load the cached jars first rather than > looking

Offline environment

2017-09-25 Thread serkan taş
Hi, Every time I submit a Spark job, it checks the dependent jars from the remote Maven repo. Is it possible to set Spark to load the cached jars first rather than looking for an internet connection?

RE: using R with Spark

2017-09-25 Thread Adaryl Wakefield
Yeah, I saw that on my cheat sheet. It's marked as "Experimental", which is somewhat ominous. Adaryl "Bob" Wakefield, MBA Principal Mass Street Analytics, LLC 913.938.6685 www.massstreet.net