Re: Spark on EMR suddenly stalling

2018-01-01 Thread Jeroen Miller
Hello Mans,

On 1 Jan 2018, at 17:12, M Singh wrote:
> I am not sure if I missed it - but can you let us know what is your input
> source and output sink?

Reading from S3 and writing to S3. However the never-ending task 0.0 happens in a stage way before outputting

Re: Spark on EMR suddenly stalling

2018-01-01 Thread Jeroen Miller
Hello Gourav,

On 30 Dec 2017, at 20:20, Gourav Sengupta wrote:
> Please try to use the SPARK UI from the way that AWS EMR recommends, it
> should be available from the resource manager. I never ever had any problem
> working with it.

THAT HAS ALWAYS BEEN MY PRIMARY

Re: Spark on EMR suddenly stalling

2017-12-29 Thread Jeroen Miller
On 28 Dec 2017, at 19:25, Patrick Alwell wrote:
> Dynamic allocation is great; but sometimes I’ve found explicitly setting the
> num executors, cores per executor, and memory per executor to be a better
> alternative.

No difference with spark.dynamicAllocation.enabled
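
For reference, a minimal sketch of the explicit sizing Patrick describes, assuming the settings are made on the SparkConf; the values below are illustrative assumptions, not ones reported in this thread:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.dynamicAllocation.enabled", "false") // take allocation out of YARN's hands
      .set("spark.executor.instances", "10")           // num executors (assumed value)
      .set("spark.executor.cores", "4")                // cores per executor (assumed value)
      .set("spark.executor.memory", "8g")              // memory per executor (assumed value)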

Fwd: Spark on EMR suddenly stalling

2017-12-29 Thread Jeroen Miller
Hello,

Just a quick update, as I have not made much progress yet.

On 28 Dec 2017, at 21:09, Gourav Sengupta wrote:
> can you try to then use the EMR version 5.10 instead or EMR version 5.11
> instead?

Same issue with EMR 5.11.0. Task 0 in one stage never finishes.

>

Re: Spark on EMR suddenly stalling

2017-12-28 Thread Jeroen Miller
On 28 Dec 2017, at 19:42, Gourav Sengupta wrote:
> In the EMR cluster what are the other applications that you have enabled
> (like HIVE, FLUME, Livy, etc).

Nothing that I can think of, just a Spark step (unless EMR is doing fancy stuff behind my back).

> Are you

Re: Spark on EMR suddenly stalling

2017-12-28 Thread Jeroen Miller
On 28 Dec 2017, at 19:40, Maximiliano Felice wrote:
> I experienced a similar issue a few weeks ago. The situation was a result of
> a mix of speculative execution and OOM issues in the container.

Interesting! However I don't have any OOM exception in the logs.
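
If speculation is a suspect, it can be ruled out explicitly; a one-line sketch, assuming the same SparkConf approach (spark.speculation is off by default unless the platform or the job enables it):

    val conf = new SparkConf()
      .set("spark.speculation", "false") // never re-launch slow tasks speculatively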

Fwd: Spark on EMR suddenly stalling

2017-12-28 Thread Jeroen Miller
On 28 Dec 2017, at 19:25, Patrick Alwell wrote:
> You are using groupByKey() have you thought of an alternative like
> aggregateByKey() or combineByKey() to reduce shuffling?

I am aware of this indeed. I do have a groupByKey() that is difficult to avoid, but the
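
A minimal sketch of the substitution Patrick suggests; the pair RDD and the sum aggregation are illustrative assumptions, not the actual job from this thread (sc is an existing SparkContext):

    import org.apache.spark.rdd.RDD

    val pairs: RDD[(String, Int)] = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

    // groupByKey() shuffles every value across the network before aggregating:
    val viaGroup = pairs.groupByKey().mapValues(_.sum)

    // aggregateByKey() pre-aggregates map-side, so far less data is shuffled:
    val viaAggregate = pairs.aggregateByKey(0)(_ + _, _ + _)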

Re: Spark on EMR suddenly stalling

2017-12-28 Thread Jeroen Miller
On 28 Dec 2017, at 17:41, Richard Qiao wrote:
> Are you able to specify which path of data filled up?

I can narrow it down to a bunch of files but it's not so straightforward.

> Any logs not rolled over?

I have to manually terminate the cluster but there is nothing

Spark on EMR suddenly stalling

2017-12-28 Thread Jeroen Miller
Dear Sparkers,

Once again in times of desperation, I leave what remains of my mental sanity to this wise and knowledgeable community.

I have a Spark job (on EMR 5.8.0) which had been running daily for months, if not the whole year, with absolutely no supervision. This changed all of a sudden

Re: Processing a splittable file from a single executor

2017-11-16 Thread Jeroen Miller
On 16 Nov 2017, at 10:22, Michael Shtelma wrote:
> you call repartition(1) before starting processing your files. This
> will ensure that you end up with just one partition.

One question and one remark:

Q) val ds = sqlContext.read.parquet(path).repartition(1)

Am I
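
For comparison, a sketch of the two ways to collapse a dataset to one partition, reusing the thread's own sqlContext and a placeholder path:

    // repartition(1) performs a full shuffle into a single partition:
    val viaRepartition = sqlContext.read.parquet(path).repartition(1)

    // coalesce(1) merges partitions without a shuffle, which is cheaper, but
    // Spark may then run the upstream read in a single task as well:
    val viaCoalesce = sqlContext.read.parquet(path).coalesce(1)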

Processing a splittable file from a single executor

2017-11-16 Thread Jeroen Miller
Dear Sparkers,

A while back, I asked how to process non-splittable files in parallel, one file per executor. Vadim's suggested "scheduling within an application" approach worked out beautifully.

I am now facing the 'opposite' problem:

- I have a bunch of parquet files to process
- Once

Re: Generating StructType from dataframe.printSchema

2017-10-16 Thread Jeroen Miller
On 16 Oct 2017, at 16:22, Silvio Fiorito wrote:
> [...] then just infer the schema from a single file and reuse it when loading
> the whole data set:

Well, that is a possibility indeed.

Thanks,
Jeroen
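
A minimal sketch of Silvio's suggestion, assuming a SparkSession named spark and placeholder S3 paths:

    // Infer the schema once, from a single small file:
    val schema = spark.read.json("s3://bucket/sample.json.gz").schema

    // Reuse it when loading the whole data set, skipping schema inference:
    val df = spark.read.schema(schema).json("s3://bucket/data/*.json.gz")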

Generating StructType from dataframe.printSchema

2017-10-16 Thread Jeroen Miller
Hello Spark users,

Does anyone know if there is a way to generate the Scala code for a complex structure just from the output of dataframe.printSchema? I have to analyse a significant volume of data and want to explicitly set the schema(s) to avoid having to read my (compressed) JSON files
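
One related possibility not raised in the thread (a sketch, with placeholder paths): rather than generating Scala source from printSchema, infer the schema once, serialize it to JSON, and rebuild the StructType on later runs:

    import org.apache.spark.sql.types.{DataType, StructType}

    // Infer once from a sample and persist the schema as a JSON string:
    val schemaJson: String = spark.read.json("s3://bucket/sample.json.gz").schema.json

    // Later runs rebuild the StructType without touching the data:
    val schema = DataType.fromJson(schemaJson).asInstanceOf[StructType]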

Re: More instances = slower Spark job

2017-10-01 Thread Jeroen Miller
Vadim's "scheduling within an application" approach turned out to be excellent, at least on a single node with the CPU usage reaching about 90%. I directly implemented the code template that Vadim kindly provided: parallel_collection_paths.foreach( path => { val lines =

Re: More instances = slower Spark job

2017-10-01 Thread Jeroen Miller
On Fri, Sep 29, 2017 at 12:20 AM, Gourav Sengupta wrote:
> Why are you not using JSON reader of SPARK?

Since the filter I want to perform is so simple, I do not want to spend time and memory to deserialise the JSON lines.

Jeroen

Re: More instances = slower Spark job

2017-09-28 Thread Jeroen Miller
On Thu, Sep 28, 2017 at 9:02 PM, Jörn Franke wrote:
> It looks to me a little bit strange. First json.gz files are single threaded,
> ie each file can only be processed by one thread (so it is good to have many
> files of around 128 MB to 512 MB size each).

Indeed.

Re: More instances = slower Spark job

2017-09-28 Thread Jeroen Miller
More details on what I want to achieve. Maybe someone can suggest a course of action.

My processing is extremely simple: reading .json.gz text files, filtering each line according to a regex, and saving the surviving lines in a similarly named .gz file. Unfortunately changing the data format is
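
A minimal sketch of that pipeline, assuming an RDD job with placeholder paths and pattern; note it writes one output directory per run rather than the per-input-file naming described above:

    import org.apache.hadoop.io.compress.GzipCodec

    val pattern = "some-pattern".r // placeholder regex

    // Gzip is not splittable, so each .json.gz file is read by a single task:
    val lines = sc.textFile("s3://bucket/input/*.json.gz")
    val kept  = lines.filter(line => pattern.findFirstIn(line).isDefined)

    // Compress the surviving lines back to gzip on write:
    kept.saveAsTextFile("s3://bucket/output/", classOf[GzipCodec])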

More instances = slower Spark job

2017-09-28 Thread Jeroen Miller
Hello,

I am experiencing a disappointing performance issue with my Spark jobs as I scale up the number of instances. The task is trivial: I am loading large (compressed) text files from S3, filtering out lines that do not match a regex, counting the number of remaining lines and saving the

Computing on each partition/executor with "persistent" data

2016-06-13 Thread Jeroen Miller
Dear fellow Sparkers,

I am barely dipping my toes into the Spark world and I was wondering if the following workflow can be implemented in Spark:

1. Initialize custom data structure DS on each executor. These data structures DS should live until the end of the program.
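
Point 1 maps onto a standard JVM trick, sketched below with placeholder names; a Scala object is initialized lazily, once per executor JVM, and lives until that executor shuts down:

    // One instance per executor JVM, shared by all tasks running there:
    object ExecutorState {
      lazy val ds: scala.collection.mutable.Map[String, Long] =
        scala.collection.mutable.Map.empty // stands in for the custom DS
    }

    // Every task touching ExecutorState.ds on a given executor sees the same
    // instance (rdd is an assumed RDD[String]):
    rdd.foreachPartition { partition =>
      partition.foreach { key =>
        ExecutorState.ds(key) = ExecutorState.ds.getOrElse(key, 0L) + 1
      }
    }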

Sequential computation over several partitions

2016-06-07 Thread Jeroen Miller
Dear fellow Sparkers,

I am a new Spark user and I am trying to solve a (conceptually simple) problem which may not be a good use case for Spark, at least for the RDD API. But before I turn my back on it, I would rather have the opinion of more knowledgeable developers than me, as it is highly