Hello Mans,
On 1 Jan 2018, at 17:12, M Singh wrote:
> I am not sure if I missed it - but can you let us know what is your input
> source and output sink ?
Reading from S3 and writing to S3.
However, the never-ending task 0.0 happens in a stage well before outputting
Hello Gourav,
On 30 Dec 2017, at 20:20, Gourav Sengupta wrote:
> Please try to use the SPARK UI from the way that AWS EMR recommends, it
> should be available from the resource manager. I never ever had any problem
> working with it. THAT HAS ALWAYS BEEN MY PRIMARY
On 28 Dec 2017, at 19:25, Patrick Alwell wrote:
> Dynamic allocation is great; but sometimes I’ve found explicitly setting the
> num executors, cores per executor, and memory per executor to be a better
> alternative.
No difference with spark.dynamicAllocation.enabled
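For reference, explicitly sizing executors along the lines Patrick suggests would look roughly like the sketch below (the numbers are placeholders, not values from this job):

    // sketch only: fixed executor sizing instead of dynamic allocation; numbers are placeholders
    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    val conf = new SparkConf()
      .set("spark.dynamicAllocation.enabled", "false")
      .set("spark.executor.instances", "10")
      .set("spark.executor.cores", "4")
      .set("spark.executor.memory", "8g")
    val spark = SparkSession.builder().config(conf).getOrCreate()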
Hello,
Just a quick update, as I have not made much progress yet.
On 28 Dec 2017, at 21:09, Gourav Sengupta wrote:
> can you try to then use the EMR version 5.10 instead or EMR version 5.11
> instead?
Same issue with EMR 5.11.0. Task 0 in one stage never finishes.
On 28 Dec 2017, at 19:42, Gourav Sengupta wrote:
> In the EMR cluster what are the other applications that you have enabled
> (like HIVE, FLUME, Livy, etc).
Nothing that I can think of, just a Spark step (unless EMR is doing fancy stuff
behind my back).
> Are you
On 28 Dec 2017, at 19:40, Maximiliano Felice wrote:
> I experienced a similar issue a few weeks ago. The situation was a result of
> a mix of speculative execution and OOM issues in the container.
Interesting! However I don't have any OOM exception in the logs.
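If speculation rather than OOM were in play, ruling it out is a one-line configuration change (a sketch; spark.speculation is off by default, so this only matters if it was enabled somewhere):

    // sketch: make sure speculative task launching is disabled
    val conf = new org.apache.spark.SparkConf().set("spark.speculation", "false")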
On 28 Dec 2017, at 19:25, Patrick Alwell wrote:
> You are using groupByKey() have you thought of an alternative like
> aggregateByKey() or combineByKey() to reduce shuffling?
I am aware of this indeed. I do have a groupByKey() that is difficult to avoid,
but the
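For the record, the kind of rewrite Patrick is hinting at looks roughly like this, assuming only a per-key aggregate is needed rather than the full group (a sketch, not the actual job):

    // hypothetical pair RDD; sc is the SparkContext
    val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

    // groupByKey shuffles every value before anything is combined
    val viaGroup = pairs.groupByKey().mapValues(_.sum)

    // aggregateByKey combines map-side first, so much less data crosses the network
    val viaAggregate = pairs.aggregateByKey(0)(_ + _, _ + _)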
On 28 Dec 2017, at 17:41, Richard Qiao wrote:
> Are you able to specify which path of data filled up?
I can narrow it down to a bunch of files but it's not so straightforward.
> Any logs not rolled over?
I have to manually terminate the cluster but there is nothing
Dear Sparkers,
Once again in times of desperation, I leave what remains of my mental sanity to
this wise and knowledgeable community.
I have a Spark job (on EMR 5.8.0) which had been running daily for months, if
not the whole year, with absolutely no supervision. This changed all of a sudden
On 16 Nov 2017, at 10:22, Michael Shtelma wrote:
> you call repartition(1) before starting processing your files. This
> will ensure that you end up with just one partition.
One question and one remark:
Q) val ds = sqlContext.read.parquet(path).repartition(1)
Am I
Dear Sparkers,
A while back, I asked how to process non-splittable files in parallel, one file
per executor. Vadim's suggested "scheduling within an application" approach
worked out beautifully.
I am now facing the 'opposite' problem:
- I have a bunch of parquet files to process
- Once
On 16 Oct 2017, at 16:22, Silvio Fiorito wrote:
> [...] then just infer the schema from a single file and reuse it when loading
> the whole data set:
Well, that is a possibility indeed.
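For completeness, Silvio's suggestion would look roughly like this (the paths are made up; spark is the SparkSession):

    // infer the schema from a single file, then reuse it for the whole data set
    val schema = spark.read.parquet("s3://bucket/data/part-00000.parquet").schema
    val full   = spark.read.schema(schema).parquet("s3://bucket/data/")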
Thanks,
Jeroen
Hello Spark users,
Does anyone know if there is a way to generate the Scala code for a complex
structure just from the output of dataframe.printSchema?
I have to analyse a significant volume of data and want to explicitly set the
schema(s) to avoid having to read my (compressed) JSON files
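One possible workaround (a sketch, assuming it is acceptable to infer the schema once on a small sample rather than generate Scala source): save the inferred schema's JSON representation and rebuild the StructType from it on later runs, skipping inference over the full data:

    import org.apache.spark.sql.types.{DataType, StructType}

    // one-off: infer from a small sample and keep the schema as JSON text
    val schemaJson = spark.read.json("s3://bucket/sample.json.gz").schema.json

    // later runs: rebuild the StructType and skip inference over the full (compressed) data
    val schema = DataType.fromJson(schemaJson).asInstanceOf[StructType]
    val df = spark.read.schema(schema).json("s3://bucket/data/*.json.gz")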
Vadim's "scheduling within an application" approach turned out to be
excellent, at least on a single node with the CPU usage reaching about
90%. I directly implemented the code template that Vadim kindly
provided:
parallel_collection_paths.foreach(
path => {
val lines =
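A self-contained sketch of the same "scheduling within an application" pattern, with illustrative paths and an illustrative filter body (not the exact code of this job):

    // one Spark job per path, launched concurrently from a parallel collection
    val paths = Seq("s3://bucket/in/a.json.gz", "s3://bucket/in/b.json.gz").par
    paths.foreach { path =>
      val lines = spark.sparkContext.textFile(path)
      lines.filter(_.matches(".*PATTERN.*"))               // keep only matching lines
           .saveAsTextFile(path.replace("/in/", "/out/"))  // illustrative output location
    }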
On Fri, Sep 29, 2017 at 12:20 AM, Gourav Sengupta wrote:
> Why are you not using JSON reader of SPARK?
Since the filter I want to perform is so simple, I do not want to
spend time and memory to deserialise the JSON lines.
Jeroen
On Thu, Sep 28, 2017 at 9:02 PM, Jörn Franke wrote:
> It looks to me a little bit strange. First json.gz files are single threaded,
> ie each file can only be processed by one thread (so it is good to have many
> files of around 128 MB to 512 MB size each).
Indeed.
More details on what I want to achieve. Maybe someone can suggest a
course of action.
My processing is extremely simple: reading .json.gz text
files, filtering each line according to a regex, and saving the surviving
lines in a similarly named .gz file.
Unfortunately changing the data format is
Hello,
I am experiencing a disappointing performance issue with my Spark jobs
as I scale up the number of instances.
The task is trivial: I am loading large (compressed) text files from S3,
filtering out lines that do not match a regex, counting the number
of remaining lines and saving the
Dear fellow Sparkers,
I am barely dipping my toes into the Spark world and I was wondering if the
following workflow can be implemented in Spark:
1. Initialize a custom data structure DS on each executor.
These data structures DS should live until the end of the
program.
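A common pattern for this kind of per-executor state (a sketch, assuming one instance per executor JVM is what is meant by DS) is a lazily initialised singleton object; each executor creates it on first use and keeps it for the lifetime of its JVM:

    import java.util.concurrent.ConcurrentHashMap

    // hypothetical stand-in for DS: one instance per executor JVM, created lazily,
    // thread-safe because several tasks may run concurrently in one executor
    object ExecutorLocal {
      lazy val ds = new ConcurrentHashMap[String, Long]()
    }

    val rdd = sc.parallelize(1 to 100)          // sc: the SparkContext
    rdd.foreachPartition { iter =>
      val ds = ExecutorLocal.ds                 // same object for every task in this executor
      iter.foreach(x => ds.put(x.toString, x.toLong))
    }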
Dear fellow Sparkers,
I am a new Spark user and I am trying to solve a (conceptually simple)
problem which may not be a good use case for Spark, at least for the RDD
API. But before I turn my back on it, I would rather have the opinion of
more knowledgeable developers than me, as it is highly