Thank you for your response, Ewan. I had a quick look yesterday and it
was there, but today at work I tried to open it again to start working
on it and it appears to have been removed. Is this correct?
Thanks,
Tom
On 12 April 2015 at 06:58, Ewan Higgs <ewan.hi...@ugent.be> wrote:
Hi all.
The code is linked from my repo:
https://github.com/ehiggs/spark-terasort
This is an example Spark program for running TeraSort benchmarks. It is
based on work from Reynold Xin's branch
https://github.com/rxin/spark/tree/terasort, but it is not the same
TeraSort program that
threads per executor on some tasks, but 32 MB for thread stack overhead
should do. Maybe the issue is sockets or a memory leak in the network
communication.
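For a rough sense of where a figure like that comes from (my assumption about what the 32 MB covers, using the JVM's default thread stack of roughly 1 MB; the real value depends on -Xss):

    # Back-of-the-envelope: off-heap memory consumed by task-thread stacks.
    # Assumes the JVM default stack size of ~1 MB per thread (-Xss1m).
    threads_per_executor = 32
    stack_mb_per_thread = 1
    print(threads_per_executor * stack_mb_per_thread, "MB of thread stack overhead")  # 32 MB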
On 13/07/15 09:15, Ewan Higgs wrote:
It depends on how large the XML files are and how you're processing them.
If you're using !ENTITY tags
Hi Gavin,
You can increase the speed by choosing a better encoding. A little bit
of ETL goes a long way. For example, as you're working with Spark SQL
you probably have a tabular format, so you could use CSV so you don't
need to parse the field names on each entry (and it will also reduce the file
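A minimal PySpark sketch of the kind of ETL I mean, assuming the source is line-delimited JSON (Spark has no built-in XML reader) and a recent DataFrame API; the paths are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("json-to-csv-etl").getOrCreate()

    # Parse the verbose source format once; every record repeats its field names.
    df = spark.read.json("data/events.json")

    # Persist it as CSV with a single header row, so later Spark SQL jobs
    # skip the per-record field-name parsing (and the files shrink too).
    (df.write
       .option("header", True)
       .mode("overwrite")
       .csv("data/events_csv"))

    # Subsequent runs read the cheaper encoding.
    events = spark.read.option("header", True).option("inferSchema", True).csv("data/events_csv")
    events.createOrReplaceTempView("events")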
Jonathan,
Did you ever get to the bottom of this? I have some users working with
Spark in a classroom setting and our example notebooks run into problems
where there is so much spilled to disk that they run out of quota. A
1.5 GB input set becomes >30 GB of spilled data on disk. I looked into how
Hi all,
We are running a class with PySpark notebooks for data analysis. Some of
the notebooks are fairly long and have a lot of operations. Through the
course of a notebook, the shuffle storage expands considerably and
often exceeds quota (e.g. a 1.5 GB input expands to 24 GB in shuffle
files). Closing
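The rest of that message is cut off, but here is a sketch of the kind of mitigation we would reach for, assuming the quota applies to the users' home directories: point Spark's spill/shuffle directory at a scratch volume and stop the session when a notebook finishes so its shuffle files get cleaned up (the scratch path is a placeholder):

    from pyspark.sql import SparkSession

    # Keep shuffle and spill files on a scratch volume instead of the
    # quota'd home directory ("/scratch/student42/spark-tmp" is illustrative).
    spark = (SparkSession.builder
             .appName("classroom-notebook")
             .config("spark.local.dir", "/scratch/student42/spark-tmp")
             .getOrCreate())

    # ... notebook analysis cells ...

    # Stopping the session at the end of the notebook lets Spark clean up
    # its temporary shuffle files instead of letting them accumulate.
    spark.stop()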
Hi all,
Back in September a set of machine learning benchmark results was
published here:
https://github.com/szilard/benchm-ml/
Spark's Random Forest seemed to fall down with memory issues at about
10 million entries:
https://github.com/szilard/benchm-ml/blob/master/2-rf/5c-spark-crash.txt
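For context, a minimal PySpark sketch of that kind of Random Forest run, with the parameters that most affect memory during tree induction called out; the file path and column names are placeholders, not taken from the benchmark:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import RandomForestClassifier

    spark = SparkSession.builder.appName("rf-memory-sketch").getOrCreate()

    # Placeholder input: a CSV of numeric feature columns plus a 0/1 label.
    df = (spark.read.option("header", True).option("inferSchema", True)
          .csv("data/train-10m.csv")
          .withColumn("label", col("label").cast("double")))
    features = [c for c in df.columns if c != "label"]
    assembled = VectorAssembler(inputCols=features, outputCol="features").transform(df)

    # maxDepth, maxBins and maxMemoryInMB drive memory use while growing the
    # trees: deeper trees and more bins mean much larger histogram state.
    rf = RandomForestClassifier(labelCol="label", featuresCol="features",
                                numTrees=100, maxDepth=10, maxBins=32,
                                maxMemoryInMB=256, cacheNodeIds=True)
    model = rf.fit(assembled)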