Spark Random Forest Memory issues

2016-02-19 Thread Ewan Higgs
Hi all, Back in September a set of machine learning benchmark results was published here: https://github.com/szilard/benchm-ml/ Spark's Random Forest seemed to fall down with memory issues at about 10m entries: https://github.com/szilard/benchm-ml/blob/master/2-rf/5c-spark-crash.txt

Re: How to unpersist RDDs generated by ALS/MatrixFactorizationModel

2015-12-08 Thread Ewan Higgs
e. On Mon, Dec 7, 2015 at 1:42 PM, Ewan Higgs <ewan.hi...@ugent.be> wrote: Jonathan, Did you ever get to the bottom of this? I have some users working with Spark in a classroom setting and our example notebooks run into problems where there is so much spilled to disk that they run out of quo

Re: How to unpersist RDDs generated by ALS/MatrixFactorizationModel

2015-12-07 Thread Ewan Higgs
Jonathan, Did you ever get to the bottom of this? I have some users working with Spark in a classroom setting and our example notebooks run into problems where there is so much spilled to disk that they run out of quota. A 1.5G input set becomes >30G of spilled data on disk. I looked into how

Re: Checkpointing not removing shuffle files from local disk

2015-12-03 Thread Ewan Higgs
Hi all, We are running a class with PySpark notebooks for data analysis. Some of the notebooks are fairly long and have a lot of operations. Through the course of a notebook, the shuffle storage expands considerably and often exceeds quota (e.g. 1.5GB input expands to 24GB in shuffle files). Closing

Re: How to increase the Json parsing speed

2015-08-28 Thread Ewan Higgs
Hi Gavin, You can increase the speed by choosing a better encoding. A little bit of ETL goes a long way. For example, as you're working with Spark SQL, you probably have a tabular format, so you could use CSV and avoid parsing the field names on each entry (and it will also reduce the file
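To illustrate the point about field names, here is a minimal sketch using only the Python standard library. It is not from the original message. The sample records and their field names (`user_id`, `country`, `score`) are made up for illustration. It serialises the same rows as JSON lines and as CSV and compares the encoded sizes.

```python
import csv
import io
import json

# Hypothetical sample records; the field names are invented for this sketch.
records = [{"user_id": i, "country": "BE", "score": i * 0.5} for i in range(1000)]


def as_json_lines(rows):
    """One JSON object per line: every line repeats all the field names."""
    return "\n".join(json.dumps(r) for r in rows) + "\n"


def as_csv(rows):
    """CSV: the field names appear once, in the header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


json_text = as_json_lines(records)
csv_text = as_csv(records)

json_size = len(json_text.encode("utf-8"))
csv_size = len(csv_text.encode("utf-8"))

print("JSON lines bytes:", json_size)
print("CSV bytes:", csv_size)
```

The exact ratio depends on how long the field names are relative to the values, so treat the numbers as indicative only. The actual parsing speed in Spark would need to be measured on the real data.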

Re: RECEIVED SIGNAL 15: SIGTERM

2015-07-13 Thread Ewan Higgs
threads per executor on some tasks but 32meg for stack thread overhead should do. Maybe the issue is sockets or some mem leak of network communication. On 13/07/15 09:15, Ewan Higgs wrote: It depends on how large the xml files are and how you're processing them. If you're using !ENTITY tags

Re: Spark TeraSort source request

2015-04-13 Thread Ewan Higgs
you for your response Ewan. I quickly looked yesterday and it was there, but today at work I tried to open it again to start working on it, but it appears to be removed. Is this correct? Thanks, Tom On 12 April 2015 at 06:58, Ewan Higgs <ewan.hi...@ugent.be> wrote

Re: Spark TeraSort source request

2015-04-12 Thread Ewan Higgs
Hi all. The code is linked from my repo: https://github.com/ehiggs/spark-terasort This is an example Spark program for running TeraSort benchmarks. It is based on work from Reynold Xin's branch https://github.com/rxin/spark/tree/terasort, but it is not the same TeraSort program that