Re: [ANNOUNCE] Announcing Spark 1.3!

2015-03-13 Thread Sebastián Ramírez
Awesome! Thanks! *Sebastián Ramírez* Head of Software Development http://www.senseta.com Tel: (+571) 795 7950 ext: 1012 Cel: (+57) 300 370 77 10 Calle 73 No 7 - 06 Piso 4 Linkedin: co.linkedin.com/in/tiangolo/ Twitter: @tiangolo https://twitter.com/tiangolo

PySpark: Python 2.7 cluster installation script (with Numpy, IPython, etc)

2015-03-11 Thread Sebastián Ramírez
a *simple script which helps install Anaconda Python on the machines of a cluster* more easily. I wanted to share it here, in case it can help someone wanting to use PySpark. https://github.com/tiangolo/anaconda_cluster_install *Sebastián Ramírez*

Re: Pyspark save Decision Tree Model with joblib/pickle

2015-02-24 Thread Sebastián Ramírez
Great to know, thanks Xiangrui. *Sebastián Ramírez*

Re: Is Ubuntu server or desktop better for spark cluster

2015-02-24 Thread Sebastián Ramírez
a terminal Ctrl+Alt+F1 # Shut down the GUI sudo stop lightdm (for reference: http://askubuntu.com/questions/148321/how-do-i-stop-gui) *Sebastián Ramírez*

Re: Pyspark save Decision Tree Model with joblib/pickle

2015-02-23 Thread Sebastián Ramírez
in pseudo-code that you can save to a file. Then, you can parse that pseudo-code to write a proper script that runs the Decision Tree. Actually, that's what I did for a Random Forest (an ensemble of Decision Trees). Hope that helps, *Sebastián Ramírez*
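
The workaround described above can be sketched in plain Python, with no Spark installation required: the tree description that PySpark's `DecisionTreeModel.toDebugString()` prints can be hand-translated into an ordinary function. The feature index and threshold below are made up for illustration, not taken from any real model.

```python
# Hypothetical toDebugString() output being translated:
#   If (feature 0 <= 2.5)
#     Predict: 0.0
#   Else (feature 0 > 2.5)
#     Predict: 1.0

def predict(features):
    """Decision tree hand-translated from the pseudo-code above."""
    if features[0] <= 2.5:
        return 0.0
    return 1.0

print(predict([1.0]))  # 0.0
print(predict([3.0]))  # 1.0
```

The translated function depends on nothing but the standard library, so it can be pickled or shipped anywhere, which is the point of the workaround.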

Re: Anaconda IPython notebook working with CDH Spark

2014-12-30 Thread Sebastián Ramírez
/Anaconda-2.1.0-Linux-x86_64.sh # Or the current link at the moment you are doing it: https://store.continuum.io/cshop/anaconda/ bash Anaconda*.sh # When asked whether to set it as the default Python, or to add Anaconda to the PATH (I don't remember the exact wording), choose yes. I hope that helps, *Sebastián

Re: Pyspark 1.1.1 error with large number of records - serializer.dump_stream(func(split_index, iterator), outfile)

2014-12-16 Thread Sebastián Ramírez
that helps. Best, *Sebastián Ramírez*

Re: pyspark sc.textFile uses only 4 out of 32 threads per node

2014-12-16 Thread Sebastián Ramírez
, and aren't applied until they are needed by an action (and, for me, it happened for reads too some time ago). You can try calling .first() on your RDD once in a while to force it to load the RDD onto your cluster (but it might not be the cleanest way to do it). *Sebastián Ramírez*
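
The lazy-evaluation behaviour described above can be illustrated without a cluster. This plain-Python generator sketch (an analogy, not Spark code) mimics how a transformation defers work until an action pulls on it, which is why calling `.first()` forces the load:

```python
executed = []  # records when the "transformation" actually runs

def double_all(items):
    """Lazy 'transformation': nothing runs until something iterates it."""
    for x in items:
        executed.append(x)
        yield x * 2

pipeline = double_all(range(5))   # like rdd.map(...): defines work, runs nothing
assert executed == []             # still lazy, no element processed yet

first = next(pipeline)            # like rdd.first(): an action forces evaluation
assert first == 0
assert executed == [0]            # only as much work as the action needed
```

Note that the action only evaluated the single element it needed, which matches the snippet's caveat: forcing evaluation this way works, but it is a side effect rather than a clean API for preloading data.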