Sampling data on RDD vs sampling data on Dataframes

2017-05-21 Thread Marco Didonna
Hello, my team and I have developed a fairly large big data application using only the DataFrame API (Spark 1.6.3). Since our application uses machine learning for prediction, we need to sample the training dataset so that the data is not skewed. To achieve this, we use stratified

IPython notebook, EC2 Spark cluster and matplotlib

2015-07-10 Thread Marco Didonna
Hello everybody, I'm running a two-node Spark cluster on EC2, created using the provided scripts. I then ssh into the master and invoke PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS='notebook --profile=pyspark' spark/bin/pyspark. This launches a Spark notebook which has been instructed

Spark MOOC by Berkeley and Databricks

2014-12-03 Thread Marco Didonna
Hello everybody, in case you missed it, Databricks and Berkeley have announced a free MOOC on Spark and another one on scalable machine learning using Spark. Both courses are free, but if you want a verified certificate of completion you need to donate at least $50. I did it, it's a great