pyspark aggregate vectors from onehotencoder

2016-07-04 Thread Sebastian Kuepers
Hey, what is the best practice for aggregating the vectors from OneHotEncoders in PySpark? UDAFs are still not available in Python. Is there any way to do it with Spark SQL? Or do you have to switch to RDDs and do it with a reduceByKey, for example? Thanks, Sebastian
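A minimal sketch of the reduceByKey route the question mentions, run locally without Spark. It assumes each one-hot vector has been turned into a plain {index: value} dict; `add_sparse`, `aggregate_by_key`, and the sample rows are hypothetical names and data, not part of any Spark API. On a cluster, `add_sparse` is the kind of function one might pass to reduceByKey.

```python
from collections import defaultdict
from functools import reduce


def add_sparse(a, b):
    """Element-wise sum of two sparse vectors stored as {index: value} dicts."""
    out = dict(a)
    for idx, val in b.items():
        out[idx] = out.get(idx, 0.0) + val
    return out


def aggregate_by_key(pairs):
    """Local stand-in for a reduceByKey with add_sparse: group, then fold each group."""
    groups = defaultdict(list)
    for key, vec in pairs:
        groups[key].append(vec)
    return {key: reduce(add_sparse, vecs) for key, vecs in groups.items()}


# Hypothetical data: (key, one-hot vector) where the hot index is the only entry.
rows = [("u1", {0: 1.0}), ("u1", {2: 1.0}), ("u2", {1: 1.0}), ("u1", {0: 1.0})]
totals = aggregate_by_key(rows)
```

Because `add_sparse` is associative and commutative, the order in which partial sums are combined does not change the result, which is what a per-key reduce needs.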

ChiSqSelector Selected Features Indices

2016-06-08 Thread Sebastian Kuepers
Hi there, what is the best way to get from pyspark.mllib.feature.ChiSqSelector(numTopFeatures) the indices of the selected features in the original input vector? Shouldn't the model contain this information? Thanks!
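One possible workaround, sketched under a stated assumption: if the fitted model's transform only drops columns and leaves the remaining values untouched, then transforming a probe vector whose values equal their own positions returns the selected indices. `selected_indices` and `fake_transform` are hypothetical names; `fake_transform` stands in for a fitted selector so the sketch runs without Spark.

```python
def selected_indices(transform, num_features):
    """Recover kept positions by transforming a vector whose values equal their indices."""
    probe = [float(i) for i in range(num_features)]
    return [int(v) for v in transform(probe)]


# Hypothetical stand-in for a fitted selector that keeps columns 1 and 3.
def fake_transform(vec):
    return [vec[i] for i in (1, 3)]


kept = selected_indices(fake_transform, 5)
```

With a real model, the callable would wrap the model's transform and convert its output to a list; whether that holds for a given Spark version is an assumption to verify.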

Correlation Matrix Limits

2016-03-10 Thread Sebastian Kuepers
Hello, I am planning to use the corr() function from the pyspark.mllib.stat package to compute a correlation matrix. Will this run in a distributed fashion, and does it scale well if you have vectors with over a million columns? Thanks, Sebastian
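A back-of-the-envelope sketch of the size involved. It does not say how corr() is executed; it only assumes the result is a dense n-by-n matrix of 8-byte doubles. `dense_matrix_bytes` is a hypothetical helper.

```python
def dense_matrix_bytes(num_columns, bytes_per_value=8):
    """Storage for a dense num_columns x num_columns matrix."""
    return num_columns * num_columns * bytes_per_value


size = dense_matrix_bytes(10**6)   # one million columns
size_tb = size / 1e12              # decimal terabytes
```

Under that assumption, a million columns means 10^12 entries, or about 8 TB for the matrix alone, so the output size matters independently of how the computation is distributed.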

collect() local faster than 4 node cluster

2015-11-03 Thread Sebastian Kuepers
Hey, with collect(), an RDD's elements are sent back to the driver as a list. I have a 4-node cluster (based on Mesos) in a datacenter and my local dev machine. I work with a small 200MB dataset just for testing during development right now. The collect() tasks are running for times

Re: collect() local faster than 4 node cluster

2015-11-03 Thread Sebastian Kuepers
I actually figured out that it had to do with the Mesos run mode of Spark. Setting spark.mesos.coarse to true made all the difference. So the primary performance bummer was actually the fine-grained mode and therefore Mesos overhead. Thanks! Sebastian 2015-11-03 20:07 GMT+01:00 Sebastian
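A small sketch of how the setting above could be passed on the command line through the `--conf key=value` option. `spark_cli_args` is a hypothetical helper that only assembles the argument list, and `mesos://host:5050` is a placeholder master URL.

```python
def spark_cli_args(master, conf):
    """Build launcher arguments: the master URL plus one --conf pair per setting."""
    args = ["--master", master]
    for key in sorted(conf):
        args += ["--conf", "{}={}".format(key, conf[key])]
    return args


# Switch Mesos from fine-grained to coarse-grained mode, as described above.
args = spark_cli_args("mesos://host:5050", {"spark.mesos.coarse": "true"})
command = " ".join(["pyspark"] + args)
```

The same key and value could instead be set in the application's Spark configuration; the command line is used here only because it keeps the sketch free of Spark imports.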

Spark, Mesos problems with remote connections

2015-11-02 Thread Sebastian Kuepers
Hey, I have a Mesos cluster with a single master. If I run the following directly on the master machine: pyspark --master mesos://host:5050 everything works just fine. If I try to connect to the master by starting a driver from my laptop, everything stops after the following log output

Save RandomForest Model from ML package

2015-10-22 Thread Sebastian Kuepers
Hey, I am trying to figure out the best practice for saving and loading models which have been fitted with the ML package, e.g. with the RandomForest classifier. There is PMML support in the MLlib package afaik but not in ML - is that correct? How do you approach this, so that you do not have to fit

Model summary for linear and logistic regression.

2015-09-11 Thread Sebastian Kuepers
Hey, the 1.5.0 release notes say that there are now model summaries for logistic regressions available. But I can't find them in the current documentation. Any help very much appreciated! Thanks, Sebastian