I am exploring Spark because my dataset is becoming more and more difficult to 
manage and analyze. I'd appreciate it if anyone could provide feedback on the 
following questions:


*         I am especially interested in training machine learning models on 
large datasets.

o   Does PySpark have a Gradient Boosting Machine package that allows the user 
to run multiple iterations in the same command, similar to R's caret package? 
(See the sketch after these questions for what I have in mind.)

*         Also, does anyone know of benchmarks that illustrate when Spark is 
most (and least) appropriate to use?

o   I've often heard "when your data is not manageable on one computer", but I'd 
appreciate more concrete comparisons if possible.

o   If anyone has benchmarks that consider data size, type of operation, etc. 
that would be extremely helpful.

-  At what point does the efficiency gain overtake the overhead, and when is 
Spark substantially faster (compared to R's caret/gbm, h2o, Python, etc.)?
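
For context, here is a minimal sketch of the kind of workflow I'm hoping exists. 
I'm assuming that pyspark.ml's GBTClassifier combined with ParamGridBuilder and 
CrossValidator is the closest analogue to caret's tuning grid; the input path and 
column names below are just placeholders, and I haven't verified this is the 
recommended approach.

from pyspark.sql import SparkSession
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

spark = SparkSession.builder.appName("gbt-tuning-sketch").getOrCreate()

# Assumes a DataFrame with a "features" vector column and a binary "label"
# column, e.g. produced by pyspark.ml.feature.VectorAssembler.
train = spark.read.parquet("train.parquet")  # hypothetical path

gbt = GBTClassifier(labelCol="label", featuresCol="features")

# Grid of settings to try in a single fit() call, similar to caret's tuneGrid.
grid = (ParamGridBuilder()
        .addGrid(gbt.maxDepth, [3, 5])
        .addGrid(gbt.maxIter, [20, 50])
        .build())

cv = CrossValidator(estimator=gbt,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(labelCol="label"),
                    numFolds=3)

model = cv.fit(train)    # runs every parameter combination with 3-fold CV
best = model.bestModel   # best model according to the evaluator

If something like this (or an equivalent) is the standard way to tune a GBM in 
PySpark, a pointer to it would be great.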

Thanks so much
