I am exploring Spark because my dataset is becoming increasingly difficult to manage and analyze. I'd appreciate it if anyone could provide feedback on the following questions:
* I am especially interested in training machine learning algorithms on large datasets.
  * Does PySpark have a Gradient Boosting Machine package that allows the user to run multiple iterations in the same command, similar to R's caret package? (A sketch of the kind of workflow I have in mind is at the end of this post.)
* Also, does anyone know of benchmarks that illustrate when Spark is most (and least) appropriate to use?
  * I've often heard "when your data is not manageable on one computer," but I'd appreciate more concrete comparisons if possible.
  * If anyone has benchmarks that account for data size, type of operation, etc., that would be extremely helpful.
    * At what point does the efficiency overtake the overhead, and when is it substantially faster (compared to R's caret/gbm, h2o, Python, etc.)?

Thanks so much!
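To clarify what I mean by "multiple iterations in the same command": below is a rough sketch of the caret-style tuning workflow I'm hoping PySpark supports, based on skimming the pyspark.ml docs. The column names, grid values, and file path are placeholders for my own data, so treat it as an illustration rather than a working example.

```python
# Sketch of a caret-like grid search over a gradient-boosted tree model in PySpark.
# Column names ("features", "label"), grid values, and the input path are placeholders.
from pyspark.sql import SparkSession
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

spark = SparkSession.builder.appName("gbm-tuning-sketch").getOrCreate()

# Assume a DataFrame with an assembled "features" vector column and a "label" column.
train_df = spark.read.parquet("train.parquet")  # placeholder path

gbt = GBTClassifier(featuresCol="features", labelCol="label")

# Several hyperparameter settings evaluated in one call, roughly like caret's tuneGrid.
grid = (ParamGridBuilder()
        .addGrid(gbt.maxDepth, [3, 5])
        .addGrid(gbt.maxIter, [20, 50])
        .build())

cv = CrossValidator(estimator=gbt,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(labelCol="label"),
                    numFolds=3)

model = cv.fit(train_df)   # fits every combination in the grid
print(model.avgMetrics)    # cross-validated metric for each combination
```

Is this roughly how people do it, or is there a more idiomatic approach?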