Hi team,

We are testing the performance and capability of Spark for a linear regression application, with the goal of replacing at least scikit-learn's linear regression. We first generated data for model fitting via sklearn.datasets.make_regression; see the generation code here: <https://gist.github.com/yh882317/c177a8593e3bfd30e4da63a8fcb95593#file-datageneration-py>.

We then performed model fitting with sklearn (Python) <https://gist.github.com/yh882317/c177a8593e3bfd30e4da63a8fcb95593#file-linearregression-py> and Spark (Scala) <https://gist.github.com/yh882317/c177a8593e3bfd30e4da63a8fcb95593#file-linearregression-scala>. We set up the models according to this Stack Overflow post <https://stackoverflow.com/questions/42729431/spark-ml-regressions-do-not-calculate-same-models-as-scikit-learn> to try to get the same model on both sides. The Spark setup is:

- Spark 3.1
- spark-shell -I
- Spark Standalone with 3 executors (8 cores, 128 GB each)
- Driver: spark.driver.maxResultSize=20g; spark.driver.memory=100g; spark.executor.memory=128g
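For context, below is a simplified sketch of what our Scala side does (run inside spark-shell, so the spark session already exists). The input path and the "label" column name are placeholders; the actual code is in the gist above.

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression

// Load the generated data; path is a placeholder.
val raw = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/path/to/generated_data.csv")

// Assemble every non-label column into a single features vector.
val assembler = new VectorAssembler()
  .setInputCols(raw.columns.filter(_ != "label"))
  .setOutputCol("features")
val train = assembler.transform(raw).select("features", "label").cache()

// Per the Stack Overflow post: exact normal-equation solver and no
// regularization, so the result should match sklearn's LinearRegression.
val lr = new LinearRegression()
  .setSolver("normal")
  .setRegParam(0.0)
  .setFitIntercept(true)
  .setStandardization(false)

val model = lr.fit(train)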
Fitting speed:
- Python: 19 s
- Spark: 600+ s

Resource utilization:
- Python: 150 GB
- Spark: Node 1: 50+ GB; Nodes 2, 3, and the driver node: ~2 GB

The job gets stuck on MapPartitionsRDD [23] treeAggregate at WeightedLeastSquares.scala:107.

What we tried:
- the treeAggregate depth (setAggregationDepth)
- repartitioning
- standardization

None of these had a significant effect on training speed (a sketch of how we applied them is in the P.S. below). Could you please help us figure out where the issue comes from? Being 20x slower is not acceptable for us, even though Spark has other good features.

Thanks!
You Hu
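P.S. For completeness, this is roughly how we applied the tuning attempts listed above; the partition count and depth here are examples, not our exact values. train and lr refer to the sketch earlier in this mail.

// Illustration only; the exact values we tried varied.
val repartitioned = train.repartition(24)  // 3 executors x 8 cores
val tuned = lr
  .setAggregationDepth(4)                  // deeper treeAggregate
  .setStandardization(true)                // standardization toggled on
val model2 = tuned.fit(repartitioned)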