Felix Cheung created SPARK-22925:
------------------------------------

             Summary: ml model persistence creates a lot of small files
                 Key: SPARK-22925
                 URL: https://issues.apache.org/jira/browse/SPARK-22925
             Project: Spark
          Issue Type: Bug
          Components: MLlib
    Affects Versions: 2.2.1, 2.1.2, 2.3.0
            Reporter: Felix Cheung

Today, when model.save() is called, some ML models do makeRDD(data, 1) or repartition(1), while other models do not.
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/regression/impl/GLMRegressionModel.scala#L60

In the former case, issues such as SPARK-19294 have been reported about the single output file being very large.

In the latter case, a model such as RandomForestModel could create hundreds or thousands of files, which is also unmanageable. Looking into this, there is no simple way to set or change spark.default.parallelism while the app is running, since SparkConf seems to be copied/cached by the backend without a way to update it.
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/model/treeEnsembleModels.scala#L443
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeansModel.scala#L155
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeansModel.scala#L135

It seems we need a way to make this settable on a per-use basis.
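
As a rough illustration of what "settable on a per-use basis" could look like, here is a minimal sketch. It assumes the model data is written as Parquet from a DataFrame, so the number of output files follows the number of partitions at write time. The helper saveData and its numPartitions parameter are hypothetical names made up for this example and are not part of Spark; only repartition, write.parquet, and range are existing Spark calls.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

object PerCallPartitions {
  // Hypothetical helper (not a Spark API): writes `data` as Parquet.
  // If `numPartitions` is given, the data is repartitioned to that many
  // partitions first; otherwise the existing partitioning is kept.
  def saveData(data: DataFrame, path: String, numPartitions: Option[Int] = None): Unit = {
    val out = numPartitions match {
      case Some(n) => data.repartition(n)
      case None    => data
    }
    out.write.parquet(path)
  }

  // Counts the part-* data files written under `path`
  // (ignores _SUCCESS and hidden checksum files).
  def countPartFiles(path: String): Int =
    new java.io.File(path).listFiles().count(_.getName.startsWith("part-"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("per-call-partitions")
      .getOrCreate()
    val base = Files.createTempDirectory("spark-22925").toString

    // Stand-in for model data that happens to be spread over many partitions.
    val data = spark.range(0, 1000, 1, 50).toDF("id")

    saveData(data, s"$base/default")       // keeps the 50 input partitions
    saveData(data, s"$base/four", Some(4)) // caller picks the partition count

    println(s"default: ${countPartFiles(s"$base/default")} part files")
    println(s"four:    ${countPartFiles(s"$base/four")} part files")
    spark.stop()
  }
}
```

With something along these lines, the caller rather than spark.default.parallelism decides how many files a given save produces, which would cover both the single-large-file case and the many-small-files case. Whether this belongs in the save() signature, a writer option, or a config is left open here.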