Hi Haoyue,

Could you find the time spent on each stage of the LogisticRegression model training in the Spark UI? It would tell us which stage is the most time-consuming and help us analyze the cause.
Yanbo

2015-12-05 15:14 GMT+08:00 Haoyue Wang <whymoon...@gmail.com>:
> Hi all,
> I'm doing some experiments with Spark MLlib (version 1.5.0). I train a
> LogisticRegressionModel on a 2.06GB dataset (# of data points: 2396130,
> # of features: 3231961, # of classes: 2, format: LibSVM). I deployed
> Spark on a 4-node cluster; each node's spec: CPU: 2 x Intel(R) Xeon(R)
> CPU E5-2650 0 @ 2.00GHz (8 cores, 2 threads each); Network: 40Gbps
> InfiniBand; RAM: 256GB (Spark configuration: driver 100GB, executor
> 100GB).
>
> I'm running two experiments:
> 1) Load the data into Hive, use HiveContext in the Spark program to
> load it from Hive into a DataFrame, parse the DataFrame into an
> RDD<LabeledPoint>, then train a LogisticRegressionModel on this RDD.
> The training time is 389218 milliseconds.
> 2) Load the data from a socket server into an RDD, after some feature
> transformation that adds 5 features to each datum (so the # of features
> is 3231966). Then repartition this RDD into 16 partitions, parse it
> into an RDD<LabeledPoint>, and finally train a LogisticRegressionModel
> on this RDD.
> The training time is 838470 milliseconds.
>
> The training time mentioned above is only the time of:
> final LogisticRegressionModel model = new
> LogisticRegressionWithLBFGS().setNumClasses(2).run(training.rdd());
> It does not include the loading and parsing time.
>
> So here is the question: why do these two experiments' training times
> differ so much? I expected them to be similar, but experiment 2 takes
> about 2x as long. I even tried repartitioning the RDD into 4/32/64/128
> partitions and caching it before training in experiment 2, but that
> made no difference.
>
> Is there any internal difference between the RDDs used for training in
> the two experiments that causes the difference in training time?
> I would appreciate any guidance you can give me.
>
> Best,
> Haoyue
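For reference, the timing methodology the email describes (a wall-clock timer wrapped around only the `run(...)` training call, excluding loading and parsing) can be sketched in plain Java. This is a minimal sketch, not the poster's actual code: `simulatedTraining` below is a hypothetical stand-in for the Spark `LogisticRegressionWithLBFGS().run(...)` call, so the sketch runs without a cluster.

```java
public class TrainingTimer {
    // Hypothetical stand-in for the Spark training call; here it just
    // burns some CPU so there is something to time.
    static double simulatedTraining() {
        double acc = 0.0;
        for (int i = 1; i <= 1_000_000; i++) {
            acc += Math.sqrt(i);
        }
        return acc;
    }

    // Times a single task in milliseconds, mirroring how the quoted
    // email measured only the model-training step.
    static long timeMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1_000_000L;
    }

    public static void main(String[] args) {
        long elapsed = timeMillis(() -> simulatedTraining());
        System.out.println("training took " + elapsed + " ms");
    }
}
```

Measuring only the training call this way is consistent across both experiments, which is why the 2x gap points at a difference in the input RDDs (partitioning, locality, caching) rather than in the measurement itself.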