Hi all,

I'm running some experiments with Spark MLlib (version 1.5.0). I train a LogisticRegressionModel on a 2.06 GB dataset (# of examples: 2,396,130; # of features: 3,231,961; # of classes: 2; format: LibSVM). Spark is deployed on a 4-node cluster. Each node has 2 Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz CPUs (8 cores x 2 threads each), 256 GB RAM, and 40 Gbps InfiniBand networking (Spark configuration: 100 GB driver memory, 100 GB executor memory).
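For reference, the 100 GB driver/executor memory corresponds to submit options roughly like the following (the class name, master URL, and jar are placeholders, not my actual ones):

    spark-submit \
      --class com.example.LrTraining \
      --master spark://master:7077 \
      --driver-memory 100g \
      --executor-memory 100g \
      lr-training.jar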
I'm running two experiments:

1) Load the data into Hive, use HiveContext in the Spark program to load it from Hive into a DataFrame, parse the DataFrame into an RDD<LabeledPoint>, and train the LogisticRegressionModel on that RDD. The training time is 389,218 ms.

2) Load the data from a socket server into an RDD and apply some feature transformations that add 5 features to each example, so the # of features becomes 3,231,966. Then repartition this RDD into 16 partitions, parse it into an RDD<LabeledPoint>, and train the LogisticRegressionModel on it. The training time is 838,470 ms.

The training time above covers only this call:

    final LogisticRegressionModel model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(training.rdd());

It does not include the loading and parsing time.

So here is the question: why is there such a large difference (roughly 2x) between the training times of the two experiments? I expected them to be similar. In experiment 2 I also tried repartitioning the RDD into 4/32/64/128 partitions and caching it before training, but that made no difference. Is there some internal difference between the RDDs used for training in the two experiments that causes the difference in training time?

I would appreciate any guidance.

Best,
Haoyue
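P.S. For concreteness, here is a minimal sketch of the pipeline (Spark 1.5.0, Java API). The Hive table name, column layout, and feature-string parsing below are placeholders for illustration, not my exact code:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.function.Function;
    import org.apache.spark.mllib.classification.LogisticRegressionModel;
    import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS;
    import org.apache.spark.mllib.linalg.Vectors;
    import org.apache.spark.mllib.regression.LabeledPoint;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.hive.HiveContext;

    public class LrTraining {
      public static void main(String[] args) {
        JavaSparkContext jsc = new JavaSparkContext(new SparkConf().setAppName("lr-training"));

        // Experiment 1: load from Hive and parse each Row into a LabeledPoint.
        // Table and column names are placeholders; "features" is assumed to be
        // a LibSVM-style "index:value index:value ..." string.
        HiveContext hive = new HiveContext(jsc.sc());
        DataFrame df = hive.sql("SELECT label, features FROM training_data");
        JavaRDD<LabeledPoint> training = df.javaRDD().map(new Function<Row, LabeledPoint>() {
          @Override
          public LabeledPoint call(Row row) {
            double label = row.getDouble(0);
            String[] items = row.getString(1).trim().split(" ");
            int[] indices = new int[items.length];
            double[] values = new double[items.length];
            for (int i = 0; i < items.length; i++) {
              String[] kv = items[i].split(":");
              indices[i] = Integer.parseInt(kv[0]) - 1; // LibSVM indices are 1-based
              values[i] = Double.parseDouble(kv[1]);
            }
            return new LabeledPoint(label, Vectors.sparse(3231961, indices, values));
          }
        });

        // Experiment 2 differs only in where the RDD comes from (socket server,
        // 5 extra features -> 3,231,966 dimensions) and in the explicit
        // repartition(16) / cache() applied before training.

        // The timed part is identical in both experiments:
        long start = System.currentTimeMillis();
        final LogisticRegressionModel model =
            new LogisticRegressionWithLBFGS().setNumClasses(2).run(training.rdd());
        System.out.println("training took " + (System.currentTimeMillis() - start) + " ms");

        jsc.stop();
      }
    }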