Hi all,
I'm running some experiments with Spark MLlib (version 1.5.0). I train a
LogisticRegressionModel on a 2.06 GB dataset (# of examples: 2396130, # of
features: 3231961, # of classes: 2, format: LibSVM). Spark is deployed on a
4-node cluster; each node's spec: CPU: Intel(R) Xeon(R) CPU E5-2650 0 @
2.00GHz (2 CPUs x 8 cores x 2 threads); Network: 40 Gbps InfiniBand; RAM:
256 GB (Spark configuration: 100 GB driver memory, 100 GB executor memory).

I ran two experiments (a simplified sketch of both pipelines follows below):
1) Load the data into Hive, use a HiveContext in the Spark program to load
it from Hive into a DataFrame, parse the DataFrame into an
RDD<LabeledPoint>, and then train the LogisticRegressionModel on this RDD.
The training time is 389218 milliseconds.
2) Load the data from a socket server into an RDD, applying some feature
transformations that add 5 features to each datum, so the # of features
becomes 3231966. Then repartition this RDD into 16 partitions, parse it
into an RDD<LabeledPoint>, and finally train the LogisticRegressionModel on
this RDD.
The training time is 838470 milliseconds.
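
To make the setup concrete, here is a simplified Java sketch of how the two
training RDDs are built (the Hive table/column names, the LibSVM parser, and
the socket-loading part are placeholders, not the exact code I run):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.mllib.classification.LogisticRegressionModel;
    import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS;
    import org.apache.spark.mllib.linalg.Vectors;
    import org.apache.spark.mllib.regression.LabeledPoint;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.hive.HiveContext;

    public class TrainingSketch {

      public static void main(String[] args) {
        JavaSparkContext jsc =
            new JavaSparkContext(new SparkConf().setAppName("lr-experiments"));

        // Experiment 1: Hive -> DataFrame -> RDD<LabeledPoint>.
        // Assumes the Hive table stores the raw LibSVM lines in a single
        // string column ("libsvm_table" / "line" are made-up names).
        HiveContext hiveContext = new HiveContext(jsc.sc());
        DataFrame df = hiveContext.sql("SELECT line FROM libsvm_table");
        JavaRDD<LabeledPoint> training1 =
            df.javaRDD().map((Row row) -> parseLibSVMLine(row.getString(0)));

        // Experiment 2: socket -> feature transform (+5 features) ->
        // repartition(16) -> RDD<LabeledPoint>. The socket loading and the
        // extra features are specific to my program, so only hinted at here:
        // JavaRDD<String> raw = ...;  // read from the socket server
        // JavaRDD<LabeledPoint> training2 =
        //     raw.map(line -> parseLibSVMLine(line)).repartition(16);

        // The same training call is used (and timed) in both experiments.
        final LogisticRegressionModel model = new LogisticRegressionWithLBFGS()
            .setNumClasses(2).run(training1.rdd());

        jsc.stop();
      }

      // Minimal LibSVM parser: "label idx1:val1 idx2:val2 ...", 1-based indices.
      private static LabeledPoint parseLibSVMLine(String line) {
        String[] parts = line.trim().split("\\s+");
        double label = Double.parseDouble(parts[0]);
        int[] indices = new int[parts.length - 1];
        double[] values = new double[parts.length - 1];
        for (int i = 1; i < parts.length; i++) {
          String[] kv = parts[i].split(":");
          indices[i - 1] = Integer.parseInt(kv[0]) - 1;  // LibSVM is 1-based
          values[i - 1] = Double.parseDouble(kv[1]);
        }
        // 3231961 features in the original dataset (3231966 in experiment 2).
        return new LabeledPoint(label, Vectors.sparse(3231961, indices, values));
      }
    }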

The training time mentioned above covers only the call:

    final LogisticRegressionModel model = new
        LogisticRegressionWithLBFGS().setNumClasses(2).run(training.rdd());

It does not include the loading and parsing time.
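
For reference, the time is measured directly around that call, roughly like
this (simplified):

    long start = System.currentTimeMillis();
    final LogisticRegressionModel model = new LogisticRegressionWithLBFGS()
        .setNumClasses(2).run(training.rdd());
    long trainingTimeMs = System.currentTimeMillis() - start;
    System.out.println("training time (ms): " + trainingTimeMs);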

So here is my question: why is there such a large difference between the
two experiments' training times? I expected them to be similar, but
experiment 2 takes about 2x as long. In experiment 2 I even tried
repartitioning the RDD into 4/32/64/128 partitions and caching it before
training, but that made no noticeable difference.
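
Roughly, that variant looks like this ("parsed" stands for the already-parsed
JavaRDD<LabeledPoint>; the count() is only there to make sure the cache is
populated before the timed training call):

    JavaRDD<LabeledPoint> training2 = parsed
        .repartition(64)   // also tried 4 / 16 / 32 / 128
        .cache();
    training2.count();     // force materialization of the cached RDD
    final LogisticRegressionModel model2 = new LogisticRegressionWithLBFGS()
        .setNumClasses(2).run(training2.rdd());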

Is there any internal difference between the RDDs used for training in the
two experiments that could cause this gap in training time?
I would appreciate any guidance you can give.

Best,
Haoyue
