Hi Haoyue,

Could you find the time spent on each stage of the LogisticRegression model
training in the Spark UI?
That will tell us which stage is the most time-consuming and help us
analyze the cause.
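
If clicking through the UI is tedious, the same per-stage times can also be
logged programmatically. A minimal sketch (untested; it uses the
@DeveloperApi listener interface in Spark 1.5 and assumes `sc` is your
JavaSparkContext):

    import org.apache.spark.JavaSparkListener;
    import org.apache.spark.scheduler.SparkListenerStageCompleted;
    import org.apache.spark.scheduler.StageInfo;

    // Log each stage's wall-clock duration as it completes.
    sc.sc().addSparkListener(new JavaSparkListener() {
        @Override
        public void onStageCompleted(SparkListenerStageCompleted event) {
            StageInfo info = event.stageInfo();
            if (info.submissionTime().isDefined()
                    && info.completionTime().isDefined()) {
                long ms = (Long) info.completionTime().get()
                        - (Long) info.submissionTime().get();
                System.out.println("Stage " + info.stageId()
                        + " (" + info.name() + "): " + ms + " ms");
            }
        }
    });

The training job's stages then show up in the driver log, so the two runs
can be compared directly.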

Yanbo

2015-12-05 15:14 GMT+08:00 Haoyue Wang <whymoon...@gmail.com>:

> Hi all,
> I'm doing some experiments with Spark MLlib (version 1.5.0). I train a
> LogisticRegressionModel on a 2.06GB dataset (# of data points: 2396130, # of
> features: 3231961, # of classes: 2, format: LibSVM). I deployed Spark on a
> 4-node cluster; each node's spec: CPU: Intel(R) Xeon(R) CPU E5-2650 0 @
> 2.00GHz (2 CPUs x 8 cores x 2 threads); Network: 40Gbps InfiniBand; RAM:
> 256GB (Spark configuration: driver 100GB, executor 100GB).
>
> I'm doing two experiments (sketched below):
> 1) Load the data into Hive, use a HiveContext in the Spark program to load
> it from Hive into a DataFrame, parse the DataFrame into an
> RDD<LabeledPoint>, then train the LogisticRegressionModel on this RDD.
> The training time is 389218 milliseconds.
> 2) Load the data from a socket server (which has applied some feature
> transformations, adding 5 features to each datum, so the # of features is
> 3231966) into an RDD. Then repartition this RDD into 16 partitions, parse
> it into an RDD<LabeledPoint>, and finally train the LogisticRegressionModel
> on this RDD.
> The training time is 838470 milliseconds.
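>
> For reference, the two pipelines look roughly like this (a simplified
> sketch; the table name, the socket loader, and parsePoint/parseLine are
> placeholders for my actual code):
>
>     import org.apache.spark.api.java.JavaRDD;
>     import org.apache.spark.mllib.regression.LabeledPoint;
>     import org.apache.spark.sql.DataFrame;
>
>     // Experiment 1: Hive -> DataFrame -> RDD<LabeledPoint>
>     DataFrame df = hiveContext.table("training_data");   // placeholder table name
>     JavaRDD<LabeledPoint> training1 =
>         df.javaRDD().map(row -> parsePoint(row));        // parsePoint: my parsing code
>
>     // Experiment 2: socket server -> RDD -> 16 partitions -> RDD<LabeledPoint>
>     JavaRDD<String> raw = loadFromSocket("host", 9999);  // placeholder loader; the
>                                                          // server has already added
>                                                          // the 5 extra features
>     JavaRDD<LabeledPoint> training2 = raw
>         .repartition(16)
>         .map(line -> parseLine(line));                   // parseLine: my parsing code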
>
> The training time mentioned above is only the time of:
>
>     final LogisticRegressionModel model = new
>         LogisticRegressionWithLBFGS().setNumClasses(2).run(training.rdd());
>
> It does not include the loading and parsing time.
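> (I measure it by bracketing just that call, roughly:)
>
>     long start = System.currentTimeMillis();
>     // ... the run(...) call above ...
>     long trainingTimeMs = System.currentTimeMillis() - start;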
>
> So here is the question: why do these two experiments' training times
> differ so much? I expected them to be similar, but experiment 2 takes about
> 2x as long. I even tried repartitioning the RDD into 4/32/64/128 partitions
> and caching it before training in experiment 2 (see the sketch below), but
> it made no difference.
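>
> Concretely, that variant looks roughly like this (a sketch; 32 partitions
> shown as one example):
>
>     // Repartition and cache before training; count() materializes the
>     // cache so the caching cost is not part of the measured training time.
>     JavaRDD<LabeledPoint> repartitioned = training.repartition(32).cache();
>     repartitioned.count();
>     // ... then train on repartitioned.rdd() as above ...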
>
> Is there any internal difference between the RDDs used for training in the
> two experiments that could cause the difference in training time?
> I would appreciate any guidance you can give me.
>
> Best,
> Haoyue
>
