Try to reduce the number of partitions to match the number of cores. We
will add treeAggregate to reduce the communication cost.
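
For example, something along these lines (a rough sketch only; `points` and the
core count are taken from your code and submit arguments below, and the element
type LabeledPoint is just an assumption, so adjust as needed):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.storage.StorageLevel;

// Sketch: shrink the ~951 partitions down to the number of cores actually in
// use (50 executors * 2 cores = 100 with your submit arguments), so each core
// works on one larger partition per aggregate instead of many small ones.
int totalCores = 50 * 2;
JavaRDD<LabeledPoint> repartitioned = points.coalesce(totalCores);
repartitioned.persist(StorageLevel.MEMORY_AND_DISK_SER());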

PR: https://github.com/apache/spark/pull/1110
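
To give an idea of the pattern (a rough sketch only; treeAggregate is not in a
released version yet, so the Java signature below is an assumption based on
that PR, with `points` and the LabeledPoint element type carried over from the
sketch above):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.mllib.regression.LabeledPoint;

// Sketch: each partition folds its records into a partial result (seqOp);
// partial results are then merged on the executors in a tree of depth 2
// (combOp) before the final merge on the driver.
Function2<Double, LabeledPoint, Double> seqOp = (acc, p) -> acc + p.label();
Function2<Double, Double, Double> combOp = (a, b) -> a + b;
double sum = points.treeAggregate(0.0, seqOp, combOp, 2);

In the real LBFGS code the partial result is the 1.3M-element gradient vector
rather than a double, which is exactly where the tree-shaped merge saves
communication to the driver.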

-Xiangrui

On Tue, Jul 1, 2014 at 12:55 AM, Charles Li <littlee1...@gmail.com> wrote:
> Hi Spark,
>
> I am running LBFGS on our user data. The data size with Kryo serialisation is
> about 210G. The weight vector size is around 1,300,000. I am quite puzzled that
> the performance is nearly the same whether the data is cached or not.
>
> The program is simple:
> points = sc.hadoopFile(int, SequenceFileInputFormat.class …..)
> points.persist(StorageLevel.MEMORY_AND_DISK_SER()) // comment this out if not cached
> gradient = new LogisticGradient();
> updater = new SquaredL2Updater();
> initWeight = Vectors.sparse(size, new int[]{}, new double[]{});
> result = LBFGS.runLBFGS(points.rdd(), gradient, updater, numCorrections,
> convergeTol, maxIter, regParam, initWeight);
>
> I have 13 machines with 16 CPUs and 48G RAM each. Spark is running in
> cluster mode. Below are some of the arguments I am using:
> --executor-memory 10G
> --num-executors 50
> --executor-cores 2
>
> Storage usage when caching:
> Cached Partitions 951
> Fraction Cached 100%
> Size in Memory 215.7GB
> Size in Tachyon 0.0B
> Size on Disk 1029.7MB
>
> Each aggregate takes around 5 minutes with the cache enabled, and lots of disk
> I/O can be seen on the Hadoop nodes. I get the same result with the cache
> disabled.
>
> Should caching the data points improve performance? Should caching reduce
> the disk I/O?
>
> Thanks in advance.
