Re: news20-binary classification with LogisticRegressionWithSGD

2014-06-19 Thread Makoto Yui
Xiangrui, (2014/06/19 23:43), Xiangrui Meng wrote: It is because the frame size is not set correctly in executor backend. see spark-1112 . We are going to fix it in v1.0.1 . Did you try the treeAggregate? Not yet. I will wait the v1.0.1 release. Thanks, Makoto

Re: news20-binary classification with LogisticRegressionWithSGD

2014-06-19 Thread Xiangrui Meng
It is because the frame size is not set correctly in executor backend. see spark-1112 . We are going to fix it in v1.0.1 . Did you try the treeAggregate? > On Jun 19, 2014, at 2:01 AM, Makoto Yui wrote: > > Xiangrui and Debasish, > > (2014/06/18 6:33), Debasish Das wrote: >> I did run pretty b

Re: news20-binary classification with LogisticRegressionWithSGD

2014-06-19 Thread Makoto Yui
Xiangrui and Debasish, (2014/06/18 6:33), Debasish Das wrote: I did run pretty big sparse dataset (20M rows, 3M sparse features) and I got 100 iterations of SGD running in 200 seconds...10 executors each with 16 GB memory... I could figure out what the problem is. "spark.akka.frameSize" was to

Re: news20-binary classification with LogisticRegressionWithSGD

2014-06-17 Thread Makoto Yui
Hi Xiangrui, (2014/06/18 8:49), Xiangrui Meng wrote: Makoto, dense vectors are used to in aggregation. If you have 32 partitions and each one sending a dense vector of size 1,354,731 to master. Then the driver needs 300M+. That may be the problem. It seems that it could cuase certain problems

Re: news20-binary classification with LogisticRegressionWithSGD

2014-06-17 Thread Xiangrui Meng
Makoto, please use --driver-memory 8G when you launch spark-shell. -Xiangrui On Tue, Jun 17, 2014 at 4:49 PM, Xiangrui Meng wrote: > DB, Yes, reduce and aggregate are linear. > > Makoto, dense vectors are used to in aggregation. If you have 32 > partitions and each one sending a dense vector of s

Re: news20-binary classification with LogisticRegressionWithSGD

2014-06-17 Thread Xiangrui Meng
DB, Yes, reduce and aggregate are linear. Makoto, dense vectors are used to in aggregation. If you have 32 partitions and each one sending a dense vector of size 1,354,731 to master. Then the driver needs 300M+. That may be the problem. Which deploy mode are you using, standalone or local? Debasi

Re: news20-binary classification with LogisticRegressionWithSGD

2014-06-17 Thread Debasish Das
Xiangrui, Could you point to the JIRA related to tree aggregate ? ...sounds like the allreduce idea... I would definitely like to try it on our dataset... Makoto, I did run pretty big sparse dataset (20M rows, 3M sparse features) and I got 100 iterations of SGD running in 200 seconds...10 execu

Re: news20-binary classification with LogisticRegressionWithSGD

2014-06-17 Thread Makoto Yui
Hi Xiangrui, (2014/06/18 6:03), Xiangrui Meng wrote: Are you using Spark 1.0 or 0.9? Could you go to the executor tab of the web UI and check the driver's memory? I am using Spark 1.0. 588.8 MB is allocated for RDDs. I am setting SPARK_DRIVER_MEMORY=2g in the conf/spark-env.sh. The value al

Re: news20-binary classification with LogisticRegressionWithSGD

2014-06-17 Thread DB Tsai
Hi Xiangrui, Does it mean that mapPartition and then reduce shares the same behavior as aggregate operation which is O(n)? Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Tue, Jun 17, 201

Re: news20-binary classification with LogisticRegressionWithSGD

2014-06-17 Thread Xiangrui Meng
Hi Makoto, Are you using Spark 1.0 or 0.9? Could you go to the executor tab of the web UI and check the driver's memory? treeAggregate is not part of 1.0. Best, Xiangrui On Tue, Jun 17, 2014 at 2:00 PM, Xiangrui Meng wrote: > Hi DB, > > treeReduce (treeAggregate) is a feature I'm testing now.

Re: news20-binary classification with LogisticRegressionWithSGD

2014-06-17 Thread Xiangrui Meng
Hi DB, treeReduce (treeAggregate) is a feature I'm testing now. It is a compromise between current reduce and butterfly allReduce. The former runs in linear time on the number of partitions, the latter introduces too many dependencies. treeAggregate with depth = 2 should run in O(sqrt(n)) time, wh

Re: news20-binary classification with LogisticRegressionWithSGD

2014-06-17 Thread Makoto Yui
Hi Xiangrui, (2014/06/18 4:58), Xiangrui Meng wrote: How many partitions did you set? If there are too many partitions, please do a coalesce before calling ML algorithms. The training data "news20.random.1000" is small and thus only 2 partitions are used by the default. val training = MLUti

Re: news20-binary classification with LogisticRegressionWithSGD

2014-06-17 Thread DB Tsai
Hi Xiangrui, What's different between treeAggregate and aggregate? Why treeAggregate scales better? What if we just use mapPartition, will it be as fast as treeAggregate? Thanks. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn:

Re: news20-binary classification with LogisticRegressionWithSGD

2014-06-17 Thread Xiangrui Meng
Hi Makoto, How many partitions did you set? If there are too many partitions, please do a coalesce before calling ML algorithms. Btw, could you try the tree branch in my repo? https://github.com/mengxr/spark/tree/tree I used tree aggregate in this branch. It should help with the scalability. Be

Re: news20-binary classification with LogisticRegressionWithSGD

2014-06-17 Thread Makoto Yui
Here is follow-up to the previous evaluation. "aggregate at GradientDescent.scala:178" never finishes at https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala#L178 We confirmed, by -verbose:gc, that GC is not happening during th

news20-binary classification with LogisticRegressionWithSGD

2014-06-17 Thread Makoto Yui
Hello, I have been evaluating LogisticRegressionWithSGD of Spark 1.0 MLlib on Hadoop 0.20.2-cdh3u6 but it does not work for a sparse dataset though the number of training examples used in the evaluation is just 1,000. It works fine for the dataset *news20.binary.1000* that has 178,560 features. H