Xiangrui,
(2014/06/19 23:43), Xiangrui Meng wrote:
> It is because the frame size is not set correctly in the executor
> backend; see SPARK-1112. We are going to fix it in v1.0.1. Did you try
> treeAggregate?
Not yet. I will wait for the v1.0.1 release.
Thanks,
Makoto
It is because the frame size is not set correctly in the executor
backend; see SPARK-1112. We are going to fix it in v1.0.1. Did you try
treeAggregate?
Xiangrui and Debasish,
(2014/06/18 6:33), Debasish Das wrote:
> I did run a pretty big sparse dataset (20M rows, 3M sparse features) and
> I got 100 iterations of SGD running in 200 seconds... 10 executors, each
> with 16 GB memory...
I figured out what the problem is: "spark.akka.frameSize" was too small.
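For anyone hitting the same limit, a minimal sketch of raising the frame
size (the property name is real in Spark 1.0; the 128 MB value is
illustrative, not taken from this thread):

  // Sketch: raising the Akka frame size via SparkConf. In Spark 1.0 the
  // value is interpreted in megabytes; the default is 10.
  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("LRWithSGD")            // illustrative app name
    .set("spark.akka.frameSize", "128") // illustrative value, in MB
  val sc = new SparkContext(conf)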
Hi Xiangrui,
(2014/06/18 8:49), Xiangrui Meng wrote:
> Makoto, dense vectors are used in aggregation. If you have 32
> partitions and each one sends a dense vector of size 1,354,731 to the
> master, then the driver needs 300 MB+. That may be the problem.
It seems that it could cause certain problems.
Makoto, please use --driver-memory 8G when you launch spark-shell. -Xiangrui
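For reference, a sketch of that launch command (--driver-memory is a real
spark-shell/spark-submit option in Spark 1.0):

  ./bin/spark-shell --driver-memory 8G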
DB, yes, reduce and aggregate are linear.
Makoto, dense vectors are used in aggregation. If you have 32 partitions
and each one sends a dense vector of size 1,354,731 to the master, then
the driver needs 300 MB+. That may be the problem. Which deploy mode are
you using, standalone or local?
Debasish, ...
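A back-of-the-envelope check of the 300 MB+ figure (plain arithmetic, not
from the original mail):

  // 32 partitions x 1,354,731 doubles x 8 bytes per double
  val bytes = 32L * 1354731L * 8L // = 346,811,136 bytes, roughly 330 MB
  // before serialization overhead, so the 300M+ estimate checks out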
Xiangrui,
Could you point me to the JIRA related to tree aggregate? ...sounds like
the allreduce idea...
I would definitely like to try it on our dataset...
Makoto,
I did run a pretty big sparse dataset (20M rows, 3M sparse features) and
I got 100 iterations of SGD running in 200 seconds... 10 executors, each
with 16 GB memory...
Hi Xiangrui,
(2014/06/18 6:03), Xiangrui Meng wrote:
> Are you using Spark 1.0 or 0.9? Could you go to the executor tab of
> the web UI and check the driver's memory?
I am using Spark 1.0.
588.8 MB is allocated for RDDs.
I am setting SPARK_DRIVER_MEMORY=2g in the conf/spark-env.sh.
The value al...
Hi Xiangrui,
Does it mean that mapPartitions and then reduce share the same behavior
as the aggregate operation, which is O(n)?
Sincerely,
DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai
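A minimal sketch of the pattern DB is asking about (assuming an RDD[Int]
named data): aggregate can be expressed as mapPartitions followed by
reduce, and either way one partial result per partition travels to the
driver, so both are linear in the number of partitions.

  // aggregate expressed as mapPartitions + reduce, with a sum as the example
  val partials = data.mapPartitions { iter =>
    Iterator(iter.foldLeft(0L)((acc, x) => acc + x)) // per-partition seqOp
  }
  val total = partials.reduce(_ + _) // combOp, applied on the driver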
Hi Makoto,
Are you using Spark 1.0 or 0.9? Could you go to the executor tab of
the web UI and check the driver's memory?
treeAggregate is not part of 1.0.
Best,
Xiangrui
Hi DB,
treeReduce (treeAggregate) is a feature I'm testing now. It is a
compromise between the current reduce and a butterfly allReduce. The
former runs in time linear in the number of partitions, while the latter
introduces too many dependencies. treeAggregate with depth = 2 should
run in O(sqrt(n)) time, where n is the number of partitions.
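A minimal sketch of the API (not part of Spark 1.0; the signature below
is the one that later shipped in Spark 1.1, and sc is the spark-shell
context):

  val data = sc.parallelize(1 to 1000000, 32)

  // Plain aggregate: all 32 partial sums go straight to the driver.
  val total = data.aggregate(0L)((acc, x) => acc + x, _ + _)

  // treeAggregate with depth = 2: partials are first merged on roughly
  // sqrt(32) intermediate tasks, so the driver combines far fewer values.
  val totalTree = data.treeAggregate(0L)((acc, x) => acc + x, _ + _, depth = 2)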
Hi Xiangrui,
(2014/06/18 4:58), Xiangrui Meng wrote:
> How many partitions did you set? If there are too many partitions,
> please do a coalesce before calling ML algorithms.
The training data "news20.random.1000" is small, and thus only 2
partitions are used by default.
val training = MLUti...
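The elided line presumably loads the dataset; a hedged guess at its shape
(MLUtils.loadLibSVMFile is the Spark 1.0 loader for LibSVM-format data;
the path is illustrative):

  import org.apache.spark.mllib.util.MLUtils

  // Loads LabeledPoints with sparse feature vectors; a small file lands
  // in the default minimum of 2 partitions.
  val training = MLUtils.loadLibSVMFile(sc, "news20.random.1000").cache()
  println(training.partitions.length) // 2 here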
Hi Xiangrui,
What's the difference between treeAggregate and aggregate? Why does
treeAggregate scale better? If we just use mapPartitions, will it be as
fast as treeAggregate?
Thanks.
Sincerely,
DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai
Hi Makoto,
How many partitions did you set? If there are too many partitions,
please do a coalesce before calling ML algorithms.
Btw, could you try the tree branch in my repo?
https://github.com/mengxr/spark/tree/tree
I used tree aggregate in this branch. It should help with scalability.
Best,
Xiangrui
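A minimal sketch of the coalesce advice (the target of 32 partitions is
illustrative; training is the RDD from the earlier messages):

  // Shrink an over-partitioned RDD before training; coalesce avoids a
  // shuffle when it only reduces the partition count.
  val compact = training.coalesce(32)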
Here is a follow-up to the previous evaluation.
"aggregate at GradientDescent.scala:178" never finishes at
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala#L178
We confirmed, by -verbose:gc, that GC is not happening during the
aggregation.
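For context, a simplified, self-contained sketch (not the actual Spark
source; the gradient update is a placeholder) of the shape of that
aggregate: each of the n partitions folds its examples into a dense
partial-gradient array, and all n dense arrays travel to the driver to be
combined. At 1,354,731 features a partial is about 10.8 MB, just over the
10 MB default Akka frame size discussed above.

  val numFeatures = 1354731
  val examples = sc.parallelize(0 until 1000, 32).map(i => (i % 2.0, i))
  val (gradientSum, lossSum) =
    examples.aggregate((new Array[Double](numFeatures), 0.0))(
      seqOp = { case ((grad, loss), (label, idx)) =>
        grad(idx % numFeatures) += label // stand-in for the real gradient update
        (grad, loss + label)
      },
      combOp = { case ((g1, l1), (g2, l2)) =>
        var i = 0
        while (i < g1.length) { g1(i) += g2(i); i += 1 }
        (g1, l1 + l2)
      })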
Hello,
I have been evaluating LogisticRegressionWithSGD of Spark 1.0 MLlib on
Hadoop 0.20.2-cdh3u6, but it does not work for a sparse dataset, even
though the number of training examples used in the evaluation is just
1,000. It works fine for the dataset *news20.binary.1000*, which has
178,560 features. However, it does not finish for *news20.random.1000*,
which has 1,354,731 features.