On Tue, Jul 15, 2014 at 10:48 PM, Makoto Yui yuin...@gmail.com wrote:
Hello,
(2014/06/19 23:43), Xiangrui Meng wrote:
The execution was slow for the larger KDD Cup 2012, Track 2 dataset
(235M+ records with 16.7M+ (2^24) sparse features, about 33.6 GB) due to the
sequential aggregation of dense
Hi Xiangrui,
(2014/07/16 15:05), Xiangrui Meng wrote:
I don't remember writing that, but thanks for bringing this issue up!
There are two important settings to check: 1) driver memory (you can
see it from the executor tab), 2) the number of partitions (try to use a
small number of partitions). I put
Hello,
(2014/06/19 23:43), Xiangrui Meng wrote:
The execution was slow for the larger KDD Cup 2012, Track 2 dataset (235M+
records with 16.7M+ (2^24) sparse features, about 33.6 GB) due to the sequential
aggregation of dense vectors on a single driver node.
It took about 7.6m for aggregation
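For scale, a quick back-of-the-envelope check (my own arithmetic, not from the thread, assuming 8-byte doubles and ignoring serialization overhead) of what a single dense gradient vector over 2^24 features costs per partition message:

```python
# Size of one dense double vector of 2^24 features, as shipped from a
# partition to the driver during aggregation (8 bytes per double,
# serialization overhead ignored).
num_features = 2 ** 24              # 16,777,216 features
bytes_per_double = 8
vector_mib = num_features * bytes_per_double / (1024 ** 2)
print(vector_mib)                   # 128.0 MiB per partition, per iteration
```

At that size, each per-partition result is far larger than a default Akka frame, which is consistent with the frame-size problem discussed later in the thread.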
Xiangrui and Debasish,
(2014/06/18 6:33), Debasish Das wrote:
I ran a pretty big sparse dataset (20M rows, 3M sparse features) and I
got 100 iterations of SGD running in 200 seconds... 10 executors, each
with 16 GB memory...
I figured out what the problem is: spark.akka.frameSize was too
Xiangrui,
(2014/06/19 23:43), Xiangrui Meng wrote:
It is because the frame size is not set correctly in the executor backend; see
SPARK-1112. We are going to fix it in v1.0.1. Did you try treeAggregate?
Not yet. I will wait for the v1.0.1 release.
Thanks,
Makoto
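For readers unfamiliar with the treeAggregate mentioned above: the idea is to merge partial results in rounds, so the driver receives a handful of pre-combined values instead of one vector per partition. A minimal pure-Python sketch of that idea (an illustration only, not Spark's implementation; the grouping strategy and the role of `depth` are simplified):

```python
from functools import reduce

def tree_aggregate(partition_results, combine, depth=2):
    """Combine per-partition results in rounds, so most merges happen
    before the driver performs the final reduction."""
    items = list(partition_results)
    # Roughly `depth` rounds: merge groups of `scale` items per round.
    scale = max(2, int(round(len(items) ** (1.0 / depth))))
    while len(items) > scale:
        items = [reduce(combine, items[i:i + scale])
                 for i in range(0, len(items), scale)]
    return reduce(combine, items)

total = tree_aggregate(range(32), lambda a, b: a + b)
print(total)  # 496, the same result as a flat reduce
```

With 32 partitions and a 128 MiB gradient vector each, flat aggregation makes the driver the bottleneck; a tree reduces both the driver's peak memory and the number of large messages it must receive.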
Hello,
I have been evaluating LogisticRegressionWithSGD of Spark 1.0 MLlib on
Hadoop 0.20.2-cdh3u6, but it does not work for a sparse dataset, though
the number of training examples used in the evaluation is just 1,000.
It works fine for the dataset *news20.binary.1000*, which has 178,560
features.
---
Thanks,
Makoto
2014-06-17 21:32 GMT+09:00 Makoto Yui yuin...@gmail.com:
Hello,
I have been evaluating LogisticRegressionWithSGD of Spark 1.0 MLlib on
Hadoop 0.20.2-cdh3u6, but it does not work for a sparse dataset, though
the number
Hi Xiangrui,
(2014/06/18 4:58), Xiangrui Meng wrote:
How many partitions did you set? If there are too many partitions,
please do a coalesce before calling ML algorithms.
The training data news20.random.1000 is small, and thus only 2
partitions are used by default.
val training =
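For context on the coalesce advice above: a toy pure-Python illustration of what reducing the partition count means (the function name mirrors Spark's RDD method but this is not Spark's API; the round-robin merge is a simplification — Spark's coalesce avoids a shuffle by merging co-located partitions):

```python
# Toy illustration: coalescing merges many small partitions into fewer,
# larger ones, so fewer per-partition results travel to the driver
# during aggregation.
def coalesce(partitions, n):
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i % n].extend(part)   # simplification: round-robin merge
    return merged

parts = [[i] for i in range(8)]      # 8 tiny partitions
smaller = coalesce(parts, 2)         # down to 2 partitions
print(len(smaller))                  # 2, with no records lost
```

Fewer partitions means fewer (but larger) gradient vectors shipped to the driver per iteration, which is why coalescing before training helps here.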
Hi Xiangrui,
(2014/06/18 6:03), Xiangrui Meng wrote:
Are you using Spark 1.0 or 0.9? Could you go to the executor tab of
the web UI and check the driver's memory?
I am using Spark 1.0.
588.8 MB is allocated for driver RDDs.
I am setting SPARK_DRIVER_MEMORY=2g in conf/spark-env.sh.
The
Hi Xiangrui,
(2014/06/18 8:49), Xiangrui Meng wrote:
Makoto, dense vectors are used in aggregation. If you have 32
partitions and each one sends a dense vector of size 1,354,731 to the
master, then the driver needs 300 MB+. That may be the problem.
It seems that it could cause certain problems
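Checking the arithmetic in Xiangrui's estimate (my own back-of-envelope, assuming 8-byte doubles and no serialization overhead):

```python
# 32 partitions, each sending a dense double vector of 1,354,731
# elements to the driver (8 bytes per double).
partitions = 32
vector_len = 1_354_731
total_mb = partitions * vector_len * 8 / (1024 ** 2)
print(round(total_mb))  # ~331 MB, consistent with the "300 MB+" estimate
```
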