I don't think MLlib supports model serialization/deserialization. You
got the error because the constructor is private. I created a JIRA for
this: https://issues.apache.org/jira/browse/SPARK-2488 and we will try
to make sure it is implemented in v1.1. For now, you can modify the
KMeansModel and remove
crater, was the error message the same as what you posted before:
14/07/14 11:32:20 ERROR TaskSchedulerImpl: Lost executor 1 on node7: remote
Akka client disassociated
14/07/14 11:32:20 WARN TaskSetManager: Lost TID 20 (task 13.0:0)
14/07/14 11:32:21 ERROR TaskSchedulerImpl: Lost executor 3 on
, 2014 at 10:48 PM, Makoto Yui yuin...@gmail.com wrote:
Hello,
(2014/06/19 23:43), Xiangrui Meng wrote:
The execution was slow for the larger KDD Cup 2012, Track 2 dataset
(235M+ records of 16.7M+ (2^24) sparse features in about 33.6GB) due to the
sequential aggregation of dense vectors
Then it may be a new issue. Do you mind creating a JIRA to track this
issue? It would be great if you can help locate the line in
BinaryClassificationMetrics that caused the problem. Thanks! -Xiangrui
On Tue, Jul 15, 2014 at 10:56 PM, crater cq...@ucmerced.edu wrote:
I don't really have my code,
Check the number of inodes (df -i). The assembly build may create many
small files. -Xiangrui
On Tue, Jul 15, 2014 at 11:35 PM, Chris DuBois chris.dub...@gmail.com wrote:
Hi all,
I am encountering the following error:
INFO scheduler.TaskSetManager: Loss was due to java.io.IOException: No
driver
application
Chris
On Jul 15, 2014, at 11:39 PM, Xiangrui Meng men...@gmail.com wrote:
Check the number of inodes (df -i). The assembly build may create many
small files. -Xiangrui
On Tue, Jul 15, 2014 at 11:35 PM, Chris DuBois chris.dub...@gmail.com
wrote:
Hi all,
I am
...
Amin Mohebbi
PhD candidate in Software Engineering
at University of Malaysia
H/P : +60 18 2040 017
E-Mail : tp025...@ex.apiit.edu.my
amin_...@me.com
On Thursday, July 17, 2014 11:57 AM, Xiangrui Meng men...@gmail.com wrote:
kmeans.py contains a naive
1) This is a miss, unfortunately ... We will add support for
regularization and intercept in the coming v1.1. (JIRA:
https://issues.apache.org/jira/browse/SPARK-2550)
2) It has overflow problems in Python but not in Scala. We can
stabilize the computation by ensuring exp only takes a negative
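The note above is cut off, but the standard stabilization it refers to is: never pass exp a large positive argument. A minimal pure-Scala sketch of that trick (illustrative, not the actual MLlib code):

```scala
// Numerically stable logistic function: exp is only ever called with a
// non-positive argument, so it cannot overflow.
object StableSigmoid {
  def sigmoid(x: Double): Double =
    if (x >= 0) {
      1.0 / (1.0 + math.exp(-x))
    } else {
      val e = math.exp(x) // x < 0, so e is in (0, 1)
      e / (1.0 + e)
    }

  def main(args: Array[String]): Unit = {
    println(sigmoid(1000.0))  // 1.0, no overflow
    println(sigmoid(-1000.0)) // 0.0, no NaN
  }
}
```

In Scala, `math.exp(1000)` merely returns infinity, but in Python `math.exp(1000)` raises OverflowError, which is why the naive formula breaks there.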
Is it v0.9? Did you run in local mode? Try to set --driver-memory 4g
and repartition your data to match the number of CPU cores so that the
data is evenly distributed. You need 1m * 50 * 8 ~ 400MB to store
the data. Make sure there is enough memory for caching. -Xiangrui
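As a sanity check on the 400MB figure above (1 million points, 50 doubles each, 8 bytes per double):

```scala
// Back-of-the-envelope estimate of cached data size.
object MemoryEstimate {
  def main(args: Array[String]): Unit = {
    val numPoints = 1000000L
    val dim = 50L
    val bytesPerDouble = 8L
    val bytes = numPoints * dim * bytesPerDouble
    println(bytes)                      // 400000000 bytes
    println(bytes / (1024L * 1024L))    // ~381 MB, i.e. roughly 400MB
  }
}
```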
On Thu, Jul 17, 2014 at
to less than 200
seconds from over 700 seconds.
I am not sure how to repartition the data to match the CPU cores. How do I
do it?
Thank you.
Ravi
On Thu, Jul 17, 2014 at 3:17 PM, Xiangrui Meng men...@gmail.com wrote:
Is it v0.9? Did you run in local mode? Try to set --driver-memory 4g
Nick's suggestion is a good approach for your data. The item factors to
broadcast should be a few MBs. -Xiangrui
On Jul 18, 2014, at 12:59 AM, Bertrand Dechoux decho...@gmail.com wrote:
And you might want to apply clustering before. It is likely that every user
and every item are not
You can save RDDs to text files using RDD.saveAsTextFile and load them back using
sc.textFile. But make sure the record-to-string conversion is correctly
implemented if the type is not primitive, and that you have a parser to load them
back. -Xiangrui
On Jul 18, 2014, at 8:39 AM, Roch Denis
It was because of the latest change to task serialization:
https://github.com/apache/spark/commit/1efb3698b6cf39a80683b37124d2736ebf3c9d9a
The task size is no longer limited by akka.frameSize but we show
warning messages if the task size is above 100KB. Please check the
objects referenced in the
This is a useful feature but it may be hard to have it in v1.1 due to
limited time. Hopefully, we can support it in v1.2. -Xiangrui
On Mon, Jul 21, 2014 at 12:58 AM, Jiusheng Chen chenjiush...@gmail.com wrote:
It seems MLlib right now doesn't support weighted training, training samples
have
You can also try a different region. I tested us-west-2 yesterday, and
it worked well. -Xiangrui
On Sun, Jul 20, 2014 at 4:35 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
Actually the script in the master branch is also broken (it's pointing to an
older AMI). Try 1.0.1 for launching
This is a known issue:
https://issues.apache.org/jira/browse/SPARK-2197 . Joseph is working
on it. -Xiangrui
On Mon, Jul 21, 2014 at 4:20 PM, Jack Yang j...@uow.edu.au wrote:
So this is a bug unsolved (for java) yet?
From: Jack Yang [mailto:j...@uow.edu.au]
Sent: Friday, 18 July 2014 4:52
patch, the evaluation has been
processed successfully.
I expect that these patches are merged in the next major release (v1.1?).
Without them, it would be hard to use MLlib for a large dataset.
Thanks,
Makoto
(2014/07/16 15:05), Xiangrui Meng wrote:
Hi Makoto,
I don't remember I wrote
try `sbt/sbt clean` first? -Xiangrui
On Wed, Jul 23, 2014 at 11:23 AM, m3.sharma sharm...@umn.edu wrote:
I am trying to build spark after cloning from github repo:
I am executing:
./sbt/sbt -Dhadoop.version=2.4.0 -Pyarn assembly
I am getting following error:
[warn]
I'm happy to announce the availability of Spark 0.9.2! Spark 0.9.2 is
a maintenance release with bug fixes across several areas of Spark,
including Spark Core, PySpark, MLlib, Streaming, and GraphX. We
recommend that all 0.9.x users upgrade to this stable release.
Contributions to this release came
It is ALS with setNonnegative. -Xiangrui
On Fri, Jul 25, 2014 at 7:38 AM, Aureliano Buendia buendia...@gmail.com wrote:
Hi,
Is there an implementation for Nonnegative Matrix Factorization in Spark? I
understand that MLlib comes with matrix factorization, but it does not seem
to cover the
a shutdown
On Jul 2, 2014, at 0:08, Xiangrui Meng men...@gmail.com wrote:
Try to reduce the number of partitions to match the number of cores. We
will add treeAggregate to reduce the communication cost.
PR: https://github.com/apache/spark/pull/1110
-Xiangrui
On Tue, Jul 1, 2014 at 12:55 AM, Charles Li
I think this is nice to have. Feel free to create a JIRA for it and it
would be great if you can send a PR. Thanks! -Xiangrui
On Thu, Jul 24, 2014 at 12:39 PM, SK skrishna...@gmail.com wrote:
Hi,
The mllib.clustering.kmeans implementation supports a random or parallel
initialization mode to
1. I meant in the n (1k) by m (10k) case, we need to broadcast k
centers and hence the total size is m * k. In 1.0, the driver needs to
send the current centers to each partition one by one. In the current
master, we use torrent to broadcast the centers to workers, which
should be much faster.
2.
Great! Thanks for testing the new features! -Xiangrui
On Mon, Jul 28, 2014 at 8:58 PM, durin m...@simon-schaefer.net wrote:
Hi Xiangrui,
using the current master meant a huge improvement for my task. Something
that did not even finish before (training with 120G of dense data) now
completes
Are you using 1.0.0? There was a bug, which was fixed in 1.0.1 and
master. If you don't want to switch to 1.0.1 or master, try to cache
and count test first. -Xiangrui
On Mon, Jul 28, 2014 at 6:07 PM, SK skrishna...@gmail.com wrote:
Hi,
In order to evaluate the ML classification accuracy, I am
That looks good to me since there is no Hadoop InputFormat for HDF5.
But remember to specify the number of partitions in sc.parallelize to
use all the nodes. You can change `process` to `read` which yields
records one-by-one. Then sc.parallelize(files,
numPartitions).flatMap(read) returns an RDD
Do you mind sharing more details, for example, specs of nodes and data
size? -Xiangrui
2014-07-29 2:51 GMT-07:00 John Wu j...@zamplus.com:
Hi all,
There is a problem we can’t resolve. We implement the OWLQN algorithm in
parallel with SPARK,
We don’t know why it is very slow in every
Before torrent, http is the default way for broadcasting. The driver
holds the data and the executors request the data via http, making the
driver the bottleneck if the data is large. -Xiangrui
On Tue, Jul 29, 2014 at 10:32 AM, durin m...@simon-schaefer.net wrote:
Development is really rapid
Could you share more details about the dataset and the algorithm? For
example, if the dataset has 10M+ features, it may be slow for the driver to
collect the weights from executors (just a blind guess). -Xiangrui
On Tue, Jul 29, 2014 at 9:15 PM, Tan Tim unname...@gmail.com wrote:
Hi, all
The weight vector is usually dense and if you have many partitions,
the driver may slow down. You can also take a look at the driver
memory inside the Executor tab in WebUI. Another setting to check is
the HDFS block size and whether the input data is evenly distributed
to the executors. Are the
the total tasks reduce from 200 to 64 after the first stage, like this:
But I don't know if this is reasonable.
On Wed, Jul 30, 2014 at 2:11 PM, Xiangrui Meng men...@gmail.com wrote:
After you load the data in, call `.repartition(number of
executors).cache()`. If the data is evenly
Some extra work is needed to close the loop. One related example is
streaming linear regression added by Jeremy very recently:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/StreamingLinearRegression.scala
You can use a model trained offline
Do you mind testing 1.1-SNAPSHOT and allocating more memory to the driver?
I think the problem is with the feature dimension. KDD data has more than
20M features and in v1.0.1, the driver collects the partial gradients one
by one, sums them up, does the update, and then sends the new weights back
It is used in data loading:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/SparseNaiveBayes.scala#L76
On Thu, Aug 7, 2014 at 12:47 AM, SK skrishna...@gmail.com wrote:
I followed the example in
Then this may be a bug. Do you mind sharing the dataset that we can
use to reproduce the problem? -Xiangrui
On Thu, Aug 7, 2014 at 1:20 AM, SK skrishna...@gmail.com wrote:
Spark 1.0.1
thanks
Did you cache the table? There are couple ways of caching a table in
Shark: https://github.com/amplab/shark/wiki/Shark-User-Guide
On Thu, Aug 7, 2014 at 6:51 AM, vinay.kash...@socialinfra.net wrote:
Dear all,
I am using Spark 0.9.2 in Standalone mode. Hive and HDFS in CDH 5.1.0.
6 worker
ratings.map { case Rating(u, m, r) =>
  val pred = model.predict(u, m)
  (r - pred) * (r - pred)
}.mean()
The code doesn't work because the userFeatures and productFeatures
stored in the model are RDDs. You tried to serialize them into the
task closure, and execute `model.predict` on an
Besides durin's suggestion, please also confirm driver and executor
memory in the WebUI, since they are small according to the log:
14/08/07 19:59:10 INFO MemoryStore: Block broadcast_0 stored as values to
memory (estimated size 34.6 KB, free 303.3 MB)
-Xiangrui
Is the GPL library only available on the driver node? If that is the
case, you need to add them to `--jars` option of spark-submit.
-Xiangrui
On Thu, Aug 7, 2014 at 6:59 PM, Jikai Lei hangel...@gmail.com wrote:
I had the following error when trying to run a very simple spark job (which
uses
They are two different RDDs. Spark doesn't guarantee that the first
partition of RDD1 and the first partition of RDD2 will stay in the
same worker node. If it did, then with 1000
single-partition RDDs the first worker would have a very heavy load.
-Xiangrui
On Thu, Aug 7, 2014 at 2:20
Thanks for posting the solution! You can also append `% provided` to
the `spark-mllib` dependency line and remove `spark-core` (because
spark-mllib already depends on spark-core) to make the assembly jar
smaller. -Xiangrui
On Fri, Aug 8, 2014 at 10:05 AM, SK skrishna...@gmail.com wrote:
i was
For the reduce/aggregate question, driver collects results in
sequence. We now use tree aggregation in MLlib to reduce driver's
load:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/RDDFunctions.scala#L89
It is faster than aggregate when there are many
If the file is big enough, you can try MLUtils.loadLibSVMFile with a
minPartitions argument. This doesn't shuffle data but it might not
give you the exact number of partitions. If you want to have the exact
number, use RDD.repartition, which requires data shuffling. -Xiangrui
On Sun, Aug 10, 2014
Assuming that your data is very sparse, I would recommend
RDD.repartition. But if it is not the case and you don't want to
shuffle the data, you can try a CombineInputFormat and then parse the
lines into labeled points. Coalesce may cause locality problems if you
didn't use the right number of
For linear models, the constructors are now public. You can save the
weights to HDFS, then load the weights back and use the constructor to
create the model. -Xiangrui
On Mon, Aug 11, 2014 at 10:27 PM, XiaoQinyu xiaoqinyu_sp...@outlook.com wrote:
hello:
I want to know,if I use history data to
into a large partition if I cache it, so
same issues as #1?
For coalesce, could you share some best practice how to set the right number
of partitions to avoid locality problem?
Thanks!
On Tue, Aug 12, 2014 at 3:51 PM, Xiangrui Meng men...@gmail.com wrote:
Assuming that your data is very
You can define an evaluation metric first and then use a grid search
to find the best set of training parameters. Ampcamp has a tutorial
showing how to do this for ALS:
http://ampcamp.berkeley.edu/big-data-mini-course/movie-recommendation-with-mllib.html
-Xiangrui
On Tue, Aug 12, 2014 at 8:01 PM,
, Aug 13, 2014 at 1:26 PM, Xiangrui Meng men...@gmail.com wrote:
You can define an evaluation metric first and then use a grid search
to find the best set of training parameters. Ampcamp has a tutorial
showing how to do this for ALS:
http://ampcamp.berkeley.edu/big-data-mini-course/movie
Could you try to map it to row-majored first? Your approach may
generate multiple copies of the data. The code should look like this:
~~~
val rows = rdd.map { case (j, values) =>
  values.view.zipWithIndex.map { case (v, i) =>
    (i, (j, v))
  }
}.groupByKey().map { case (i, entries) =>
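The snippet above is cut off by the archive. As a sketch of how the transformation could be completed, here is the same column-major-to-row-major logic on plain Scala collections (no Spark), sorting each row's entries by column index; all names here are illustrative:

```scala
object ColToRow {
  // Convert column-major data (columnIndex -> column values) into
  // row-major form (rowIndex -> row values), mirroring the RDD
  // version's map / groupByKey / map pipeline.
  def toRows(cols: Seq[(Int, Seq[Double])]): Map[Int, Seq[Double]] =
    cols.flatMap { case (j, values) =>
      values.zipWithIndex.map { case (v, i) => (i, (j, v)) }
    }
    .groupBy(_._1)
    .map { case (i, entries) =>
      (i, entries.map(_._2).sortBy(_._1).map(_._2))
    }

  def main(args: Array[String]): Unit = {
    // Two columns of a 2x2 matrix: col 0 = (1, 3), col 1 = (2, 4)
    val cols = Seq((0, Seq(1.0, 3.0)), (1, Seq(2.0, 4.0)))
    println(toRows(cols)) // row 0 -> (1.0, 2.0), row 1 -> (3.0, 4.0)
  }
}
```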
Guoqiang reported some results in his PRs
https://github.com/apache/spark/pull/828 and
https://github.com/apache/spark/pull/929 . But this is really
problem-dependent. -Xiangrui
On Fri, Aug 15, 2014 at 12:30 PM, Debasish Das debasish.da...@gmail.com wrote:
Hi,
Are there any experiments
What is the ratio of examples labeled `s` to those labeled `b`? Also,
Naive Bayes doesn't work on negative feature values. It assumes term
frequencies as the input. We should throw an exception on negative
feature values. -Xiangrui
On Tue, Aug 19, 2014 at 12:07 AM, Phuoc Do phu...@vida.io wrote:
The categorical features must be encoded into indices starting from 0:
0, 1, ..., numCategories - 1. Then you can provide the
categoricalFeatureInfo map to specify which columns contain
categorical features and the number of categories in each. Joseph is
updating the user guide. But if you want to
, Rahul Bhojwani rahulbhojwani2...@gmail.com
wrote:
Thanks a lot Xiangrui. This will help.
On Wed, Jul 9, 2014 at 1:34 AM, Xiangrui Meng men...@gmail.com wrote:
Hi Rahul,
We plan to add online model updates with Spark Streaming, perhaps in
v1.1, starting with linear methods. Please open
There are only 5 worker nodes. So please try to reduce the number of
partitions to the number of available CPU cores. 1000 partitions are
too many, because the driver needs to collect the task result from
each partition. -Xiangrui
On Tue, Aug 19, 2014 at 1:41 PM, durin m...@simon-schaefer.net
, Xiangrui Meng men...@gmail.com wrote:
What is the ratio of examples labeled `s` to those labeled `b`? Also,
Naive Bayes doesn't work on negative feature values. It assumes term
frequencies as the input. We should throw an exception on negative
feature values. -Xiangrui
On Tue, Aug 19, 2014
We implemented chi-squared tests in v1.1:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/stat/Statistics.scala#L166
and we will add more after v1.1. Feedback on which tests should come
first would be greatly appreciated. -Xiangrui
On Tue, Aug 19, 2014 at
other tests, which are also important when
using logistic regression to build score cards.
Xiaobo Gu
-- Original --
From: Xiangrui Meng;men...@gmail.com;
Send time: Wednesday, Aug 20, 2014 2:18 PM
To: guxiaobo1...@qq.com;
Cc: user@spark.apache.org
How many partitions now? Btw, which Spark version are you using? I
checked your code and I don't understand why you want to broadcast
vectors2, which is an RDD.
var vectors2 =
vectors.repartition(1000).persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK_SER)
var broadcastVector =
Hi Wei,
Please keep us posted about the performance result you get. This would
be very helpful.
Best,
Xiangrui
On Wed, Aug 27, 2014 at 10:33 AM, Wei Tan w...@us.ibm.com wrote:
Thank you all. Actually I was looking at JCUDA. Function wise this may be a
perfect solution to offload computation
No. The indices start at 0 for every RDD. -Xiangrui
On Wed, Aug 27, 2014 at 2:37 PM, Soumitra Kumar
kumar.soumi...@gmail.com wrote:
Hello,
If I do:
DStream transform {
rdd.zipWithIndex.map {
Is the index guaranteed to be unique across all RDDs here?
}
}
Thanks,
kumar.soumi...@gmail.com wrote:
So, I guess zipWithUniqueId will be similar.
Is there a way to get unique index?
On Wed, Aug 27, 2014 at 2:39 PM, Xiangrui Meng men...@gmail.com wrote:
No. The indices start at 0 for every RDD. -Xiangrui
On Wed, Aug 27, 2014 at 2:37 PM, Soumitra Kumar
kumar.soumi
Are you using hadoop-1.0? Hadoop doesn't support splittable bz2 files
before 1.2 (or a later version). But due to a bug
(https://issues.apache.org/jira/browse/HADOOP-10614), you should try
hadoop-2.5.0. -Xiangrui
On Wed, Aug 27, 2014 at 2:49 PM, jerryye jerr...@gmail.com wrote:
Hi,
I'm running
This is not supported in MLlib. Hopefully, we will add support for
weighted examples in v1.2. If you want to train weighted instances
with the current tree implementation, please try importance sampling
first to adjust the weights. For instance, an example with weight 0.3
is sampled with
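The reply above is truncated; one common reading of the importance-sampling suggestion is that an example with weight w in (0, 1] is kept with probability w, and weights above 1 are handled by emitting floor(w) copies plus one more with the fractional probability. A pure-Scala sketch on a local collection (illustrative, not an MLlib API):

```scala
import scala.util.Random

object WeightedSampling {
  // Approximate a weighted training set by sampling: weight w in (0, 1]
  // keeps the example with probability w; w > 1 emits floor(w) full
  // copies plus one more with probability w - floor(w).
  def resample[T](data: Seq[(T, Double)], rng: Random): Seq[T] =
    data.flatMap { case (x, w) =>
      val whole = w.toInt
      val extra = if (rng.nextDouble() < w - whole) 1 else 0
      Seq.fill(whole + extra)(x)
    }

  def main(args: Array[String]): Unit = {
    val rng = new Random(42)
    val data = Seq.fill(10000)(("example", 0.3))
    println(resample(data, rng).size) // roughly 3000 of 10000 survive
  }
}
```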
We have a pending PR (https://github.com/apache/spark/pull/216) for
discretization but it has performance issues. We will try to spend
more time to improve it. -Xiangrui
On Tue, Sep 2, 2014 at 2:56 AM, filipus floe...@gmail.com wrote:
i guess i found it
I think they are the same. If you have hub (https://hub.github.com/)
installed, you can run
hub checkout https://github.com/apache/spark/pull/216
and then `sbt/sbt assembly`
-Xiangrui
On Wed, Sep 3, 2014 at 12:03 AM, filipus floe...@gmail.com wrote:
howto install? just clone by git clone
There is a sliding method implemented in MLlib
(https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/SlidingRDD.scala),
which is used in computing Area Under Curve:
Maybe copy the implementation of StandardScaler from 1.1 and use it in
v1.0.x. -Xiangrui
On Wed, Sep 3, 2014 at 5:10 PM, Yana Kadiyska yana.kadiy...@gmail.com wrote:
It seems like the next release will add a nice
org.apache.spark.mllib.feature package but what is the recommended way to
, or at least
expose this parameter to users (*CoalescedRDD* is a private class, and
*RDD*.*coalesce* also doesn't have a parameter to control that).
On Wed, Aug 13, 2014 at 12:28 AM, Xiangrui Meng men...@gmail.com wrote:
Sorry, I missed #2. My suggestion is the same as #2. You need to set a
bigger
You can try LinearRegression with sparse input. It converges to the least
squares solution if the linear system is over-determined, while the
convergence rate depends on the condition number. Applying standard
scaling is a popular heuristic to reduce the condition number.
If you are interested in
There is an undocumented configuration to put users jars in front of
spark jar. But I'm not very certain that it works as expected (and
this is why it is undocumented). Please try turning on
spark.yarn.user.classpath.first . -Xiangrui
On Sat, Sep 6, 2014 at 5:13 PM, Victor Tso-Guillen
When you submit the job to yarn with spark-submit, set --conf
spark.yarn.user.classpath.first=true .
On Mon, Sep 8, 2014 at 10:46 AM, Penny Espinoza
pesp...@societyconsulting.com wrote:
I don't understand what you mean. Can you be more specific?
From: Victor
Could you attach the driver log? -Xiangrui
On Mon, Sep 8, 2014 at 7:23 AM, Hui Li hli161...@gmail.com wrote:
I am running a very simple example using the SVMWithSGD on Amazon EMR. I
haven't got any result after one hour long.
My instance-type is: m3.large
instance-count is: 3
Dataset
to partition
the problem cleanly and don't need a consensus step (which is what I
implemented in the code)
Thanks
Deb
On Sep 7, 2014 11:35 PM, Xiangrui Meng men...@gmail.com wrote:
You can try LinearRegression with sparse input. It converges to the least
squares solution if the linear system is over
If you are using the Mahout's Multinomial Naive Bayes, it should be
the same as MLlib's. I tried MLlib with news20.scale downloaded from
http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html
and the test accuracy is 82.4%. -Xiangrui
On Tue, Sep 9, 2014 at 4:58 AM, jatinpreet
I wrote an input format for Redshift tables unloaded with UNLOAD and the
ESCAPE option: https://github.com/mengxr/redshift-input-format , which
can recognize multi-line records.
Redshift puts a backslash before any in-record `\\`, `\r`, `\n`, and
the delimiter character. You can apply the same escaping
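The escaping rule described above can be undone with a small helper. A minimal sketch (assuming the record has already been split on the unescaped delimiter; the real input format is at the linked repo):

```scala
object RedshiftUnescape {
  // Undo Redshift UNLOAD ... ESCAPE quoting: a backslash precedes any
  // in-record backslash, \r, \n, or delimiter character, so we drop
  // each backslash and keep the character that follows it.
  def unescape(field: String): String = {
    val sb = new StringBuilder
    var i = 0
    while (i < field.length) {
      if (field.charAt(i) == '\\' && i + 1 < field.length) {
        sb.append(field.charAt(i + 1)); i += 2
      } else {
        sb.append(field.charAt(i)); i += 1
      }
    }
    sb.toString
  }

  def main(args: Array[String]): Unit = {
    println(unescape("a\\|b"))  // a|b (escaped pipe delimiter)
    println(unescape("a\\\\b")) // a\b (escaped backslash)
  }
}
```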
Please check the colStats method defined under mllib.stat.Statistics. -Xiangrui
On Mon, Sep 15, 2014 at 1:00 PM, jamborta jambo...@gmail.com wrote:
Hi all,
I have an RDD that contains around 50 columns. I need to sum each column,
which I am doing by running it through a for loop, creating an
Spark doesn't support MultipleOutput at this time. You can cache the
parent RDD. Then create RDDs from it and save them separately.
-Xiangrui
On Fri, Sep 12, 2014 at 7:45 AM, Guillermo Ortiz konstt2...@gmail.com wrote:
I would like to define the names of my output in Spark, I have a process
Thanks for the update! -Xiangrui
On Sun, Sep 14, 2014 at 11:33 PM, jatinpreet jatinpr...@gmail.com wrote:
Hi,
I have been able to get the same accuracy with MLlib as Mahout's. The
pre-processing phase of Mahout was the reason behind the accuracy mismatch.
After studying and applying the
Or you can use the factory method `Vectors.sparse`:
val sv = Vectors.sparse(numProducts, productIds.map(x => (x, 1.0)))
where numProducts should be the largest product id plus one.
Best,
Xiangrui
On Mon, Sep 15, 2014 at 12:46 PM, Chris Gore cdg...@cdgore.com wrote:
Hi Sameer,
MLLib uses
The importance should be based on some statistics, for example, the
standard deviation of the feature column and the magnitude of the
weight. If the columns are scaled to unit standard deviation (using
StandardScaler), you can tell the importance by the absolute value of
the weight. But there are
Did you cache `features`? Without caching it is slow because we need
O(k) iterations. The storage requirement on the driver is about 2 * n
* k * 8 bytes = 2 * 3 million * 200 * 8 ~= 9GB, not considering any overhead.
Computing U is also an expensive task in your case. We should use some
randomized SVD
We don't support kernels because it doesn't scale well. Please check
When to use LIBLINEAR but not LIBSVM on
http://www.csie.ntu.edu.tw/~cjlin/liblinear/index.html . I like Jey's
suggestion on expanding features. -Xiangrui
On Thu, Sep 18, 2014 at 12:29 PM, Jey Kottalam j...@cs.berkeley.edu wrote:
Could you create a JIRA for it? We can either remove special
characters or encode with alphanumerics. -Xiangrui
On Thu, Sep 18, 2014 at 3:50 PM, SK skrishna...@gmail.com wrote:
Hi,
The default log files for the Mllib examples use a rather long naming
convention that includes special
Does the feature size 43839 equal the number of terms? Check the output
dimension of your feature vectorizer and reduce the number of partitions
to match the number of physical cores. I saw you set
spark.storage.memoryFraction to 0.0. Maybe it is better to keep the
default. Also please confirm the
Your dataset is small. NaiveBayes should work under the default
settings, even in local mode. Could you try local mode first without
changing any Spark settings? Since your dataset is small, could you
save the vectorized data (RDD[LabeledPoint]) and send me a sample? I
want to take a look at the
7000x7000 is not a tall-and-skinny matrix. Storing the dense matrix
requires 784MB. The driver needs more storage for collecting result
from executors as well as making a copy for LAPACK's dgesvd. So you
need more memory. Do you need the full SVD? If not, try to use a small
k, e.g, 50. -Xiangrui
On
For the vectorizer, what's the output feature dimension and are you
creating sparse vectors or dense vectors? The model on the driver
consists of numClasses * numFeatures doubles. However, the driver
needs more memory in order to receive the task result (of the same
size) from executors. So you
Please also check the load balance of the RDD on YARN. How many
partitions are you using? Does it match the number of CPU cores?
-Xiangrui
On Thu, Sep 25, 2014 at 12:28 PM, bhusted brian.hus...@gmail.com wrote:
What is the size of your vector? Mine is set to 20. I am seeing slow results
as well.
We removed commons-math3 from dependencies to avoid version conflict
with hadoop-common. hadoop-common-2.3+ depends on commons-math3-3.1.1,
while breeze depends on commons-math3-3.3. 3.3 is not backward
compatible with 3.1.1. So we removed it because the breeze functions
we use do not touch
Hi Krishna,
Some planned features for MLlib 1.2 can be found via Spark JIRA:
http://bit.ly/1ywotkm , though this list is not fixed. The feature
freeze will happen by the end of Oct. Then we will cut branch-1.2 and
start QA. I don't recommend using branch-1.2 for hands-on tutorial
around Oct 29th
The test accuracy doesn't mean the total loss. All points between (-1,
1) can separate points -1 and +1 and give you 1.0 accuracy, but their
corresponding losses are different. -Xiangrui
On Sun, Sep 28, 2014 at 2:48 AM, Yanbo Liang yanboha...@gmail.com wrote:
Hi
We have used LogisticRegression
You may need a cluster with more memory. The current ALS
implementation constructs all subproblems in memory. With rank=10,
that means (6.5M + 2.5M) * 10^2 / 2 * 8 bytes = 3.5GB. The ratings
need 2GB, not counting the overhead. ALS creates in/out blocks to
optimize the computation, which takes
We don't handle missing value imputation in the current version of
MLlib. In future releases, we can store feature information in the
dataset metadata, which may store the default value to replace missing
values. But no one is committed to work on this feature. For now, you
can filter out examples
ALS still needs to load and deserialize the in/out blocks (one by one)
from disk and then construct least squares subproblems. All happen in
RAM. The final model is also stored in memory. -Xiangrui
On Wed, Oct 1, 2014 at 4:36 AM, Alex T chiorts...@gmail.com wrote:
Hi, thanks for the reply.
I
Yes, the bigram in that demo only has two characters, which could
separate different character sets. -Xiangrui
On Wed, Oct 1, 2014 at 2:54 PM, Liquan Pei liquan...@gmail.com wrote:
The program computes hashing bi-gram frequency normalized by total number of
bigrams, then filters out zero values.
The cost depends on the feature dimension, number of instances, number
of classes, and number of partitions. Do you mind sharing those
numbers? -Xiangrui
On Wed, Oct 1, 2014 at 6:31 PM, Mike Bernico mike.bern...@gmail.com wrote:
Hi Everyone,
I'm working on training mllib's Naive Bayes to
Which Spark version are you using? It works in 1.1.0 but not in 1.0.0.
-Xiangrui
On Wed, Oct 1, 2014 at 2:13 PM, Jimmy McErlain ji...@sellpoints.com wrote:
So I am trying to print the model output from MLlib however I am only
getting things like the following:
Did you add a different version of breeze to the classpath? In Spark
1.0, we use breeze 0.7, and in Spark 1.1 we use 0.9. If the breeze
version you used is different from the one that comes with Spark, you might
see class not found. -Xiangrui
On Fri, Oct 3, 2014 at 4:22 AM, Priya Ch
The current impl of ALS constructs least squares subproblems in
memory. So for rank 100, the total memory it requires is about 480,189
* 100^2 / 2 * 8 bytes ~ 20GB, divided by the number of blocks. For
rank 1000, this number goes up to 2TB, unfortunately. There is a JIRA
for optimizing ALS:
It would be really helpful if you can help test the scalability of the
new ALS impl:
https://github.com/mengxr/spark-als/blob/master/src/main/scala/org/apache/spark/ml/SimpleALS.scala
. It should be faster and more scalable, but the code is messy now.
Best,
Xiangrui
On Fri, Oct 3, 2014 at 11:57