Maybe copy the implementation of StandardScaler from 1.1 and use it in
v1.0.x. -Xiangrui
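(Illustrative aside, not from the original message: a sketch of the 1.1 mllib.feature API being referenced, assuming `data` is an RDD[Vector].)
~~~
import org.apache.spark.mllib.feature.StandardScaler

// fit computes the column statistics; transform applies the scaling to each vector
val scaler = new StandardScaler(withMean = true, withStd = true).fit(data)
val scaled = data.map(v => scaler.transform(v))
~~~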
On Wed, Sep 3, 2014 at 5:10 PM, Yana Kadiyska yana.kadiy...@gmail.com wrote:
It seems like the next release will add a nice
org.apache.spark.mllib.feature package but what is the recommended way to
, or at least
expose this parameter to users (*CoalescedRDD* is a private class, and
*RDD.coalesce* also doesn't have a parameter to control that).
On Wed, Aug 13, 2014 at 12:28 AM, Xiangrui Meng men...@gmail.com wrote:
Sorry, I missed #2. My suggestion is the same as #2. You need to set a
bigger
Hi Wei,
Please keep us posted about the performance result you get. This would
be very helpful.
Best,
Xiangrui
On Wed, Aug 27, 2014 at 10:33 AM, Wei Tan w...@us.ibm.com wrote:
Thank you all. Actually I was looking at JCUDA. Function-wise this may be a
perfect solution to offload computation
No. The indices start at 0 for every RDD. -Xiangrui
On Wed, Aug 27, 2014 at 2:37 PM, Soumitra Kumar
kumar.soumi...@gmail.com wrote:
Hello,
If I do:
dstream.transform { rdd =>
  rdd.zipWithIndex.map { ... }
}
Is the index guaranteed to be unique across all RDDs here?
Thanks,
kumar.soumi...@gmail.com wrote:
So, I guess zipWithUniqueId will be similar.
Is there a way to get unique index?
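(Illustrative aside, not from the thread: since the index restarts for every RDD, one way to get IDs that are unique across batches is to pair each batch's time with the per-RDD index.)
~~~
import org.apache.spark.streaming.dstream.DStream

// (batch time in ms, per-RDD index) is unique across all RDDs of the stream
def withGlobalIds[T](stream: DStream[T]): DStream[((Long, Long), T)] =
  stream.transform { (rdd, time) =>
    rdd.zipWithIndex.map { case (v, i) => ((time.milliseconds, i), v) }
  }
~~~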
On Wed, Aug 27, 2014 at 2:39 PM, Xiangrui Meng men...@gmail.com wrote:
No. The indices start at 0 for every RDD. -Xiangrui
On Wed, Aug 27, 2014 at 2:37 PM, Soumitra Kumar
kumar.soumi
Are you using hadoop-1.0? Hadoop versions before 1.2 don't support
splittable bz2 files. But due to a bug
(https://issues.apache.org/jira/browse/HADOOP-10614), you should try
hadoop-2.5.0. -Xiangrui
On Wed, Aug 27, 2014 at 2:49 PM, jerryye jerr...@gmail.com wrote:
Hi,
I'm running
How many partitions now? Btw, which Spark version are you using? I
checked your code and I don't understand why you want to broadcast
vectors2, which is an RDD.
var vectors2 =
vectors.repartition(1000).persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK_SER)
var broadcastVector =
other tests, which are also important when
using logistic regression to build score cards.
Xiaobo Gu
-- Original --
From: Xiangrui Meng;men...@gmail.com;
Send time: Wednesday, Aug 20, 2014 2:18 PM
To: guxiaobo1...@qq.com;
Cc: user@spark.apache.org
We implemented chi-squared tests in v1.1:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/stat/Statistics.scala#L166
and we will add more after v1.1. Feedback on which tests should come
first would be greatly appreciated. -Xiangrui
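(A tiny illustrative example of the v1.1 API linked above; the counts are made up.)
~~~
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

// Pearson's goodness-of-fit test of observed counts against a uniform expected distribution
val observed = Vectors.dense(4.0, 6.0, 5.0)
val result = Statistics.chiSqTest(observed)
println(result.pValue)
~~~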
On Tue, Aug 19, 2014 at
What is the ratio of examples labeled `s` to those labeled `b`? Also,
Naive Bayes doesn't work on negative feature values. It assumes term
frequencies as the input. We should throw an exception on negative
feature values. -Xiangrui
On Tue, Aug 19, 2014 at 12:07 AM, Phuoc Do phu...@vida.io wrote:
The categorical features must be encoded into indices starting from 0:
0, 1, ..., numCategories - 1. Then you can provide the
categoricalFeatureInfo map to specify which columns contain
categorical features and the number of categories in each. Joseph is
updating the user guide. But if you want to
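(A minimal sketch of the encoding described above, not from the thread; it assumes the 1.1 DecisionTree.trainClassifier signature and made-up category counts for columns 2 and 5.)
~~~
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.rdd.RDD

def trainWithCategoricals(data: RDD[LabeledPoint]) = {
  // column 2 has 4 categories encoded as 0..3, column 5 has 3 categories encoded as 0..2
  DecisionTree.trainClassifier(data, numClasses = 2,
    categoricalFeaturesInfo = Map(2 -> 4, 5 -> 3),
    impurity = "gini", maxDepth = 5, maxBins = 32)
}
~~~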
, Rahul Bhojwani rahulbhojwani2...@gmail.com
wrote:
Thanks a lot Xiangrui. This will help.
On Wed, Jul 9, 2014 at 1:34 AM, Xiangrui Meng men...@gmail.com wrote:
Hi Rahul,
We plan to add online model updates with Spark Streaming, perhaps in
v1.1, starting with linear methods. Please open
There are only 5 worker nodes. So please try to reduce the number of
partitions to the number of available CPU cores. 1000 partitions are
too many, because the driver needs to collect the task result from
each partition. -Xiangrui
On Tue, Aug 19, 2014 at 1:41 PM, durin m...@simon-schaefer.net
, Xiangrui Meng men...@gmail.com wrote:
What is the ratio of examples labeled `s` to those labeled `b`? Also,
Naive Bayes doesn't work on negative feature values. It assumes term
frequencies as the input. We should throw an exception on negative
feature values. -Xiangrui
On Tue, Aug 19, 2014
Guoqiang reported some results in his PRs
https://github.com/apache/spark/pull/828 and
https://github.com/apache/spark/pull/929 . But this is really
problem-dependent. -Xiangrui
On Fri, Aug 15, 2014 at 12:30 PM, Debasish Das debasish.da...@gmail.com wrote:
Hi,
Are there any experiments
, Aug 13, 2014 at 1:26 PM, Xiangrui Meng men...@gmail.com wrote:
You can define an evaluation metric first and then use a grid search
to find the best set of training parameters. Ampcamp has a tutorial
showing how to do this for ALS:
http://ampcamp.berkeley.edu/big-data-mini-course/movie
Could you try to map it to row-major format first? Your approach may
generate multiple copies of the data. The code should look like this:
~~~
val rows = rdd.flatMap { case (j, values) =>
  values.view.zipWithIndex.map { case (v, i) =>
    (i, (j, v))
  }
}.groupByKey().map { case (i, entries) =>
You can define an evaluation metric first and then use a grid search
to find the best set of training parameters. Ampcamp has a tutorial
showing how to do this for ALS:
http://ampcamp.berkeley.edu/big-data-mini-course/movie-recommendation-with-mllib.html
-Xiangrui
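(An illustrative grid-search skeleton, not from the tutorial; `training`, `validation`, the `rmse` metric, and the parameter grid are all assumptions supplied by the caller.)
~~~
import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}
import org.apache.spark.rdd.RDD

def bestAlsModel(training: RDD[Rating], validation: RDD[Rating],
                 rmse: (MatrixFactorizationModel, RDD[Rating]) => Double): MatrixFactorizationModel = {
  val models = for {
    rank     <- Seq(8, 12)
    lambda   <- Seq(0.01, 0.1)
    numIters <- Seq(10, 20)
  } yield ALS.train(training, rank, numIters, lambda)
  models.minBy(model => rmse(model, validation))  // keep the model with the lowest validation error
}
~~~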
On Tue, Aug 12, 2014 at 8:01 PM,
Assuming that your data is very sparse, I would recommend
RDD.repartition. But if it is not the case and you don't want to
shuffle the data, you can try a CombineInputFormat and then parse the
lines into labeled points. Coalesce may cause locality problems if you
didn't use the right number of
For linear models, the constructors are now public. You can save the
weights to HDFS, then load the weights back and use the constructor to
create the model. -Xiangrui
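(A rough sketch of the save/reload pattern described above, assuming the now-public LogisticRegressionModel(weights, intercept) constructor; the storage format is just one possibility.)
~~~
import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.linalg.Vector

// write the weight vector and intercept to HDFS
def saveModel(sc: SparkContext, model: LogisticRegressionModel, path: String): Unit =
  sc.parallelize(Seq((model.weights, model.intercept)), 1).saveAsObjectFile(path)

// read them back and rebuild the model with the public constructor
def loadModel(sc: SparkContext, path: String): LogisticRegressionModel = {
  val (weights, intercept) = sc.objectFile[(Vector, Double)](path).first()
  new LogisticRegressionModel(weights, intercept)
}
~~~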
On Mon, Aug 11, 2014 at 10:27 PM, XiaoQinyu xiaoqinyu_sp...@outlook.com wrote:
hello:
I want to know, if I use history data to
into a large partition if I cache it, so
the same issues as #1?
For coalesce, could you share some best practices on how to set the right
number of partitions to avoid locality problems?
Thanks!
On Tue, Aug 12, 2014 at 3:51 PM, Xiangrui Meng men...@gmail.com wrote:
Assuming that your data is very
If the file is big enough, you can try MLUtils.loadLibSVMFile with a
minPartitions argument. This doesn't shuffle data but it might not
give you the exact number of partitions. If you want to have the exact
number, use RDD.repartition, which requires data shuffling. -Xiangrui
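(A quick sketch of the two options above; the path and partition counts are made up, and the loadLibSVMFile overload with numFeatures/minPartitions is assumed.)
~~~
import org.apache.spark.mllib.util.MLUtils

// Option 1: hint the partition count at load time (no shuffle, count is approximate)
val data = MLUtils.loadLibSVMFile(sc, "hdfs:///data/train.libsvm", numFeatures = -1, minPartitions = 64)

// Option 2: force an exact partition count (triggers a shuffle)
val exact = data.repartition(64)
~~~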
On Sun, Aug 10, 2014
They are two different RDDs. Spark doesn't guarantee that the first
partition of RDD1 and the first partition of RDD2 will stay in the
same worker node. If that is the case, if you have 1000
single-partition RDDs the first worker will have very heavy load.
-Xiangrui
On Thu, Aug 7, 2014 at 2:20
Thanks for posting the solution! You can also append `% provided` to
the `spark-mllib` dependency line and remove `spark-core` (because
spark-mllib already depends on spark-core) to make the assembly jar
smaller. -Xiangrui
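(A build.sbt sketch of that suggestion; the version number is only an example.)
~~~
// mark spark-mllib as provided so it is not bundled into the assembly jar;
// spark-core comes in transitively, so it no longer needs its own entry
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.0.2" % "provided"
~~~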
On Fri, Aug 8, 2014 at 10:05 AM, SK skrishna...@gmail.com wrote:
i was
For the reduce/aggregate question, the driver collects results in
sequence. We now use tree aggregation in MLlib to reduce the driver's
load:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/RDDFunctions.scala#L89
It is faster than aggregate when there are many
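(Illustrative aside: a sketch of the tree aggregation linked above, using the developer API in org.apache.spark.mllib.rdd.RDDFunctions; it assumes an active SparkContext `sc`, and the numbers are made up.)
~~~
import org.apache.spark.mllib.rdd.RDDFunctions._

val values = sc.parallelize(1 to 1000000, numSlices = 200)
val sum = values.treeAggregate(0L)(
  seqOp  = (acc, x) => acc + x,  // fold within each partition
  combOp = (a, b) => a + b,      // combine partial sums pairwise instead of all at the driver
  depth  = 2)
~~~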
It is used in data loading:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/SparseNaiveBayes.scala#L76
On Thu, Aug 7, 2014 at 12:47 AM, SK skrishna...@gmail.com wrote:
I followed the example in
Then this may be a bug. Do you mind sharing the dataset so that we can
use it to reproduce the problem? -Xiangrui
On Thu, Aug 7, 2014 at 1:20 AM, SK skrishna...@gmail.com wrote:
Spark 1.0.1
thanks
Did you cache the table? There are a couple of ways of caching a table
in Shark: https://github.com/amplab/shark/wiki/Shark-User-Guide
On Thu, Aug 7, 2014 at 6:51 AM, vinay.kash...@socialinfra.net wrote:
Dear all,
I am using Spark 0.9.2 in Standalone mode. Hive and HDFS in CDH 5.1.0.
6 worker
ratings.map { case Rating(u, m, r) => {
  val pred = model.predict(u, m)
  (r - pred) * (r - pred)
}
}.mean()
The code doesn't work because the userFeatures and productFeatures
stored in the model are RDDs. You tried to serialize them into the
task closure, and execute `model.predict` on an
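(A hedged sketch of the RDD-based alternative being pointed at, assuming `model` is a MatrixFactorizationModel and `ratings` is an RDD[Rating]; it keeps predict on the cluster instead of pulling the model into a closure.)
~~~
import org.apache.spark.SparkContext._
import org.apache.spark.mllib.recommendation.Rating

// score all (user, product) pairs with the RDD variant of predict, then join back
val predictions = model.predict(ratings.map(r => (r.user, r.product)))
  .map { case Rating(u, m, pred) => ((u, m), pred) }
val mse = ratings.map(r => ((r.user, r.product), r.rating))
  .join(predictions)
  .map { case (_, (actual, pred)) => (actual - pred) * (actual - pred) }
  .mean()
~~~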
Besides durin's suggestion, please also confirm driver and executor
memory in the WebUI, since they are small according to the log:
14/08/07 19:59:10 INFO MemoryStore: Block broadcast_0 stored as values to
memory (estimated size 34.6 KB, free 303.3 MB)
-Xiangrui
Is the GPL library only available on the driver node? If that is the
case, you need to add them to `--jars` option of spark-submit.
-Xiangrui
On Thu, Aug 7, 2014 at 6:59 PM, Jikai Lei hangel...@gmail.com wrote:
I had the following error when trying to run a very simple spark job (which
uses
Do you mind testing 1.1-SNAPSHOT and allocating more memory to the driver?
I think the problem is with the feature dimension. KDD data has more than
20M features and in v1.0.1, the driver collects the partial gradients one
by one, sums them up, does the update, and then sends the new weights back
Some extra work is needed to close the loop. One related example is
streaming linear regression added by Jeremy very recently:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/StreamingLinearRegression.scala
You can use a model trained offline
the total number of tasks dropped from 200 to 64 after the first stage, just like this:
But I don't know if this is reasonable.
On Wed, Jul 30, 2014 at 2:11 PM, Xiangrui Meng men...@gmail.com wrote:
After you load the data in, call `.repartition(number of
executors).cache()`. If the data is evenly
Do you mind sharing more details, for example, specs of nodes and data
size? -Xiangrui
2014-07-29 2:51 GMT-07:00 John Wu j...@zamplus.com:
Hi all,
There is a problem we can’t resolve. We implemented the OWLQN algorithm in
parallel with Spark,
and we don’t know why it is very slow in every
Before torrent broadcast, HTTP was the default way of broadcasting. The
driver holds the data and the executors request it via HTTP, making the
driver the bottleneck if the data is large. -Xiangrui
On Tue, Jul 29, 2014 at 10:32 AM, durin m...@simon-schaefer.net wrote:
Development is really rapid
Could you share more details about the dataset and the algorithm? For
example, if the dataset has 10M+ features, it may be slow for the driver to
collect the weights from executors (just a blind guess). -Xiangrui
On Tue, Jul 29, 2014 at 9:15 PM, Tan Tim unname...@gmail.com wrote:
Hi, all
The weight vector is usually dense and if you have many partitions,
the driver may slow down. You can also take a look at the driver
memory inside the Executor tab in WebUI. Another setting to check is
the HDFS block size and whether the input data is evenly distributed
to the executors. Are the
1. I meant in the n (1k) by m (10k) case, we need to broadcast k
centers and hence the total size is m * k. In 1.0, the driver needs to
send the current centers to each partition one by one. In the current
master, we use torrent to broadcast the centers to workers, which
should be much faster.
2.
Great! Thanks for testing the new features! -Xiangrui
On Mon, Jul 28, 2014 at 8:58 PM, durin m...@simon-schaefer.net wrote:
Hi Xiangrui,
using the current master meant a huge improvement for my task. Something
that did not even finish before (training with 120G of dense data) now
completes
Are you using 1.0.0? There was a bug, which was fixed in 1.0.1 and
master. If you don't want to switch to 1.0.1 or master, try to cache
and count test first. -Xiangrui
On Mon, Jul 28, 2014 at 6:07 PM, SK skrishna...@gmail.com wrote:
Hi,
In order to evaluate the ML classification accuracy, I am
That looks good to me since there is no Hadoop InputFormat for HDF5.
But remember to specify the number of partitions in sc.parallelize to
use all the nodes. You can change `process` to `read` which yields
records one-by-one. Then sc.parallelize(files,
numPartitions).flatMap(read) returns an RDD
I think this is nice to have. Feel free to create a JIRA for it and it
would be great if you can send a PR. Thanks! -Xiangrui
On Thu, Jul 24, 2014 at 12:39 PM, SK skrishna...@gmail.com wrote:
Hi,
The mllib.clustering.kmeans implementation supports a random or parallel
initialization mode to
It is ALS with setNonnegative. -Xiangrui
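(Illustrative only; assumes `ratings: RDD[Rating]` and arbitrary parameters.)
~~~
import org.apache.spark.mllib.recommendation.ALS

// nonnegative matrix factorization = ALS constrained to nonnegative factors
val model = new ALS()
  .setRank(10)
  .setIterations(10)
  .setLambda(0.01)
  .setNonnegative(true)
  .run(ratings)
~~~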
On Fri, Jul 25, 2014 at 7:38 AM, Aureliano Buendia buendia...@gmail.com wrote:
Hi,
Is there an implementation for Nonnegative Matrix Factorization in Spark? I
understand that MLlib comes with matrix factorization, but it does not seem
to cover the
a shutdown
On Jul 2, 2014, at 0:08, Xiangrui Meng men...@gmail.com wrote:
Try to reduce number of partitions to match the number of cores. We
will add treeAggregate to reduce the communication cost.
PR: https://github.com/apache/spark/pull/1110
-Xiangrui
On Tue, Jul 1, 2014 at 12:55 AM, Charles Li
patch, the evaluation has been
processed successfully.
I expect that these patches will be merged in the next major release (v1.1?).
Without them, it would be hard to use mllib for a large dataset.
Thanks,
Makoto
(2014/07/16 15:05), Xiangrui Meng wrote:
Hi Makoto,
I don't remember I wrote
try `sbt/sbt clean` first? -Xiangrui
On Wed, Jul 23, 2014 at 11:23 AM, m3.sharma sharm...@umn.edu wrote:
I am trying to build spark after cloning from github repo:
I am executing:
./sbt/sbt -Dhadoop.version=2.4.0 -Pyarn assembly
I am getting following error:
[warn]
I'm happy to announce the availability of Spark 0.9.2! Spark 0.9.2 is
a maintenance release with bug fixes across several areas of Spark,
including Spark Core, PySpark, MLlib, Streaming, and GraphX. We
recommend all 0.9.x users to upgrade to this stable release.
Contributions to this release came
This is a useful feature but it may be hard to have it in v1.1 due to
limited time. Hopefully, we can support it in v1.2. -Xiangrui
On Mon, Jul 21, 2014 at 12:58 AM, Jiusheng Chen chenjiush...@gmail.com wrote:
It seems MLlib right now doesn't support weighted training, training samples
have
You can also try a different region. I tested us-west-2 yesterday, and
it worked well. -Xiangrui
On Sun, Jul 20, 2014 at 4:35 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
Actually the script in the master branch is also broken (it's pointing to an
older AMI). Try 1.0.1 for launching
This is a known issue:
https://issues.apache.org/jira/browse/SPARK-2197 . Joseph is working
on it. -Xiangrui
On Mon, Jul 21, 2014 at 4:20 PM, Jack Yang j...@uow.edu.au wrote:
So this is a bug unsolved (for java) yet?
From: Jack Yang [mailto:j...@uow.edu.au]
Sent: Friday, 18 July 2014 4:52
It was because of the latest change to task serialization:
https://github.com/apache/spark/commit/1efb3698b6cf39a80683b37124d2736ebf3c9d9a
The task size is no longer limited by akka.frameSize but we show
warning messages if the task size is above 100KB. Please check the
objects referenced in the
Nick's suggestion is a good approach for your data. The item factors to
broadcast should be a few MBs. -Xiangrui
On Jul 18, 2014, at 12:59 AM, Bertrand Dechoux decho...@gmail.com wrote:
And you might want to apply clustering before. It is likely that every user
and every item are not
You can save RDDs to text files using RDD.saveAsTextFile and load it back using
sc.textFile. But make sure the record-to-string conversion is correctly
implemented if the type is not primitive, and that you have a parser to load
them back. -Xiangrui
On Jul 18, 2014, at 8:39 AM, Roch Denis
...
Amin Mohebbi
PhD candidate in Software Engineering
at university of Malaysia
H/P : +60 18 2040 017
E-Mail : tp025...@ex.apiit.edu.my
amin_...@me.com
On Thursday, July 17, 2014 11:57 AM, Xiangrui Meng men...@gmail.com wrote:
kmeans.py contains a naive
1) This is a miss, unfortunately ... We will add support for
regularization and intercept in the coming v1.1. (JIRA:
https://issues.apache.org/jira/browse/SPARK-2550)
2) It has overflow problems in Python but not in Scala. We can
stabilize the computation by ensuring exp only takes a negative
Is it v0.9? Did you run in local mode? Try to set --driver-memory 4g
and repartition your data to match number of CPU cores such that the
data is evenly distributed. You need 1m * 50 * 8 ~ 400MB to store
the data. Make sure there is enough memory for caching. -Xiangrui
On Thu, Jul 17, 2014 at
to less than 200
seconds from over 700 seconds.
I am not sure how to repartition the data to match the CPU cores. How do I
do it?
Thank you.
Ravi
On Thu, Jul 17, 2014 at 3:17 PM, Xiangrui Meng men...@gmail.com wrote:
Is it v0.9? Did you run in local mode? Try to set --driver-memory 4g
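(A one-liner answering the "how do I repartition" question above; assumes a SparkContext `sc`, 16 stands in for the total number of executor CPU cores, and the path is made up.)
~~~
val data = sc.textFile("hdfs:///data/train.txt").repartition(16).cache()
~~~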
, 2014 at 10:48 PM, Makoto Yui yuin...@gmail.com wrote:
Hello,
(2014/06/19 23:43), Xiangrui Meng wrote:
The execution was slow for the larger KDD cup 2012, Track 2 dataset
(235M+ records of 16.7M+ (2^24) sparse features in about 33.6GB) due to the
sequential aggregation of dense vectors
Then it may be a new issue. Do you mind creating a JIRA to track this
issue? It would be great if you can help locate the line in
BinaryClassificationMetrics that caused the problem. Thanks! -Xiangrui
On Tue, Jul 15, 2014 at 10:56 PM, crater cq...@ucmerced.edu wrote:
I don't really have my code,
Check the number of inodes (df -i). The assembly build may create many
small files. -Xiangrui
On Tue, Jul 15, 2014 at 11:35 PM, Chris DuBois chris.dub...@gmail.com wrote:
Hi all,
I am encountering the following error:
INFO scheduler.TaskSetManager: Loss was due to java.io.IOException: No
driver
application
Chris
On Jul 15, 2014, at 11:39 PM, Xiangrui Meng men...@gmail.com wrote:
Check the number of inodes (df -i). The assembly build may create many
small files. -Xiangrui
On Tue, Jul 15, 2014 at 11:35 PM, Chris DuBois chris.dub...@gmail.com
wrote:
Hi all,
I am
Could you share the code of RecommendationALS and the complete
spark-submit command line options? Thanks! -Xiangrui
On Mon, Jul 14, 2014 at 11:23 PM, Srikrishna S srikrishna...@gmail.com wrote:
Using properties file: null
Main class:
RecommendationALS
Arguments:
_train.csv
_validation.csv
I don't think MLlib supports model serialization/deserialization. You
got the error because the constructor is private. I created a JIRA for
this: https://issues.apache.org/jira/browse/SPARK-2488 and we will try to
make sure it is implemented in v1.1. For now, you can modify the
KMeansModel and remove
crater, was the error message the same as what you posted before:
14/07/14 11:32:20 ERROR TaskSchedulerImpl: Lost executor 1 on node7: remote
Akka client disassociated
14/07/14 11:32:20 WARN TaskSetManager: Lost TID 20 (task 13.0:0)
14/07/14 11:32:21 ERROR TaskSchedulerImpl: Lost executor 3 on
You should return an iterator in mapPartitionsWithIndex. This is from
the programming guide
(http://spark.apache.org/docs/latest/programming-guide.html):
mapPartitionsWithIndex(func): Similar to mapPartitions, but also
provides func with an integer value representing the index of the
partition,
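(A minimal sketch of the point above, assuming an existing `rdd`: the function passed to mapPartitionsWithIndex must return an Iterator.)
~~~
// tag every element with the index of the partition it lives in
val tagged = rdd.mapPartitionsWithIndex { (partitionIndex, iter) =>
  iter.map(x => (partitionIndex, x))
}
~~~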
You need to set a larger `spark.akka.frameSize`, e.g., 128, for the
serialized weight vector. There is a JIRA about switching
automatically between sending through akka or broadcast:
https://issues.apache.org/jira/browse/SPARK-2361 . -Xiangrui
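(One way to set it, as an illustration; 128 is the value suggested above, in MB.)
~~~
import org.apache.spark.SparkConf

val conf = new SparkConf().set("spark.akka.frameSize", "128")
~~~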
On Mon, Jul 14, 2014 at 12:15 AM, crater
Is it on a standalone server? There are several settings worth checking:
1) number of partitions, which should match the number of cores
2) driver memory (you can see it from the executor tab of the Spark
WebUI and set it with --driver-memory 10g)
3) the version of Spark you were running
Best,
You can load the dataset as an RDD of JSON objects and use a flatMap to
extract feature vectors at the object level. Then you can filter the
training examples you want for binary classification. If you want to
try multiclass, checkout DB's PR at
https://github.com/apache/spark/pull/1379
Best,
Xiangrui
SparkKMeans is a naive implementation. Please use
mllib.clustering.KMeans in practice. I created a JIRA for this:
https://issues.apache.org/jira/browse/SPARK-2434 -Xiangrui
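(An illustrative call to the recommended MLlib KMeans; `data` is assumed to be an RDD[Vector] and the parameters are arbitrary.)
~~~
import org.apache.spark.mllib.clustering.KMeans

val model = KMeans.train(data, k = 10, maxIterations = 20)
val cost = model.computeCost(data)  // sum of squared distances to the nearest center
~~~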
On Thu, Jul 10, 2014 at 2:45 AM, Tathagata Das
tathagata.das1...@gmail.com wrote:
I ran the SparkKMeans example (not the
news20.binary's feature dimension is 1.35M. So the serialized task
size is above the default limit 10M. You need to set
spark.akka.frameSize to, e.g, 20. Due to a bug SPARK-1112, this
parameter is not passed to executors automatically, which causes Spark
to freeze. This was fixed in the latest
This is expensive but doable:
rdd.zipWithIndex().filter { case (_, idx) => idx >= 10 && idx < 20 }.collect()
-Xiangrui
On Thu, Jul 10, 2014 at 12:53 PM, Nick Chammas
nicholas.cham...@gmail.com wrote:
Interesting question on Stack Overflow:
http://stackoverflow.com/q/24677180/877069
Basically, is
, Xiangrui Meng men...@gmail.com wrote:
It seems to me a setup issue. I just tested news20.binary (1355191
features) on a 2-node EC2 cluster and it worked well. I added one line
to conf/spark-env.sh:
export SPARK_JAVA_OPTS="-Dspark.akka.frameSize=20"
and launched spark-shell with --driver-memory
Hi Rahul,
We plan to add online model updates with Spark Streaming, perhaps in
v1.1, starting with linear methods. Please open a JIRA for Naive
Bayes. For Naive Bayes, we need to update the priors and conditional
probabilities, which means we should also remember the number of
observations for
Well, I believe this is a correct implementation but please let us
know if you run into problems. The NaiveBayes implementation in MLlib
v1.0 supports sparse data, which is usually the case for text
classification. I would recommend upgrading to v1.0. -Xiangrui
On Tue, Jul 8, 2014 at 7:20 AM,
try sbt/sbt clean first
On Tue, Jul 8, 2014 at 8:25 AM, bai阿蒙 smallmonkey...@hotmail.com wrote:
Hi guys,
when I try to compile the latest source with sbt/sbt compile, I get an error.
Can anyone help me?
The following is the detail: it may be caused by TestSQLContext.scala
[error]
[error]
1) The feature dimension should be a fixed number before you run
NaiveBayes. If you use bag of words, you need to handle the
word-to-index dictionary by yourself. You can either ignore the words
that never appear in training (because they have no effect in
prediction), or use hashing to randomly
You can either use sc.wholeTextFiles and then a flatMap to reduce the
number of partitions, or give more memory to the driver process by
using --driver-memory 20g and then call RDD.repartition(small number)
after you load the data in. -Xiangrui
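(A sketch of the first option; assumes a SparkContext `sc`, and the path and partition count are placeholders.)
~~~
// one record per file keeps the partition count low; split the contents afterwards
val lines = sc.wholeTextFiles("hdfs:///data/many-small-files/*", minPartitions = 16)
  .flatMap { case (_, content) => content.split("\n") }
~~~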
On Mon, Jul 7, 2014 at 7:38 PM, innowireless TaeYun
with slave2.
2) The execution was successful when run in local mode with reduced number
of partitions. Does this imply issues communicating/coordinating across
processes (i.e. driver, master and workers)?
Thanks,
Bharath
On Sun, Jul 6, 2014 at 11:37 AM, Xiangrui Meng men...@gmail.com wrote
No, but it should be easy to add one. -Xiangrui
On Mon, Jul 7, 2014 at 12:37 AM, Ulanov, Alexander
alexander.ula...@hp.com wrote:
Hi,
Is there a method in Spark/MLlib to convert DenseVector to SparseVector?
Best regards, Alexander
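(A minimal sketch of such a helper, which was not part of MLlib at the time; it simply keeps the nonzero entries.)
~~~
import org.apache.spark.mllib.linalg.{DenseVector, SparseVector}

def toSparse(v: DenseVector): SparseVector = {
  // indices come out in ascending order because zipWithIndex preserves order
  val nonzeros = v.values.zipWithIndex.filter { case (value, _) => value != 0.0 }
  new SparseVector(v.size, nonzeros.map(_._2), nonzeros.map(_._1))
}
~~~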
task ID 2
...
On Fri, Jul 4, 2014 at 5:52 AM, Xiangrui Meng men...@gmail.com wrote:
The feature dimension is small. You don't need a big akka.frameSize.
The default one (10M) should be sufficient. Did you cache the data
before calling LRWithSGD? -Xiangrui
On Thu, Jul 3, 2014 at 10:02 AM
Hi Thunder,
Please understand that both MLlib and breeze are in active
development. Before v1.0, we used jblas but in the public APIs we only
exposed Array[Double]. In v1.0, we introduced Vector that supports
both dense and sparse data and switched the backend to
breeze/netlib-java (except ALS).
Hi Dmitriy,
It is sweet to have the bindings, but it is very easy to degrade
performance with them. The BLAS/LAPACK APIs have been there for more
than 20 years and they are still the top choice for high-performance
linear algebra. I'm thinking about whether it is possible to make the
This is due to a bug in sampling, which was fixed in 1.0.1 and latest
master. See https://github.com/apache/spark/pull/1234 . -Xiangrui
On Wed, Jul 2, 2014 at 8:23 PM, x wasedax...@gmail.com wrote:
Hello,
I am a newbie to Spark MLlib and ran into a curious case when following the
instruction at
You can use either bin/run-example or bin/spark-submit to run example
code. scalac -d classes/ SparkKMeans.scala doesn't recognize Spark
classpath. There are examples in the official doc:
http://spark.apache.org/docs/latest/quick-start.html#where-to-go-from-here
-Xiangrui
On Tue, Jul 1, 2014 at
Try to reduce number of partitions to match the number of cores. We
will add treeAggregate to reduce the communication cost.
PR: https://github.com/apache/spark/pull/1110
-Xiangrui
On Tue, Jul 1, 2014 at 12:55 AM, Charles Li littlee1...@gmail.com wrote:
Hi Spark,
I am running LBFGS on our
We were not ready to expose it as a public API in v1.0. Both breeze
and MLlib are in rapid development. It would be possible to expose it
as a developer API in v1.1. For now, it should be easy to define a
toBreeze method in your own project. -Xiangrui
On Tue, Jul 1, 2014 at 12:17 PM, Koert
Could you post the code snippet and the error stack trace? -Xiangrui
On Mon, Jun 30, 2014 at 7:03 AM, Daniel Micol dmi...@gmail.com wrote:
Hello,
I’m trying to use KMeans with MLLib but am getting a TaskNotSerializable
error. I’m using Spark 0.9.1 and invoking the KMeans.run method with k = 2
You were using an old version of numpy, 1.4? I think this is fixed in
the latest master. Try to replace vec.dot(target) by numpy.dot(vec,
target), or use the latest master. -Xiangrui
On Mon, Jun 30, 2014 at 2:04 PM, Sam Jacobs sam.jac...@us.abb.com wrote:
Hi,
I modified the example code for
Hi Kyle,
A few questions:
1) Did you use `setIntercept(true)`?
2) How many features?
I'm a little worried about driver's load because the final aggregation
and weights update happen on the driver. Did you check driver's memory
usage as well?
Best,
Xiangrui
On Fri, Jun 27, 2014 at 8:10 AM,
Try to use --executor-memory 12g with spark-submit. Or you can set it
in conf/spark-defaults.properties and rsync it to all workers and then
restart. -Xiangrui
On Fri, Jun 27, 2014 at 1:05 PM, Peng Cheng pc...@uow.edu.au wrote:
I give up, communication must be blocked by the complex EC2 network
This is a planned feature for v1.1. I'm going to work on it after v1.0.1
release. -Xiangrui
On Jun 20, 2014, at 6:46 AM, Charles Earl charles.ce...@gmail.com wrote:
Looking for something like scikit's grid search module.
C
Your data source is S3 and data is used twice. m1.large does not have very good
network performance. Please try file.count() and see how fast it goes. -Xiangrui
On Jun 20, 2014, at 8:16 AM, mathias math...@socialsignificance.co.uk wrote:
Hi there,
We're trying out Spark and are
Denis, I think it is fine to have PLSA in MLlib. But I'm not familiar
with the modification you mentioned since the paper is new. We may
need to spend more time to learn the trade-offs. Feel free to create a
JIRA for PLSA and we can move our discussion there. It would be great
if you can share
It is because the frame size is not set correctly in the executor backend;
see SPARK-1112. We are going to fix it in v1.0.1. Did you try treeAggregate?
On Jun 19, 2014, at 2:01 AM, Makoto Yui yuin...@gmail.com wrote:
Xiangrui and Debasish,
(2014/06/18 6:33), Debasish Das wrote:
I did
,Integer is
unrelated to mllib.
Thanks,
Bharath
On Wed, Jun 18, 2014 at 7:14 AM, Bharath Ravi Kumar reachb...@gmail.com
wrote:
Hi Xiangrui ,
I'm using 1.0.0.
Thanks,
Bharath
On 18-Jun-2014 1:43 am, Xiangrui Meng men...@gmail.com wrote:
Hi Bharath,
Thanks for posting the details
Hi Makoto,
How many partitions did you set? If there are too many partitions,
please do a coalesce before calling ML algorithms.
Btw, could you try the tree branch in my repo?
https://github.com/mengxr/spark/tree/tree
I used tree aggregate in this branch. It should help with the scalability.
Hi Jayati,
Thanks for asking! MLlib algorithms are all implemented in Scala. It
makes it easier for us to maintain if we have the implementations in one
place. For the roadmap, please visit
http://www.slideshare.net/xrmeng/m-llib-hadoopsummit to see features
planned for v1.1. Before contributing new
Hi Bharath,
Thanks for posting the details! Which Spark version are you using?
Best,
Xiangrui
On Tue, Jun 17, 2014 at 6:48 AM, Bharath Ravi Kumar reachb...@gmail.com wrote:
Hi,
(Apologies for the long mail, but it's necessary to provide sufficient
details considering the number of issues
, where n is the number of partitions. It would be
great if someone can help test its scalability.
Best,
Xiangrui
On Tue, Jun 17, 2014 at 1:32 PM, Makoto Yui yuin...@gmail.com wrote:
Hi Xiangrui,
(2014/06/18 4:58), Xiangrui Meng wrote:
How many partitions did you set? If there are too many