Hi Michael,
I can help check the current implementation. Would you please go to
https://spark-project.atlassian.net/browse/SPARK and create a ticket
about this issue with component MLlib? Thanks!
Best,
Xiangrui
On Tue, Mar 11, 2014 at 3:18 PM, Michael Allman m...@allman.ms wrote:
Hi,
I'm
The factor matrix Y is used twice in implicit ALS computation, one to
compute global Y^T Y, and another to compute local Y_i^T C_i Y_i.
-Xiangrui
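For reference, the two uses of Y correspond to the two terms of the per-user normal equation in the implicit-feedback ALS formulation of Hu, Koren, and Volinsky. The symbols C^u, p(u), and \lambda follow that paper, not this thread, and the thread's Y_i^T C_i Y_i is presumably the local term restricted to the observed items. Treat this as a sketch of the math, not a description of the MLlib code:

```latex
x_u = \left( Y^\top C^u Y + \lambda I \right)^{-1} Y^\top C^u \, p(u),
\qquad
Y^\top C^u Y
  = \underbrace{Y^\top Y}_{\text{global, shared by all users}}
  + \underbrace{Y^\top \left( C^u - I \right) Y}_{\text{local, nonzero only for observed items}}
```

Because C^u - I is zero for every item the user has not interacted with, the second term only touches the rows of Y for that user's observed items, which is why Y^T Y can be computed once and reused.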
On Sun, Mar 16, 2014 at 1:18 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
On Mar 14, 2014, at 5:52 PM, Michael Allman m...@allman.ms wrote:
I
Hi Michael,
I made a couple of changes to implicit ALS. One gives faster construction
of YtY (https://github.com/apache/spark/pull/161), which was merged
into master. The other caches intermediate matrix factors properly
(https://github.com/apache/spark/pull/165). They should give you the
same result
Sorry, the link was wrong. Should be
https://github.com/apache/spark/pull/131 -Xiangrui
On Tue, Mar 18, 2014 at 10:20 AM, Michael Allman m...@allman.ms wrote:
Hi Xiangrui,
I don't see how https://github.com/apache/spark/pull/161 relates to ALS. Can
you explain?
Also, thanks for addressing
Hi Jaonary,
With the current implementation, you need to call Array.slice to make
each row an Array[Double] and cache the result RDD. There is a plan to
support block-wise input data and I will keep you informed.
Best,
Xiangrui
On Tue, Mar 18, 2014 at 2:46 AM, Jaonary Rabarisoa
Glad to hear about the speed-up. I hope we can improve the implementation
further in the future. -Xiangrui
On Tue, Mar 18, 2014 at 1:55 PM, Michael Allman m...@allman.ms wrote:
I just ran a runtime performance comparison between 0.9.0-incubating and your
als branch. I saw a 1.5x improvement in
Hi Tsai,
Could you share more information about the machine you used and the
training parameters (runs, k, and iterations)? It can help solve your
issues. Thanks!
Best,
Xiangrui
On Sun, Mar 23, 2014 at 3:15 AM, Tsai Li Ming mailingl...@ltsai.com wrote:
Hi,
At the reduceByKey stage, it takes
. K=50. Here's the code I use:
http://pastebin.com/2yXL3y8i , which is a copy-and-paste of the example.
Thanks!
On 24 Mar, 2014, at 2:46 pm, Xiangrui Meng men...@gmail.com wrote:
Hi Tsai,
Could you share more information about the machine you used and the
training parameters (runs, k
K.
Does the size of the input data matter for the example? Currently I have 50M
rows. What is a reasonable size to demonstrate the capability of Spark?
On 24 Mar, 2014, at 3:38 pm, Xiangrui Meng men...@gmail.com wrote:
K = 50 is certainly a large number for k-means
/driver/spark-shell?
Thanks!
On 25 Mar, 2014, at 1:03 am, Xiangrui Meng men...@gmail.com wrote:
Number of rows doesn't matter much as long as you have enough workers
to distribute the work. K-means has complexity O(n * d * k), where n
is number of points, d is the dimension, and k is the number
From API docs: Zips this RDD with another one, returning key-value
pairs with the first element in each RDD, second element in each RDD,
etc. Assumes that the two RDDs have the *same number of partitions*
and the *same number of elements in each partition* (e.g. one was made
through a map on the
Kuipers ko...@tresata.com wrote:
got it thanks
On Mon, Apr 7, 2014 at 4:08 PM, Xiangrui Meng men...@gmail.com wrote:
This is fixed in https://github.com/apache/spark/pull/281. Please try
again with the latest master. -Xiangrui
On Mon, Apr 7, 2014 at 1:06 PM, Koert Kuipers ko
...@gmail.com
Closes #281 from andrewor14/ui-storage-fix and squashes the
following commits:
408585a [Andrew Or] Fix storage UI bug
On Mon, Apr 7, 2014 at 4:21 PM, Koert Kuipers ko...@tresata.com wrote:
got it thanks
On Mon, Apr 7, 2014 at 4:08 PM, Xiangrui Meng men...@gmail.com
After sbt/sbt gen-idea, do not import as an SBT project but choose
open project and point it to the spark folder. -Xiangrui
On Tue, Apr 8, 2014 at 10:45 PM, Sean Owen so...@cloudera.com wrote:
I let IntelliJ read the Maven build directly and that works fine.
--
Sean Owen | Director, Data
It was moved to mllib.linalg.distributed.RowMatrix. With RowMatrix,
you can compute column summary statistics, gram matrix, covariance,
SVD, and PCA. We will provide multiplication for distributed matrices,
but not in v1.0. -Xiangrui
On Fri, Apr 11, 2014 at 9:12 PM, wxhsdp wxh...@gmail.com wrote:
Checkpoint clears dependencies. You might need checkpoint to cut a
long lineage in iterative algorithms. -Xiangrui
On Mon, Apr 21, 2014 at 11:34 AM, Diana Carroll dcarr...@cloudera.com wrote:
I'm trying to understand when I would want to checkpoint an RDD rather than
just persist to disk.
If the first partition doesn't have enough records, then it may not
drop enough lines. Try
rddData.zipWithIndex().filter(_._2 >= 10L).map(_._1)
It might trigger a job.
Best,
Xiangrui
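The difference between dropping lines in the first partition and filtering on a global index can be sketched without a cluster. The snippet below models an RDD as a list of partitions, which is a hypothetical stand-in and not Spark's API; the object and method names are made up for illustration.

```scala
object DropFirstLines {
  // Hypothetical stand-in for an RDD: each inner Seq is one partition.
  type Partitions = Seq[Seq[String]]

  // Drop n lines from partition 0 only. This under-drops whenever
  // partition 0 holds fewer than n lines.
  def dropInFirstPartition(parts: Partitions, n: Int): Seq[String] =
    parts.zipWithIndex.flatMap { case (p, i) => if (i == 0) p.drop(n) else p }

  // Assign a global index across all partitions, then keep indices >= n.
  // Local analogue of rdd.zipWithIndex().filter(_._2 >= n).map(_._1).
  def dropByGlobalIndex(parts: Partitions, n: Int): Seq[String] =
    parts.flatten.zipWithIndex.filter(_._2 >= n).map(_._1)

  def main(args: Array[String]): Unit = {
    // Three header lines, but partition 0 only holds two of them.
    val parts: Partitions = Seq(Seq("h1", "h2"), Seq("h3", "a", "b"), Seq("c"))
    println(dropInFirstPartition(parts, 3)) // "h3" survives
    println(dropByGlobalIndex(parts, 3))    // all three headers are gone
  }
}
```

The global index is what costs extra on a real RDD: computing it requires knowing the size of every earlier partition, which matches the remark that the call might trigger a job.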
On Wed, Apr 23, 2014 at 9:46 AM, DB Tsai dbt...@stanford.edu wrote:
Hi Chengi,
If you just want to skip first
How big is each entry, and how much memory do you have on each
executor? You generated all data on driver and
sc.parallelize(bytesList) will send the entire dataset to a single
executor. You may run into I/O or memory issues. If the entries are
generated, you should create a simple RDD
?
On Wed, Apr 23, 2014 at 9:51 AM, Xiangrui Meng men...@gmail.com wrote:
If the first partition doesn't have enough records, then it may not
drop enough lines. Try
rddData.zipWithIndex().filter(_._2 >= 10L).map(_._1)
It might trigger a job.
Best,
Xiangrui
On Wed, Apr 23, 2014 at 9:46
PipedRDD is an RDD[String]. If you know how to parse each result line
into (key, value) pairs, then you can call reduce after.
piped.map(x => (key, value)).reduceByKey((v1, v2) => v)
-Xiangrui
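A minimal local sketch of the parse-then-reduce step, using plain Scala collections in place of the piped RDD. The tab-separated "key, integer value" line format and the summing reducer are assumptions made for illustration; the thread does not say what the external program prints.

```scala
object ParseAndReduce {
  // Assumed line format (not from the thread): "<key>\t<integer value>".
  def parse(line: String): (String, Int) = {
    val parts = line.split("\t", 2)
    (parts(0), parts(1).trim.toInt)
  }

  // Local analogue of piped.map(parse).reduceByKey(_ + _).
  def reduceByKeyLocal(lines: Seq[String]): Map[String, Int] =
    lines.map(parse).groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).sum }

  def main(args: Array[String]): Unit =
    println(reduceByKeyLocal(Seq("a\t1", "b\t2", "a\t3")))
}
```

On a real RDD the same parse function would be passed to map, and the reducer to reduceByKey, so only the parsing logic depends on the output of the piped command.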
On Wed, Apr 23, 2014 at 2:09 AM, zhxfl 291221...@qq.com wrote:
Hello,we know Hadoop-streaming is use
Could you share the command you used and more of the error message?
Also, is it an MLlib specific problem? -Xiangrui
On Thu, Apr 24, 2014 at 11:49 AM, John King
usedforprinting...@gmail.com wrote:
./spark-shell: line 153: 17654 Killed
$FWDIR/bin/spark-class org.apache.spark.repl.Main $@
Any
Is your Spark cluster running? Try to start with generating simple
RDDs and counting. -Xiangrui
On Thu, Apr 24, 2014 at 11:38 AM, John King
usedforprinting...@gmail.com wrote:
I receive this error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
File
RDD (~7 million lines) and mapping. Just
received this error when trying to classify.
On Thu, Apr 24, 2014 at 4:32 PM, Xiangrui Meng men...@gmail.com wrote:
Is your Spark cluster running? Try to start with generating simple
RDDs and counting. -Xiangrui
On Thu, Apr 24, 2014 at 11:38 AM, John
at 4:27 PM, Xiangrui Meng men...@gmail.com wrote:
Could you share the command you used and more of the error message?
Also, is it an MLlib specific problem? -Xiangrui
On Thu, Apr 24, 2014 at 11:49 AM, John King
usedforprinting...@gmail.com wrote:
./spark-shell: line 153: 17654 Killed
$FWDIR
= data.filter(isEmpty)
val points = empty.map(parsePoint)
points.cache()
val model = new NaiveBayes().run(points)
On Thu, Apr 24, 2014 at 6:57 PM, Xiangrui Meng men...@gmail.com wrote:
Do you mind sharing more code and error messages? The information you
provided is too little
the lines of code
mentioned in the error have anything to do with it?
On Thu, Apr 24, 2014 at 7:54 PM, Xiangrui Meng men...@gmail.com wrote:
I don't see anything wrong with your code. Could you do points.count()
to see how many training examples you have? Also, make sure you don't
have negative
How many labels does your dataset have? -Xiangrui
On Sat, Apr 26, 2014 at 6:03 PM, DB Tsai dbt...@stanford.edu wrote:
Which version of mllib are you using? For Spark 1.0, mllib will
support sparse feature vector which will improve performance a lot
when computing the distance between points
Hi Diana,
SparkALS is an example implementation of ALS. It doesn't call the ALS
algorithm implemented in MLlib. M, U, and F are used to generate
synthetic data.
I'm updating the examples. In the meantime, you can take a look at the
updated MLlib guide:
Those are warning messages instead of errors. You need to add
netlib-java:all to use native BLAS/LAPACK. But it won't work if you
include netlib-java:all in an assembly jar. It has to be a separate
jar when you submit your job. For SGD, we only use level-1 BLAS, so I
don't think native code is
Hi Deb, feel free to add accuracy along with precision and recall. -Xiangrui
On Mon, May 12, 2014 at 1:26 PM, Debasish Das debasish.da...@gmail.com wrote:
Hi,
I see precision and recall but no accuracy in mllib.evaluation.binary.
Is it already under development or it needs to be added ?
Which hadoop version did you use? I'm not sure whether Hadoop v2 fixes
the problem you described, but it does contain several fixes to bzip2
format. -Xiangrui
On Wed, May 7, 2014 at 9:19 PM, Andrew Ash and...@andrewash.com wrote:
Hi all,
Is anyone reading and writing to .bz2 files stored in
I don't know whether this would fix the problem. In v0.9, you need
`yarn-standalone` instead of `yarn-cluster`.
See
https://github.com/apache/spark/commit/328c73d037c17440c2a91a6c88b4258fbefa0c08
On Tue, May 13, 2014 at 11:36 PM, Xiangrui Meng men...@gmail.com wrote:
Does v0.9 support yarn
Could you try `println(result.toDebugString)` right after `val
result = ...` and attach the result? -Xiangrui
On Fri, May 9, 2014 at 8:20 AM, phoenix bai mingzhi...@gmail.com wrote:
after a couple of tests, I find that, if I use:
val result = model.predict(prdctpairs)
result.map(x =
If you check out the master branch, there are some examples that can
be used as templates under
examples/src/main/scala/org/apache/spark/examples/mllib
Best,
Xiangrui
On Wed, May 14, 2014 at 1:36 PM, yxzhao yxz...@ualr.edu wrote:
Hello,
I found the classification algorithms SVM and
Hi Andrew,
Could you try varying the minPartitions parameter? For example:
val r = sc.textFile("/user/aa/myfile.bz2", 4).count
val r = sc.textFile("/user/aa/myfile.bz2", 8).count
Best,
Xiangrui
On Tue, May 13, 2014 at 9:08 AM, Xiangrui Meng men...@gmail.com wrote:
Which hadoop version did you use
On Thu, May 15, 2014 at 3:48 PM, Xiangrui Meng men...@gmail.com wrote:
Hi Andrew,
Could you try varying the minPartitions parameter? For example:
val r = sc.textFile("/user/aa/myfile.bz2", 4).count
val r = sc.textFile("/user/aa/myfile.bz2", 8).count
Best,
Xiangrui
On Tue, May 13, 2014 at 9:08
Hi Andrew,
This is the JIRA I created:
https://issues.apache.org/jira/browse/MAPREDUCE-5893 . Hopefully
someone wants to work on it.
Best,
Xiangrui
On Fri, May 16, 2014 at 6:47 PM, Xiangrui Meng men...@gmail.com wrote:
Hi Andrew,
I could reproduce the bug with Hadoop 2.2.0. Some older version
, Xiangrui Meng men...@gmail.com wrote:
Which hadoop version did you use? I'm not sure whether Hadoop v2 fixes
the problem you described, but it does contain several fixes to bzip2
format. -Xiangrui
On Wed, May 7, 2014 at 9:19 PM, Andrew Ash and...@andrewash.com wrote:
Hi all,
Is anyone
You need to include breeze-natives or netlib:all to load the native
libraries. Check the log messages to ensure native libraries are used,
especially on the worker nodes. The easiest way to use OpenBLAS is
copying the shared library to /usr/lib/libblas.so.3 and
/usr/lib/liblapack.so.3. -Xiangrui
Try sc.wholeTextFiles(). It reads the entire file into a string
record. -Xiangrui
On Tue, May 20, 2014 at 8:25 AM, Nathan Kronenfeld
nkronenf...@oculusinfo.com wrote:
We are trying to read some large GraphML files to use in spark.
Is there an easy way to read XML-based files like this that
Many OutOfMemoryErrors in the log. Is your data distributed evenly? -Xiangrui
On Wed, May 21, 2014 at 11:23 AM, yxzhao yxz...@ualr.edu wrote:
I run the pagerank example processing a large data set, 5GB in size, using 48
machines. The job got stuck at the time point: 14/05/20 21:32:17, as the
If the RDD is cached, you can check its storage information in the
Storage tab of the Web UI.
On Wed, May 21, 2014 at 12:31 PM, yxzhao yxz...@ualr.edu wrote:
Thanks, Xiangrui. How can I check and make sure the data is distributed
evenly? Thanks again.
On Wed, May 21, 2014 at 2:17 PM, Xiangrui Meng
It doesn't guarantee the exact sample size. If you fix the random
seed, it would return the same result every time. -Xiangrui
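Both points can be illustrated with a small local sketch, assuming sampling without replacement is done by keeping each element independently with probability `fraction` (per-element Bernoulli sampling). This is a model for illustration on plain collections, not Spark's actual code, and the names are hypothetical.

```scala
import scala.util.Random

object BernoulliSample {
  // Keep each element independently with probability `fraction`.
  // The result size is random, so it is only close to fraction * xs.size.
  def sample[A](xs: Seq[A], fraction: Double, seed: Long): Seq[A] = {
    val rng = new Random(seed)
    xs.filter(_ => rng.nextDouble() < fraction)
  }

  def main(args: Array[String]): Unit = {
    val vertices = 1 to 1000
    val a = sample(vertices, 0.1, seed = 42L)
    val b = sample(vertices, 0.1, seed = 42L)
    val c = sample(vertices, 0.1, seed = 7L)
    // Same seed gives the same sample; sizes are near 100, not exactly 100.
    println(s"sizes: ${a.size}, ${b.size}, ${c.size}; a == b: ${a == b}")
  }
}
```

If an exact sample size is required, one option is to oversample slightly and then take the first n elements, at the cost of a small bias in which elements are kept.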
On Wed, May 21, 2014 at 2:05 PM, glxc r.ryan.mcc...@gmail.com wrote:
I have a graph and am trying to take a random sample of vertices without
replacement, using the
Was the error message the same as you posted when you used `root` as
the user id? Could you try this:
1) Do not specify user id. (Default would be `root`.)
2) If it fails in the middle, try `spark-ec2 --resume launch
cluster` to continue launching the cluster.
Best,
Xiangrui
On Thu, May
The documentation you looked at is not official, though it is from
@pwendell's website. It was for the Spark SQL release. Please find the
official documentation here:
http://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-support-vector-machine-svm
It contains a working example
Hi Tobias,
One hack you can try is:
rdd.mapPartitions(iter => {
  val x = new X()
  iter.map(row => x.doSomethingWith(row)) ++ { x.shutdown(); Iterator.empty }
})
Best,
Xiangrui
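The hack relies on the argument of `++` on a Scala Iterator being passed by name, so the block that calls shutdown() is only evaluated once the mapped iterator has been exhausted. A small local sketch of that behaviour, using plain iterators and a hypothetical X class standing in for the real resource:

```scala
object CleanupAfterIterator {
  // Hypothetical resource standing in for the X used in the snippet above.
  final class X {
    var closed = false
    def doSomethingWith(row: Int): Int = {
      require(!closed, "used after shutdown")
      row * 2
    }
    def shutdown(): Unit = { closed = true }
  }

  // Same shape as the function passed to mapPartitions: map lazily, then
  // run shutdown() when the consumer reaches the end of the mapped rows.
  def process(iter: Iterator[Int], x: X): Iterator[Int] =
    iter.map(row => x.doSomethingWith(row)) ++ { x.shutdown(); Iterator.empty }

  def main(args: Array[String]): Unit = {
    val x = new X
    val out = process(Iterator(1, 2, 3), x)
    println(x.closed)   // nothing has been consumed yet
    println(out.toList) // every row is processed before shutdown() runs
    println(x.closed)   // the iterator has been fully consumed
  }
}
```

One caveat of this design: shutdown() only runs if the consumer drains the iterator. If a downstream operation stops early (for example a take), the cleanup block is never evaluated.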
On Thu, May 29, 2014 at 11:38 PM, Tobias Pfeiffer t...@preferred.jp wrote:
Hi,
I want to use an object x in my RDD
Yes. MLlib 1.0 supports sparse input data for linear methods. -Xiangrui
On Mon, Jun 2, 2014 at 11:36 PM, praveshjain1991
praveshjain1...@gmail.com wrote:
I am not sure. I have just been using some numerical datasets.
--
View this message in context:
Hi Suela,
(Please subscribe to our user mailing list and send your questions there
in the future.) For your case, each file contains a column of numbers.
So you can use `sc.textFile` to read them first, zip them together,
and then create labeled points:
val xx = sc.textFile("/path/to/ex2x.dat").map(x
Did you try sc.stop()?
On Tue, Jun 3, 2014 at 9:54 PM, MEETHU MATHEW meethu2...@yahoo.co.in wrote:
Hi,
I want to know how I can stop a running SparkContext in a proper way so that
next time when I start a new SparkContext, the web UI can be launched on the
same port 4040. Now when I quit the
Could you check whether the vectors have the same size? -Xiangrui
On Wed, Jun 4, 2014 at 1:43 AM, bluejoe2008 bluejoe2...@gmail.com wrote:
what does this exception mean?
14/06/04 16:35:15 ERROR executor.Executor: Exception in task ID 6
java.lang.IllegalArgumentException: requirement failed
80M by 4 should be about 2.5GB uncompressed. 10 iterations shouldn't
take that long, even on a single executor. Besides what Matei
suggested, could you also verify the executor memory at
http://localhost:4040, in the Executors tab? It is very likely the
executors do not have enough memory. In that
Hi Krishna,
Specifying executor memory in local mode has no effect, because all of
the threads run inside the same JVM. You can either try
--driver-memory 60g or start a standalone server.
Best,
Xiangrui
On Wed, Jun 4, 2014 at 7:28 PM, Xiangrui Meng men...@gmail.com wrote:
80M by 4 should
For standalone and yarn mode, you need to install native libraries on all
nodes. The best solution is installing them to /usr/lib/libblas.so.3 and
/usr/lib/liblapack.so.3 . If your matrix is sparse, the native libraries cannot
help because they are for dense linear algebra. You can create RDD
At this time, you need to do one-vs-all manually for multiclass
training. For your second question, if the algorithm is implemented in
Java/Scala/Python and designed for single machine, you can broadcast
the dataset to each worker, train models on workers. If the algorithm
is implemented in a
Hi dlaw,
You are using breeze-0.8.1, but the spark assembly jar depends on
breeze-0.7. If the spark assembly jar comes first on the classpath
but the method from DenseMatrix is only available in breeze-0.8.1, you
get NoSuchMethod. So,
a) If you don't need the features in breeze-0.8.1, do not
Hi Tobias,
Which file system and which encryption are you using?
Best,
Xiangrui
On Sun, Jun 8, 2014 at 10:16 PM, Xiangrui Meng men...@gmail.com wrote:
Hi dlaw,
You are using breeze-0.8.1, but the spark assembly jar depends on
breeze-0.7. If the spark assembly jar comes first
For broadcast data, please read
http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables
.
For one-vs-all, please read
https://en.wikipedia.org/wiki/Multiclass_classification .
-Xiangrui
On Mon, Jun 9, 2014 at 7:24 AM, littlebird cxp...@163.com wrote:
Thank you for your
Could you try to click on that RDD and see the storage info per
partition? I tried continuously caching RDDs, so new ones kick old
ones out when there is not enough memory. I saw similar glitches but
the storage info per partition is correct. If you find a way to
reproduce this error, please
You can create tf vectors and then use
RowMatrix.computeColumnSummaryStatistics to get df (numNonzeros). For
tokenizer and stemmer, you can use scalanlp/chalk. Yes, it is worth
having a simple interface for it. -Xiangrui
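To make the df step concrete: document frequency for term j is the number of documents whose tf vector has a nonzero j-th entry, which is what a per-column nonzero count gives. A minimal local sketch on plain arrays, assuming dense tf vectors of equal length (the object and method names are hypothetical):

```scala
object DocumentFrequency {
  // Each row is a dense term-frequency vector over the same vocabulary.
  // df(j) = number of rows whose j-th entry is nonzero.
  def df(tf: Seq[Array[Double]]): Array[Long] = {
    require(tf.nonEmpty, "need at least one document")
    val counts = new Array[Long](tf.head.length)
    for (row <- tf; j <- row.indices if row(j) != 0.0) counts(j) += 1L
    counts
  }

  def main(args: Array[String]): Unit = {
    val tf = Seq(
      Array(2.0, 0.0, 1.0),
      Array(0.0, 0.0, 3.0),
      Array(1.0, 0.0, 0.0)
    )
    println(df(tf).mkString(", "))
  }
}
```

With a RowMatrix the same counts would come from the numNonzeros field of the column summary mentioned above, computed in a distributed pass instead of a local loop.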
On Fri, Jun 13, 2014 at 1:21 AM, Stuti Awasthi stutiawas...@hcl.com wrote:
1.
examples/src/main/scala/org/apache/spark/examples/mllib/BinaryClassification.scala
contains example code that shows how to set regParam.
2. A static method with more than 3 parameters becomes hard to
remember and hard to maintain. Please use LogisticRegressionWithSGD's
default constructor
as in the example you mentioned, but the
source code reveals that the intercept is also penalized if one is included,
which is usually inappropriate. The developer should fix this problem.
Best,
Congrui
-Original Message-
From: Xiangrui Meng [mailto:men...@gmail.com]
Sent: Friday, June 13, 2014
Hi Makoto,
How many partitions did you set? If there are too many partitions,
please do a coalesce before calling ML algorithms.
Btw, could you try the tree branch in my repo?
https://github.com/mengxr/spark/tree/tree
I used tree aggregate in this branch. It should help with the scalability.
Hi Jayati,
Thanks for asking! MLlib algorithms are all implemented in Scala. It
is easier for us to maintain them if we have the implementations in one
place. For the roadmap, please visit
http://www.slideshare.net/xrmeng/m-llib-hadoopsummit to see features
planned for v1.1. Before contributing new
Hi Bharath,
Thanks for posting the details! Which Spark version are you using?
Best,
Xiangrui
On Tue, Jun 17, 2014 at 6:48 AM, Bharath Ravi Kumar reachb...@gmail.com wrote:
Hi,
(Apologies for the long mail, but it's necessary to provide sufficient
details considering the number of issues
, where n is the number of partitions. It would be
great if someone can help test its scalability.
Best,
Xiangrui
On Tue, Jun 17, 2014 at 1:32 PM, Makoto Yui yuin...@gmail.com wrote:
Hi Xiangrui,
(2014/06/18 4:58), Xiangrui Meng wrote:
How many partitions did you set? If there are too many
Hi Makoto,
Are you using Spark 1.0 or 0.9? Could you go to the executor tab of
the web UI and check the driver's memory?
treeAggregate is not part of 1.0.
Best,
Xiangrui
On Tue, Jun 17, 2014 at 2:00 PM, Xiangrui Meng men...@gmail.com wrote:
Hi DB,
treeReduce (treeAggregate) is a feature I'm
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai
On Tue, Jun 17, 2014 at 2:00 PM, Xiangrui Meng men...@gmail.com wrote:
Hi DB,
treeReduce (treeAggregate) is a feature I'm testing now. It is a
compromise between current reduce and butterfly
Makoto, please use --driver-memory 8G when you launch spark-shell. -Xiangrui
On Tue, Jun 17, 2014 at 4:49 PM, Xiangrui Meng men...@gmail.com wrote:
DB, Yes, reduce and aggregate are linear.
Makoto, dense vectors are used in aggregation. If you have 32
partitions and each one sending
,Integer is
unrelated to mllib.
Thanks,
Bharath
On Wed, Jun 18, 2014 at 7:14 AM, Bharath Ravi Kumar reachb...@gmail.com
wrote:
Hi Xiangrui ,
I'm using 1.0.0.
Thanks,
Bharath
On 18-Jun-2014 1:43 am, Xiangrui Meng men...@gmail.com wrote:
Hi Bharath,
Thanks for posting the details
Denis, I think it is fine to have PLSA in MLlib. But I'm not familiar
with the modification you mentioned since the paper is new. We may
need to spend more time to learn the trade-offs. Feel free to create a
JIRA for PLSA and we can move our discussion there. It would be great
if you can share
It is because the frame size is not set correctly in the executor backend. See
SPARK-1112. We are going to fix it in v1.0.1. Did you try treeAggregate?
On Jun 19, 2014, at 2:01 AM, Makoto Yui yuin...@gmail.com wrote:
Xiangrui and Debasish,
(2014/06/18 6:33), Debasish Das wrote:
I did
This is a planned feature for v1.1. I'm going to work on it after v1.0.1
release. -Xiangrui
On Jun 20, 2014, at 6:46 AM, Charles Earl charles.ce...@gmail.com wrote:
Looking for something like scikit's grid search module.
C
Your data source is S3 and data is used twice. m1.large does not have very good
network performance. Please try file.count() and see how fast it goes. -Xiangrui
On Jun 20, 2014, at 8:16 AM, mathias math...@socialsignificance.co.uk wrote:
Hi there,
We're trying out Spark and are
Hi Kyle,
A few questions:
1) Did you use `setIntercept(true)`?
2) How many features?
I'm a little worried about driver's load because the final aggregation
and weights update happen on the driver. Did you check driver's memory
usage as well?
Best,
Xiangrui
On Fri, Jun 27, 2014 at 8:10 AM,
Try to use --executor-memory 12g with spark-submit. Or you can set it
in conf/spark-defaults.conf and rsync it to all workers and then
restart. -Xiangrui
On Fri, Jun 27, 2014 at 1:05 PM, Peng Cheng pc...@uow.edu.au wrote:
I give up, communication must be blocked by the complex EC2 network
Could you post the code snippet and the error stack trace? -Xiangrui
On Mon, Jun 30, 2014 at 7:03 AM, Daniel Micol dmi...@gmail.com wrote:
Hello,
I’m trying to use KMeans with MLLib but am getting a TaskNotSerializable
error. I’m using Spark 0.9.1 and invoking the KMeans.run method with k = 2
You were using an old version of numpy, 1.4? I think this is fixed in
the latest master. Try to replace vec.dot(target) by numpy.dot(vec,
target), or use the latest master. -Xiangrui
On Mon, Jun 30, 2014 at 2:04 PM, Sam Jacobs sam.jac...@us.abb.com wrote:
Hi,
I modified the example code for
You can use either bin/run-example or bin/spark-submit to run example
code. `scalac -d classes/ SparkKMeans.scala` fails because scalac does not
have Spark on its classpath. There are examples in the official doc:
http://spark.apache.org/docs/latest/quick-start.html#where-to-go-from-here
-Xiangrui
On Tue, Jul 1, 2014 at
Try to reduce the number of partitions to match the number of cores. We
will add treeAggregate to reduce the communication cost.
PR: https://github.com/apache/spark/pull/1110
-Xiangrui
On Tue, Jul 1, 2014 at 12:55 AM, Charles Li littlee1...@gmail.com wrote:
Hi Spark,
I am running LBFGS on our
We were not ready to expose it as a public API in v1.0. Both breeze
and MLlib are in rapid development. It would be possible to expose it
as a developer API in v1.1. For now, it should be easy to define a
toBreeze method in your own project. -Xiangrui
On Tue, Jul 1, 2014 at 12:17 PM, Koert
This is due to a bug in sampling, which was fixed in 1.0.1 and latest
master. See https://github.com/apache/spark/pull/1234 . -Xiangrui
On Wed, Jul 2, 2014 at 8:23 PM, x wasedax...@gmail.com wrote:
Hello,
I am a newbie to Spark MLlib and ran into a curious case when following the
instruction at
Hi Thunder,
Please understand that both MLlib and breeze are in active
development. Before v1.0, we used jblas but in the public APIs we only
exposed Array[Double]. In v1.0, we introduced Vector that supports
both dense and sparse data and switched the backend to
breeze/netlib-java (except ALS).
Hi Dmitriy,
It is sweet to have the bindings, but it is very easy to degrade the
performance with them. The BLAS/LAPACK APIs have been there for more
than 20 years and they are still the top choice for high-performance
linear algebra. I'm thinking about whether it is possible to make the
task ID 2
...
On Fri, Jul 4, 2014 at 5:52 AM, Xiangrui Meng men...@gmail.com wrote:
The feature dimension is small. You don't need a big akka.frameSize.
The default one (10M) should be sufficient. Did you cache the data
before calling LRWithSGD? -Xiangrui
On Thu, Jul 3, 2014 at 10:02 AM
with slave2.
2) The execution was successful when run in local mode with reduced number
of partitions. Does this imply issues communicating/coordinating across
processes (i.e. driver, master and workers)?
Thanks,
Bharath
On Sun, Jul 6, 2014 at 11:37 AM, Xiangrui Meng men...@gmail.com wrote
No, but it should be easy to add one. -Xiangrui
On Mon, Jul 7, 2014 at 12:37 AM, Ulanov, Alexander
alexander.ula...@hp.com wrote:
Hi,
Is there a method in Spark/MLlib to convert DenseVector to SparseVector?
Best regards, Alexander
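A minimal sketch of what such a conversion could look like, working on a plain Array[Double] so that it does not depend on any particular MLlib version. It returns the size, the indices of the nonzero entries, and their values, which are the three pieces a sparse vector factory such as Vectors.sparse(size, indices, values) takes. The object and method names are hypothetical.

```scala
object DenseToSparse {
  // Returns (size, indices, values) for the nonzero entries of a dense array.
  def toSparse(dense: Array[Double]): (Int, Array[Int], Array[Double]) = {
    val nz = dense.indices.filter(i => dense(i) != 0.0)
    (dense.length, nz.toArray, nz.map(i => dense(i)).toArray)
  }

  def main(args: Array[String]): Unit = {
    val (size, indices, values) = toSparse(Array(0.0, 1.5, 0.0, 0.0, 2.0))
    println(s"size=$size indices=${indices.mkString(",")} values=${values.mkString(",")}")
  }
}
```

The conversion only saves memory when most entries are zero, since a sparse entry stores an index in addition to its value.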
Hi Rahul,
We plan to add online model updates with Spark Streaming, perhaps in
v1.1, starting with linear methods. Please open a JIRA for Naive
Bayes. For Naive Bayes, we need to update the priors and conditional
probabilities, which means we should also remember the number of
observations for
Well, I believe this is a correct implementation but please let us
know if you run into problems. The NaiveBayes implementation in MLlib
v1.0 supports sparse data, which is usually the case for text
classification. I would recommend upgrading to v1.0. -Xiangrui
On Tue, Jul 8, 2014 at 7:20 AM,
try sbt/sbt clean first
On Tue, Jul 8, 2014 at 8:25 AM, bai阿蒙 smallmonkey...@hotmail.com wrote:
Hi guys,
When I try to compile the latest source with sbt/sbt compile, I get an error.
Can anyone help me?
The details follow; it may be caused by TestSQLContext.scala
[error]
[error]
1) The feature dimension should be a fixed number before you run
NaiveBayes. If you use bag of words, you need to handle the
word-to-index dictionary by yourself. You can either ignore the words
that never appear in training (because they have no effect in
prediction), or use hashing to randomly
You can either use sc.wholeTextFiles and then a flatMap to reduce the
number of partitions, or give more memory to the driver process by
using --driver-memory 20g and then call RDD.repartition(small number)
after you load the data in. -Xiangrui
On Mon, Jul 7, 2014 at 7:38 PM, innowireless TaeYun
, Xiangrui Meng men...@gmail.com wrote:
It seems to me to be a setup issue. I just tested news20.binary (1355191
features) on a 2-node EC2 cluster and it worked well. I added one line
to conf/spark-env.sh:
export SPARK_JAVA_OPTS=" -Dspark.akka.frameSize=20"
and launched spark-shell with --driver-memory
SparkKMeans is a naive implementation. Please use
mllib.clustering.KMeans in practice. I created a JIRA for this:
https://issues.apache.org/jira/browse/SPARK-2434 -Xiangrui
On Thu, Jul 10, 2014 at 2:45 AM, Tathagata Das
tathagata.das1...@gmail.com wrote:
I ran the SparkKMeans example (not the
news20.binary's feature dimension is 1.35M. So the serialized task
size is above the default limit 10M. You need to set
spark.akka.frameSize to, e.g., 20. Due to a bug, SPARK-1112, this
parameter is not passed to executors automatically, which causes Spark
to freeze. This was fixed in the latest
This is expensive but doable:
rdd.zipWithIndex().filter { case (_, idx) => idx >= 10 && idx < 20 }.collect()
-Xiangrui
On Thu, Jul 10, 2014 at 12:53 PM, Nick Chammas
nicholas.cham...@gmail.com wrote:
Interesting question on Stack Overflow:
http://stackoverflow.com/q/24677180/877069
Basically, is
You can load the dataset as an RDD of JSON objects and use a flatMap to
extract feature vectors at object level. Then you can filter the
training examples you want for binary classification. If you want to
try multiclass, check out DB's PR at
https://github.com/apache/spark/pull/1379
Best,
Xiangrui
You should return an iterator in mapPartitionsWithIndex. This is from
the programming guide
(http://spark.apache.org/docs/latest/programming-guide.html):
mapPartitionsWithIndex(func): Similar to mapPartitions, but also
provides func with an integer value representing the index of the
partition,
You need to set a larger `spark.akka.frameSize`, e.g., 128, for the
serialized weight vector. There is a JIRA about switching
automatically between sending through akka or broadcast:
https://issues.apache.org/jira/browse/SPARK-2361 . -Xiangrui
On Mon, Jul 14, 2014 at 12:15 AM, crater
Is it on a standalone server? There are several settings worth checking:
1) number of partitions, which should match the number of cores
2) driver memory (you can see it from the executor tab of the Spark
WebUI and set it with --driver-memory 10g)
3) the version of Spark you were running
Best,
Could you share the code of RecommendationALS and the complete
spark-submit command line options? Thanks! -Xiangrui
On Mon, Jul 14, 2014 at 11:23 PM, Srikrishna S srikrishna...@gmail.com wrote:
Using properties file: null
Main class:
RecommendationALS
Arguments:
_train.csv
_validation.csv