The C implementation of Word2Vec updates the model from multiple threads
without locking, which makes it hard to implement in a distributed way. In
the MLlib implementation, each worker holds the entire model in memory
and outputs the part of the model that gets updated. The driver still needs
to collect and
There is some serialization overhead. You can try
https://github.com/apache/spark/blob/master/python/pyspark/mllib/stat.py#L107
. -Xiangrui
On Wed, Jan 7, 2015 at 9:42 AM, rok rokros...@gmail.com wrote:
I have an RDD of SparseVectors and I'd like to calculate the means returning
a dense vector.
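For reference, a minimal Scala sketch of the colStats suggestion above
(assuming an existing SparkContext `sc`; the sample vectors are illustrative):

import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD

// colStats makes a single pass over the data and returns dense summaries
val vectors: RDD[Vector] = sc.parallelize(Seq(
  Vectors.sparse(3, Array(0), Array(1.0)),
  Vectors.sparse(3, Array(1, 2), Array(2.0, 4.0))))
val summary = Statistics.colStats(vectors)
println(summary.mean)  // dense vector of column means: [0.5,1.0,2.0]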
The Julia code is computing the SVD of the Gram matrix. PCA should be
applied to the covariance matrix. -Xiangrui
On Thu, Jan 8, 2015 at 8:27 AM, Upul Bandara upulband...@gmail.com wrote:
Hi All,
I tried to do PCA for the Iris dataset
[https://archive.ics.uci.edu/ml/datasets/Iris] using MLLib
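A minimal sketch of covariance-based PCA in MLlib (assuming an existing
SparkContext `sc`; the rows are illustrative Iris-like values, not the
actual dataset):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// computePrincipalComponents works from the covariance matrix, i.e. the
// columns are mean-centered first, unlike an SVD of the raw Gram matrix
val rows = sc.parallelize(Seq(
  Vectors.dense(5.1, 3.5, 1.4, 0.2),
  Vectors.dense(4.9, 3.0, 1.4, 0.2),
  Vectors.dense(6.3, 3.3, 6.0, 2.5)))
val mat = new RowMatrix(rows)
val pc = mat.computePrincipalComponents(2)  // top 2 principal components
val projected = mat.multiply(pc)            // rows projected onto the PCs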
Hi Antony, Is it easy for you to try Spark 1.3.0 or master? The ALS
performance should be improved in 1.3.0. -Xiangrui
On Fri, Feb 20, 2015 at 1:32 PM, Antony Mayi
antonym...@yahoo.com.invalid wrote:
Hi Ilya,
thanks for your insight, this was the right clue. I had default parallelism
already
huge?
On Wed, Feb 18, 2015 at 5:43 AM, Xiangrui Meng men...@gmail.com wrote:
Did you cache the data? Was it fully cached? The k-means
implementation doesn't create many temporary objects. I guess you need
more RAM to avoid GC being triggered frequently. Please monitor the memory
usage using
Try increasing the driver memory. We store trees on the driver node.
If maxDepth=20 and numTrees=50, you may need a lot of driver memory to
store all tree models. You might want to start with a smaller maxDepth
and then increase it and see whether deep trees really help (vs. the
cost). -Xiangrui
https://issues.apache.org/jira/browse/SPARK-5954 is for this issue and
Shuo is working on it. We will first implement topByKey for RDDs, and
then we can add it to DataFrames. -Xiangrui
On Mon, Mar 9, 2015 at 9:43 PM, Moss rhoud...@gmail.com wrote:
I do have a schemaRDD where I want to group by
The official guide may help:
http://spark.apache.org/docs/latest/tuning.html#garbage-collection-tuning
-Xiangrui
On Tue, Mar 17, 2015 at 8:27 AM, jatinpreet jatinpr...@gmail.com wrote:
Hi,
I am getting very high GC time in my jobs. For smaller/real-time load, this
becomes a real problem.
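The guide's first step is collecting GC statistics; these are the JVM flags
it suggests, passed here via spark-submit (a sketch; the jar name and the
rest of the submit arguments are application-specific):

spark-submit \
  --conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  your-app.jar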
Thanks,
Jay
On Mar 16, 2015, at 11:35 AM, Xiangrui Meng men...@gmail.com wrote:
Try this:
val ratings = purchase.map { line =>
  line.split(',') match { case Array(user, item, rate) =>
    (user.toInt, item.toInt, rate.toFloat)
  }
}.toDF("user", "item", "rate")
Doc for DataFrames:
http://spark.apache.org/docs/latest/sql-programming-guide.html
This is the default value (Double.MinValue) for invalid gain:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/model/InformationGainStats.scala#L67
Please ignore it. Maybe we should update `toString` to use scientific notation.
-Xiangrui
On Mon, Mar
Please check your classpath and make sure you don't have multiple
Spark versions deployed. If the classpath looks correct, please create
a JIRA for this issue. Thanks! -Xiangrui
On Tue, Mar 17, 2015 at 2:03 AM, Jeffrey Jedele
jeffrey.jed...@gmail.com wrote:
Hi all,
I'm trying to use the new LDA
that I needed to bug you :)
Jay
On Mar 17, 2015, at 11:48 AM, Xiangrui Meng men...@gmail.com wrote:
Please check this section in the user guide:
http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection
You need `import sqlContext.implicits._`.
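For example (a sketch; the file path and case class are hypothetical):

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._  // brings in the RDD-to-DataFrame conversions

case class Rating(user: Int, item: Int, rate: Float)
val ratings = sc.textFile("ratings.csv")
  .map(_.split(','))
  .map { case Array(u, i, r) => Rating(u.toInt, i.toInt, r.toFloat) }
val df = ratings.toDF()  // schema inferred from the case class by reflection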
17, 2015, at 11:53 AM, Xiangrui Meng men...@gmail.com wrote:
This is the default value (Double.MinValue) for invalid gain:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/model/InformationGainStats.scala#L67
Please ignore it. Maybe we should update
Su, which Spark version did you use? -Xiangrui
On Thu, Mar 19, 2015 at 3:49 AM, Akhil Das ak...@sigmoidanalytics.com wrote:
To get these metrics out, you need to open the driver ui running on port
4040. And in there you will see Stages information and for each stage you
can see how much time
You can compute the standard deviations of the training data using
Statistics.colStats and then compare them with model coefficients to
compute feature importance. -Xiangrui
On Fri, Mar 13, 2015 at 11:35 AM, Natalia Connolly
natalia.v.conno...@gmail.com wrote:
Hello,
While running an
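A sketch of that comparison, assuming `data` is an RDD[LabeledPoint] and
`model` is a fitted linear model:

import org.apache.spark.mllib.stat.Statistics

// scale each coefficient by its feature's standard deviation
val stats = Statistics.colStats(data.map(_.features))
val stddevs = stats.variance.toArray.map(math.sqrt)
val importance = model.weights.toArray.zip(stddevs).map {
  case (w, s) => math.abs(w) * s  // larger value = more influential feature
}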
Actually, they should be INFO or DEBUG. Line search steps are
expected. You can configure log4j.properties to ignore those. A better
solution would be to report this at
https://github.com/scalanlp/breeze/issues -Xiangrui
On Thu, Mar 12, 2015 at 5:46 PM, cjwang c...@cjwang.us wrote:
I am running
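Until then, the messages can be silenced in log4j.properties (a sketch,
assuming the warnings come from the breeze.optimize package):

log4j.logger.breeze.optimize=ERROR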
Try this:
val ratings = purchase.map { line =>
  line.split(',') match { case Array(user, item, rate) =>
    (user.toInt, item.toInt, rate.toFloat)
  }
}.toDF("user", "item", "rate")
Doc for DataFrames:
http://spark.apache.org/docs/latest/sql-programming-guide.html
-Xiangrui
On Mon, Mar 16, 2015 at 9:08 AM,
You need to change `== 1` to `== i`. `println(t)` happens on the
workers, which may not be what you want. Try the following:
noSets.filter(t => model.predict(Utils.featurize(t)) == i)
  .collect().foreach(println)
-Xiangrui
On Sat, Mar 7, 2015 at 3:20 PM, Pierce Lamb
richard.pierce.l...@gmail.com
cache() is lazy. The data is stored in memory only after the first time
it gets materialized. So the first time you call `predict` after you
load the model back from HDFS, it still takes time to load the actual
data. The second time will be much faster. Or you can call
`userJavaRDD.count()` and
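For example (a sketch; the HDFS path is hypothetical):

import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

val model = MatrixFactorizationModel.load(sc, "hdfs:///path/to/model")
model.userFeatures.cache().count()     // force the factors into memory
model.productFeatures.cache().count()  // so the first predict() call is fast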
Hi Xi,
Please create a JIRA if it takes longer to locate the issue. Did you
try a smaller k?
Best,
Xiangrui
On Thu, Mar 26, 2015 at 5:45 PM, Xi Shen davidshe...@gmail.com wrote:
Hi Burak,
After I added .repartition(sc.defaultParallelism), I can see from the log
the partition number is set
to get a similar
result in 1.3.
Sean and Shuo, which approach do you prefer? Do you know any existing
work discussing this?
Best,
Xiangrui
On Fri, Mar 27, 2015 at 11:27 AM, Xiangrui Meng men...@gmail.com wrote:
This sounds like a bug ... Did you try a different lambda? It would be
great if you
This PR updated the k-means|| initialization:
https://github.com/apache/spark/commit/ca7910d6dd7693be2a675a0d6a6fcc9eb0aaeb5d,
which was included in 1.3.0. It should fix k-means|| initialization with
large k. Please create a JIRA for this issue and send me the code and the
dataset to produce this
Hey Xi,
Have you tried Spark 1.3.0? The initialization happens on the driver node
and we fixed an issue with the initialization in 1.3.0. Again, please start
with a smaller k and increase it gradually. Let us know at what k the
problem happens.
Best,
Xiangrui
On Sat, Mar 28, 2015 at 3:11 AM,
You can extend Gradient, e.g.,
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/optimization/Gradient.scala#L266,
and use it in GradientDescent:
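A hedged sketch of such a subclass (a plain least-squares loss for
illustration; the class name is hypothetical):

import org.apache.spark.mllib.linalg.{DenseVector, Vector, Vectors}
import org.apache.spark.mllib.optimization.Gradient

class MyLeastSquaresGradient extends Gradient {
  override def compute(data: Vector, label: Double, weights: Vector): (Vector, Double) = {
    val x = data.toArray
    val w = weights.toArray
    val diff = x.zip(w).map { case (xi, wi) => xi * wi }.sum - label
    (Vectors.dense(x.map(_ * diff)), 0.5 * diff * diff)
  }
  // the four-argument variant accumulates into cumGradient and returns the loss
  override def compute(data: Vector, label: Double, weights: Vector,
      cumGradient: Vector): Double = {
    val (grad, loss) = compute(data, label, weights)
    val cum = cumGradient.asInstanceOf[DenseVector].values  // assumes a dense buffer
    grad.toArray.zipWithIndex.foreach { case (g, i) => cum(i) += g }
    loss
  }
}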
has
vectors of 200 dimensions.
It is possible people never tested the large-dimension case.
Thanks,
David
On Tue, Mar 31, 2015 at 4:00 AM Xiangrui Meng men...@gmail.com wrote:
Hi Xi,
Please create a JIRA if it takes longer to locate the issue. Did you
try a smaller k?
Best,
Xiangrui
as the input grew anyway.
So, basically I don't know anything more than you do, sorry!
On Tue, Mar 31, 2015 at 10:41 PM, Xiangrui Meng men...@gmail.com wrote:
Hey Sean,
That is true for the explicit model, but not for the implicit one. The ALS-WR
paper doesn't cover the implicit model. In the implicit formulation
to solve my problem?
Sendong Li
在 2015年3月31日,上午12:11,Xiangrui Meng men...@gmail.com 写道:
setCheckpointInterval was added in the current master and branch-1.3.
Please help check whether it works. It will be included in the 1.3.1 and
1.4.0 releases. -Xiangrui
On Mon, Mar 30, 2015 at 7:27 AM
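A sketch of the call, assuming Spark 1.3.1+ and `ratings: RDD[Rating]` (the
checkpoint directory is hypothetical):

import org.apache.spark.mllib.recommendation.ALS

sc.setCheckpointDir("hdfs:///tmp/checkpoints")
val model = new ALS()
  .setRank(10)
  .setIterations(50)
  .setCheckpointInterval(10)  // cut the lineage every 10 iterations
  .run(ratings)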
I cannot reproduce this error on master, but I'm not aware of any
recent bug fixes that are related. Could you build and try the current
master? -Xiangrui
On Tue, Mar 31, 2015 at 4:10 AM, Jaonary Rabarisoa jaon...@gmail.com wrote:
Hi all,
DataFrame with an user defined type (here mllib.Vector)
it comes to invariance. But FWIW I had always
understood the regularization to be multiplied by the number of
explicit ratings.
On Mon, Mar 30, 2015 at 5:51 PM, Xiangrui Meng men...@gmail.com wrote:
Okay, I didn't realize that I changed the behavior of lambda in 1.3.
to make it scale
Also try 1.3.0-RC1 or the current master. ALS should perform much
better in 1.3. -Xiangrui
On Tue, Mar 3, 2015 at 1:00 AM, Akhil Das ak...@sigmoidanalytics.com wrote:
You need to increase the parallelism/repartition the data to a higher number
to get rid of those.
Thanks
Best Regards
+user
On Wed, Mar 4, 2015, 8:21 AM Xiangrui Meng men...@gmail.com wrote:
You can use the save/load implementation in naive Bayes as reference:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala
Ping me on the JIRA page
libgfortran.x86_64 4.1.2-52.el5_8.1 comes with libgfortran.so.1 but
not libgfortran.so.3. JBLAS requires the latter. If you have root
access, you can try to install a newer version of libgfortran.
Otherwise, maybe you can try Spark 1.3, which doesn't use JBLAS in
ALS. -Xiangrui
On Tue, Mar 3,
You can use rdd.cartesian and then find the top k by key to distribute the
work across executors. There is a trick to boost performance: you need
to blockify user/product features and then use native matrix-matrix
multiplication. There is a relevant PR from Deb:
https://github.com/apache/spark/pull/3098 .
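A naive sketch without the blockified matrix-matrix trick, assuming a fitted
MatrixFactorizationModel `model` (expect it to be much slower, and note the
groupByKey shuffle):

// score every user/product pair, then keep the 10 best products per user
val scores = model.userFeatures.cartesian(model.productFeatures).map {
  case ((user, uf), (product, pf)) =>
    val score = uf.zip(pf).map { case (a, b) => a * b }.sum
    (user, (product, score))
}
val top10 = scores.groupByKey().mapValues(_.toSeq.sortBy(-_._2).take(10))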
Did you try using a smaller number of partitions (user/product blocks)?
Did you use implicit feedback? In the current implementation, we only
do checkpointing with implicit feedback. We should adopt the
checkpoint strategy implemented in LDA:
Which Spark version did you use? Btw, there are three datasets from
MovieLens. The tutorial used the medium one (1 million). -Xiangrui
On Mon, Feb 23, 2015 at 8:36 AM, poiuytrez guilla...@databerries.com wrote:
What do you mean?
Yes, we are going to expose the developer API. There was a long
discussion in the PR: https://github.com/apache/spark/pull/3637. So we
marked them package private and are looking for feedback on how to improve
it. Please implement your classes under `spark.ml` for now and let us
know your feedback.
Made 3 votes for each of the talks. Looking forward to seeing them at
Hadoop Summit :) -Xiangrui
On Tue, Feb 24, 2015 at 9:54 PM, Reynold Xin r...@databricks.com wrote:
Hi all,
The Hadoop Summit uses community choice voting to decide which talks to
feature. It would be great if the community could
= 1.0, and numIter = 20)
1.1.1 - RMSE = 1.335831 (rank = 8, lambda = 1.0, and numIter = 10)
Cheers
k/
On Mon, Feb 23, 2015 at 12:37 PM, Xiangrui Meng men...@gmail.com wrote:
Which Spark version did you use? Btw, there are three datasets from
MovieLens. The tutorial used the medium one (1
FYI, in 1.3 we support save/load for tree models in Scala and Java. We will add
save/load support to Python soon. -Xiangrui
On Mon, Feb 23, 2015 at 2:57 PM, Sebastián Ramírez
sebastian.rami...@senseta.com wrote:
In your log it says:
pickle.PicklingError: Can't pickle type 'thread.lock': it's not
Lisen, did you use all m-by-n pairs during training? The implicit model
penalizes unobserved ratings, while the explicit model doesn't. -Xiangrui
On Feb 26, 2015 6:26 AM, Sean Owen so...@cloudera.com wrote:
+user
On Thu, Feb 26, 2015 at 2:26 PM, Sean Owen so...@cloudera.com wrote:
I think I may
If you make `Image` a case class, then select("image.data") should work.
On Tue, Feb 24, 2015 at 3:06 PM, Jaonary Rabarisoa jaon...@gmail.com wrote:
Hi all,
I have a DataFrame that contains a user defined type. The type is an image
with the following attribute
class Image(w: Int, h: Int,
Btw, the correct syntax for alias should be
`df.select($"image.data".as("features"))`.
On Tue, Feb 24, 2015 at 3:35 PM, Xiangrui Meng men...@gmail.com wrote:
If you make `Image` a case class, then select("image.data") should work.
On Tue, Feb 24, 2015 at 3:06 PM, Jaonary Rabarisoa jaon...@gmail.com
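A sketch of both points (the Record wrapper type and the values are
hypothetical):

case class Image(w: Int, h: Int, data: Array[Byte])
case class Record(id: String, image: Image)

import sqlContext.implicits._
val df = sc.parallelize(Seq(
  Record("a", Image(2, 2, Array[Byte](0, 1, 2, 3))))).toDF()
val features = df.select($"image.data".as("features"))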
It may take some work to do online updates with an
MatrixFactorizationModel because you need to update some rows of the
user/item factors. You may be interested in spark-indexedrdd
(http://spark-packages.org/package/amplab/spark-indexedrdd).
We support save/load in Scala/Java. We are going to add
Try the following:
df.map { case Row(id: Int, num: Int, value: Double, x: Float) =>
  // replace those with your types
  (id, Vectors.dense(num, value, x))
}.toDF("id", "features")
-Xiangrui
On Thu, Feb 26, 2015 at 3:08 PM, mobsniuk mobsn...@gmail.com wrote:
I've been searching around and see others
It is not in MLlib. There is a JIRA for it:
https://issues.apache.org/jira/browse/SPARK-2336 and Ashutosh has an
implementation for integer values. -Xiangrui
On Thu, Feb 26, 2015 at 8:18 PM, Deep Pradhan pradhandeep1...@gmail.com wrote:
Has KNN classification algorithm been implemented on MLlib?
This is a PR in review to support ORC via the SQL data source API:
https://github.com/apache/spark/pull/3753. You can try pulling that PR
and help test it. -Xiangrui
On Wed, Mar 25, 2015 at 5:03 AM, Zsolt Tóth toth.zsolt@gmail.com wrote:
Hi,
I use sc.hadoopFile(directory,
Hi Martin,
Could you attach the code snippet and the stack trace? The default
implementation of some methods uses reflection, which may be the
cause.
Best,
Xiangrui
On Wed, Mar 25, 2015 at 3:18 PM, zapletal-mar...@email.cz wrote:
Thanks Peter,
I ended up doing something similar. I however
This sounds like a bug ... Did you try a different lambda? It would be
great if you can share your dataset or re-produce this issue on the
public dataset. Thanks! -Xiangrui
On Thu, Mar 26, 2015 at 7:56 AM, Ravi Mody rmody...@gmail.com wrote:
After upgrading to 1.3.0, ALS.trainImplicit() has been
We don't support warm starts or online updates for decision trees. So
if you call train twice, only the second dataset is used for training.
-Xiangrui
On Thu, Mar 5, 2015 at 12:31 PM, drarse drarse.a...@gmail.com wrote:
I am testing the Random Forest in Spark, but I have a question... If I train
Ravi, we just merged https://issues.apache.org/jira/browse/SPARK-6642
and used the same lambda scaling as in 1.2. The change will be
included in Spark 1.3.1, which will be released soon. Thanks for
reporting this issue! -Xiangrui
On Tue, Mar 31, 2015 at 8:53 PM, Xiangrui Meng men...@gmail.com
I think before 1.3 you would also hit a stack overflow problem in ~35
iterations. In 1.3.x, please use setCheckpointInterval to solve this
problem, which is available in the current master and 1.3.1 (to be
released soon). Btw, do you find 80 iterations are needed for
convergence? -Xiangrui
On Wed, Apr 1,
:18 PM, Xiangrui Meng men...@gmail.com wrote:
I cannot reproduce this error on master, but I'm not aware of any
recent bug fixes that are related. Could you build and try the current
master? -Xiangrui
On Tue, Mar 31, 2015 at 4:10 AM, Jaonary Rabarisoa jaon...@gmail.com
wrote:
Hi all
Correct. Prediction doesn't touch that code path. -Xiangrui
On Mon, Apr 13, 2015 at 9:58 AM, Jianguo Li flyingfromch...@gmail.com
wrote:
Hi,
In GeneralizedLinearAlgorithm, which Logistic Regression relies on, it
says if useFeatureScaling is enabled, we will standardize the training
Did you keep adding new files under the `train/` folder? What was the
exact warning message? -Xiangrui
On Fri, Apr 17, 2015 at 4:56 AM, barisak baris.akg...@gmail.com wrote:
Hi,
I wrote this code just to train the streaming linear regression, but I got a
'no data found' warning, so no weights were
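A sketch of the usual setup, assuming a StreamingContext `ssc` (paths and
dimensions are hypothetical). Note that textFileStream only picks up files
created after the stream starts, which is a common cause of "no data"
warnings:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}

val training = ssc.textFileStream("hdfs:///data/train").map(LabeledPoint.parse)
val model = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(3))
model.trainOn(training)
ssc.start()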
You can find the user guide for vector creation here:
http://spark.apache.org/docs/latest/mllib-data-types.html#local-vector.
-Xiangrui
On Mon, Apr 20, 2015 at 2:32 PM, Dan DeCapria, CivicScience
dan.decap...@civicscience.com wrote:
Hi Spark community,
I'm very new to the Apache Spark
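From that guide, the two basic constructors:

import org.apache.spark.mllib.linalg.Vectors

val dense = Vectors.dense(1.0, 0.0, 3.0)
// the same vector in sparse form: size, indices of the non-zeros, values
val sparse = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))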
Could you attach the full stack trace? Please also include the stack
trace from executors, which you can find on the Spark WebUI. -Xiangrui
On Thu, Apr 16, 2015 at 1:00 PM, riginos samarasrigi...@gmail.com wrote:
I have a big dataset of categories of cars and descriptions of cars. So I
want to
You should check where MyDenseVectorUDT is defined and whether it was
on the classpath (or in the assembly jar) at runtime. Make sure the
full class name (with package name) is used. Btw, UDTs are not public
yet, so please use them with caution. -Xiangrui
On Fri, Apr 17, 2015 at 12:45 AM, Jaonary
SchemaRDD subclasses RDD in 1.2, but DataFrame is no longer an RDD in
1.3. We should allow DataFrames in ALS.train. I will submit a patch.
You can use `ALS.train(training.rdd, ...)` for now as a workaround.
-Xiangrui
On Tue, Apr 21, 2015 at 10:51 AM, Joseph Bradley jos...@databricks.com wrote:
This is the size of the serialized task closure. Is stage 246 part of
ALS iterations, or something before or after it? -Xiangrui
On Tue, Apr 21, 2015 at 10:36 AM, Christian S. Perone
christian.per...@gmail.com wrote:
Hi Sean, thanks for the answer. I tried to call repartition() on the input
If by C you mean the parameter C in LIBLINEAR, the corresponding
parameter in MLlib is regParam:
https://github.com/apache/spark/blob/master/python/pyspark/mllib/classification.py#L273,
while regParam = 1/C. -Xiangrui
On Wed, Apr 22, 2015 at 3:25 PM, Pagliari, Roberto
rpagli...@appcomsci.com
The patched was merged and it will be included in 1.3.2 and 1.4.0.
Thanks for reporting the bug! -Xiangrui
On Tue, Apr 21, 2015 at 2:51 PM, ayan guha guha.a...@gmail.com wrote:
Thank you all.
On 22 Apr 2015 04:29, Xiangrui Meng men...@gmail.com wrote:
SchemaRDD subclasses RDD in 1.2
ALS.setCheckpointInterval was added in Spark 1.3.1. You need to
upgrade Spark to use this feature. -Xiangrui
On Wed, Apr 22, 2015 at 9:03 PM, amghost zhengweita...@outlook.com wrote:
Hi, could you please explain how to checkpoint the training set RDD, since
everything is done inside the ALS.train method?
Please try reducing the step size. The native BLAS library is not
required. -Xiangrui
On Tue, Apr 21, 2015 at 5:15 AM, Staffan staffan.arvids...@gmail.com wrote:
Hi,
I've written an application that performs some machine learning on some
data. I've validated that the data _should_ give a good
We store the vectors on the driver node. So it is hard to handle a
really large vocabulary. You can use setMinCount to filter out
infrequent words and reduce the model size. -Xiangrui
On Wed, Apr 22, 2015 at 12:32 AM, gm yu husty...@gmail.com wrote:
When using MLlib's Word2Vec, I met the following
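A sketch of the minCount knob (the value is hypothetical; corpus is an
RDD[Seq[String]]):

import org.apache.spark.mllib.feature.Word2Vec

val model = new Word2Vec()
  .setVectorSize(100)
  .setMinCount(25)  // raise from the default of 5 to shrink the vocabulary
  .fit(corpus)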
Having ordered indices is a contract of SparseVector:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.SparseVector.
We do not verify it for performance. -Xiangrui
On Wed, Apr 22, 2015 at 8:26 AM, yaochunnan yaochun...@gmail.com wrote:
Hi all,
I am using
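For example (the contract is not checked, so a violation fails silently or
produces wrong results later):

import org.apache.spark.mllib.linalg.Vectors

val ok  = Vectors.sparse(4, Array(0, 2, 3), Array(1.0, 2.0, 3.0))
val bad = Vectors.sparse(4, Array(2, 0), Array(2.0, 1.0))  // unordered indices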
to see there is
error
What is the feature dimension? Did you set the driver memory? -Xiangrui
On Tue, Apr 21, 2015 at 6:59 AM, rok rokros...@gmail.com wrote:
I'm trying to use the StandardScaler in pyspark on a relatively small (a few
hundred Mb) dataset of sparse vectors with 800k features. The fit method of
How did you run the example app? Did you use spark-submit? -Xiangrui
On Thu, Apr 23, 2015 at 2:27 PM, Su She suhsheka...@gmail.com wrote:
Sorry, accidentally sent the last email before finishing.
I had asked this question before, but wanted to ask again as I think
it is now related to my pom
)
org.apache.spark.ml.recommendation.ALS$.train(ALS.scala:527)
org.apache.spark.mllib.recommendation.ALS.run(ALS.scala:203)
On Thu, Apr 23, 2015 at 2:49 AM, Xiangrui Meng men...@gmail.com wrote:
This is the size of the serialized task closure. Is stage 246 part of
ALS iterations, or something
-- I have to run in client mode so
I can't set spark.driver.memory -- I've tried setting the
spark.yarn.am.memory and overhead parameters but it doesn't seem to have an
effect.
Thanks,
Rok
On Apr 23, 2015, at 7:47 AM, Xiangrui Meng men...@gmail.com wrote:
What is the feature dimension
We will try to make them available in 1.4, which is coming soon. -Xiangrui
On Thu, Apr 23, 2015 at 10:18 PM, Pagliari, Roberto
rpagli...@appcomsci.com wrote:
I know grid search with cross validation is not supported. However, I was
wondering if there is something available for the time being.
Did you set `--driver-memory` with spark-submit? -Xiangrui
On Mon, May 4, 2015 at 5:16 PM, Vinay Muttineni vmuttin...@ebay.com wrote:
Hi, I am training a GMM with 10 gaussians on a 4 GB dataset(720,000 * 760).
The spark (1.3.1) job is allocated 120 executors with 6GB each and the
driver also
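A sketch of setting it at submit time (class and jar names are hypothetical):

spark-submit \
  --master yarn \
  --driver-memory 8g \
  --executor-memory 6g \
  --class com.example.TrainGMM \
  my-app.jar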
Reducing the number of instances won't help in this case. We use the
driver to collect partial gradients. Even with tree aggregation, it
still puts heavy workload on the driver with 20M features. Please try
to reduce the number of partitions before training. We are working on
a more scalable
in how the JVM is created. No matter
which memory settings I specify, the JVM for the driver is always created with
512MB of memory. So I'm not sure if this is a feature or a bug?
rok
On Mon, Apr 27, 2015 at 6:54 PM, Xiangrui Meng men...@gmail.com wrote:
You might need to specify driver memory
LogisticRegressionWithSGD doesn't support multi-class. Please use
LogisticRegressionWithLBFGS instead. -Xiangrui
On Mon, Apr 27, 2015 at 12:37 PM, Pagliari, Roberto
rpagli...@appcomsci.com wrote:
With the Python APIs, the available arguments I got (using inspect module)
are the following:
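For reference, the Scala form of the suggested alternative (a sketch;
`training` is an RDD[LabeledPoint] with labels 0..k-1):

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(10)  // multinomial over 10 classes
  .run(training)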
?
Thanks,
Jay
On Apr 9, 2015, at 4:38 PM, Xiangrui Meng men...@gmail.com wrote:
Could you share ALSNew.scala? Which Scala version did you use? -Xiangrui
On Wed, Apr 8, 2015 at 4:09 PM, Jay Katukuri jkatuk...@apple.com wrote:
Hi Xiangrui,
I tried running this on my local machine
CC Leah, who added Bernoulli option to MLlib's NaiveBayes. -Xiangrui
On Wed, Apr 15, 2015 at 4:49 AM, 姜林和 linhe_ji...@163.com wrote:
Dear Meng:
Thanks for the great work on Spark machine learning. I saw the changes
to the NaiveBayes algorithm, separating it into multinomial
If you are using Scala/Java or
pyspark.mllib.classification.LogisticRegressionModel, you should be
able to call weights and intercept to get the model coefficients. If
you are using the pipeline API in Python, you can try
model._java_model.weights(); we are going to add a method to get the
weights
If you want to see an example that calls MLlib's FPGrowth, you can
find them under the examples/ folder:
Scala:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/FPGrowthExample.scala,
Java:
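Condensed from the linked Scala example (a sketch; the transactions are
illustrative):

import org.apache.spark.mllib.fpm.FPGrowth

val transactions = sc.parallelize(Seq(
  Array("a", "b", "c"), Array("a", "b"), Array("b", "c")))
val model = new FPGrowth()
  .setMinSupport(0.5)
  .setNumPartitions(4)
  .run(transactions)
model.freqItemsets.collect().foreach { fi =>
  println(fi.items.mkString("[", ",", "]") + " -> " + fi.freq)
}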
tried passing the spark-sql jar using the -jar
spark-sql_2.11-1.3.0.jar
Thanks,
Jay
On Mar 17, 2015, at 12:50 PM, Xiangrui Meng men...@gmail.com wrote:
Please remember to copy the user list next time. I might not be able
to respond quickly. There are many others who can help or who can
Did you try to treat RDD[(Double, Vector)] as RDD[LabeledPoint]? If
that is the case, you need to convert them explicitly:
rdd.map { case (label, features) => LabeledPoint(label, features) }
-Xiangrui
On Mon, Apr 6, 2015 at 11:59 AM, Joanne Contact joannenetw...@gmail.com wrote:
Hello Sparkers,
Before OneHotEncoder or LabelIndexer is merged, you can define a UDF
to do the mapping.
val labelToIndex = udf { ... }
featureDF.withColumn("f3_dummy", labelToIndex(col("f3")))
See instructions here
We support sparse vectors in MLlib; it recognizes MLlib's sparse
vector and SciPy's csc_matrix with a single column. You can create an RDD
of sparse vectors for your data and save/load them to/from parquet
format using dataframes. Sparse matrix support will be added in 1.4.
-Xiangrui
On Mon,
)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Thanks,
Jay
On Apr 6, 2015, at 12:24 PM, Xiangrui Meng men...@gmail.com wrote:
Please attach the full stack trace. -Xiangrui
On Mon, Apr 6, 2015 at 12:06 PM, Jay
In 1.3, you can use model.save(sc, "hdfs path"). You can check the
code examples here:
http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html#examples.
-Xiangrui
On Fri, Apr 3, 2015 at 2:17 PM, Justin Yip yipjus...@prediction.io wrote:
Hello Zhou,
You can look at the
Could you try `sbt package` or `sbt compile` and see whether there are
errors? It seems that you haven't reached the ALS code yet. -Xiangrui
On Sat, Apr 4, 2015 at 5:06 AM, Phani Yadavilli -X (pyadavil)
pyada...@cisco.com wrote:
Hi ,
I am trying to run the following command in the Movie
Sorry, it should be toDF("text", "id").
On Sun, Apr 5, 2015 at 9:21 PM, Xiangrui Meng men...@gmail.com wrote:
Try: sc.textFile("path/file").zipWithIndex().toDF("id", "text") -Xiangrui
On Sun, Apr 5, 2015 at 7:50 PM, olegshirokikh o...@solver.com wrote:
What would be the most efficient neat method
Try: sc.textFile("path/file").zipWithIndex().toDF("id", "text") -Xiangrui
On Sun, Apr 5, 2015 at 7:50 PM, olegshirokikh o...@solver.com wrote:
What would be the most efficient neat method to add a column with row ids to
dataframe?
I can think of something as below, but it completes with errors (at
This could be empirically verified in spark-perf:
https://github.com/databricks/spark-perf. Theoretically, it would be
2x for k-means and logistic regression, because computation is doubled
but communication cost remains the same. -Xiangrui
On Tue, Apr 7, 2015 at 7:15 AM, Vasyl Harasymiv
The documentation needs to be updated to state that higher metric
values are better (https://issues.apache.org/jira/browse/SPARK-7740).
I don't know why, if you negate the return value of the Evaluator, you
still get the highest regularization parameter candidate. Maybe you
should check the log
With vocabulary size 4M and vector size 400, you need 4M * 400 = 1.6B
floats to store the model. That is 6.4GB at 4 bytes per float. We store
the model on the driver node in the current implementation, so I don't
think it would work. You might try increasing the minCount to decrease
the vocabulary size and reduce the
Thanks for asking! We should improve the documentation. The sample
dataset is actually mimicking the MNIST digits dataset, where the
values are gray levels (0-255). So by dividing by 16, we want to map
it to 16 coarse bins for the gray levels. Actually, there is a bug in
the doc; we should convert
You need to convert DataFrame to RDD, call sampleByKey, and then apply
the schema back to create DataFrame.
val df: DataFrame = ...
val schema = df.schema
val sampledRDD = df.rdd.keyBy(r => r.getAs[Int](0)).sampleByKey(...).values
val sampled = sqlContext.createDataFrame(sampledRDD, schema)
:
MyDenseVectorUDT does exist in the assembly jar, and in this example all the
code is in a single file to make sure everything is included.
On Tue, Apr 21, 2015 at 1:17 AM, Xiangrui Meng men...@gmail.com wrote:
You should check where MyDenseVectorUDT is defined and whether it was
on the classpath
We use a dense array to store the covariance matrix on the driver
node. So its length is limited by the integer range, which is 65536 *
65536 (actually half of that). -Xiangrui
On Wed, May 13, 2015 at 1:57 AM, Sebastian Alfers
sebastian.alf...@googlemail.com wrote:
Hello,
in order to compute a huge
I'm not sure whether k-means would converge with this customized
distance measure. You can list (weighted) time as a feature along with
coordinates, and then use Euclidean distance. For other supported
distance measures, you can check Derrick's package:
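A sketch of folding weighted time into the features, assuming
points: RDD[(Double, Double, Double)] (timeWeight is a hypothetical
tuning knob):

import org.apache.spark.mllib.linalg.Vectors

val timeWeight = 0.5
val features = points.map { case (x, y, t) =>
  Vectors.dense(x, y, timeWeight * t)  // Euclidean k-means now "sees" time
}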
Just curious, what distance measure do you need? -Xiangrui
On Mon, May 11, 2015 at 8:28 AM, Jaonary Rabarisoa jaon...@gmail.com wrote:
take a look at this
https://github.com/derrickburns/generalized-kmeans-clustering
Best,
Jao
On Mon, May 11, 2015 at 3:55 PM, Driesprong, Fokko
For ML applications, the best setting is to match the number of
partitions to the number of cores to reduce shuffle size. You have 3072
partitions but 128 executors, which causes the overhead. For the
MultivariateOnlineSummarizer, we plan to add flags to specify what
need to be computed to reduce
Spark SQL doesn't provide spatial features. Large-scale KNN is usually
combined with locality-sensitive hashing (LSH). This Spark package may
be helpful: http://spark-packages.org/package/mrsqueeze/spark-hash.
-Xiangrui
On Sat, May 9, 2015 at 9:25 PM, Dong Li lid...@lidong.net.cn wrote:
Hello
MLlib only supports Euclidean distance for k-means. You can find
Bregman divergence support in Derrick's package:
http://spark-packages.org/package/derrickburns/generalized-kmeans-clustering.
Which distance measure do you want to use? -Xiangrui
On Tue, May 12, 2015 at 7:23 PM, June