You need to use Maven to include the Python files. See
https://github.com/apache/spark/pull/1223 . -Xiangrui
On Wed, Nov 12, 2014 at 4:48 PM, jamborta wrote:
> I have figured out that building the fat jar with sbt does not seem to
> include the pyspark scripts, using the following command:
>
> sbt/sb
If you use the Kryo serializer, you need to register mutable.BitSet and Rating:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala#L102
The JIRA was marked resolved because chill resolved the problem in
v0.4.0 and we have this workaro
If Spark is not installed on the client side, you won't be able to
deserialize the model. Instead of serializing the model object, you
may serialize the model weights array and implement predict on the
client side. -Xiangrui
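For illustration only, a minimal Scala sketch of that idea, assuming `model` is a
linear model (e.g. a LinearRegressionModel) and using plain Java serialization;
the file name is a placeholder:

import java.io.{FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}

// Spark side: persist only the weights and intercept, not the model object.
val weights: Array[Double] = model.weights.toArray
val intercept: Double = model.intercept
val out = new ObjectOutputStream(new FileOutputStream("model-weights.bin"))
out.writeObject((weights, intercept))
out.close()

// Client side (no Spark on the classpath): read the arrays back and predict by hand.
val in = new ObjectInputStream(new FileInputStream("model-weights.bin"))
val (w, b) = in.readObject().asInstanceOf[(Array[Double], Double)]
in.close()
def predict(features: Array[Double]): Double =
  w.zip(features).map { case (wi, xi) => wi * xi }.sum + b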
On Fri, Nov 14, 2014 at 2:54 PM, xiaoyan yu wrote:
> I had the same need
This is a bug. Could you make a JIRA? -Xiangrui
On Sat, Nov 15, 2014 at 3:27 AM, lev wrote:
> Hi,
>
> I'm having trouble using both zipWithIndex and repartition. When I use them
> both, the following action will get stuck and won't return.
> I'm using spark 1.1.0.
>
>
> Those 2 lines work as expe
I think I understand where the bug is now. I created a JIRA
(https://issues.apache.org/jira/browse/SPARK-4433) and will make a PR
soon. -Xiangrui
On Sat, Nov 15, 2014 at 7:39 PM, Xiangrui Meng wrote:
> This is a bug. Could you make a JIRA? -Xiangrui
>
> On Sat, Nov 15, 2014 at 3:2
PR: https://github.com/apache/spark/pull/3291 . For now, here is a workaround:
val a = sc.parallelize(1 to 10).zipWithIndex()
a.partitions // call .partitions explicitly
a.repartition(10).count()
Thanks for reporting the bug! -Xiangrui
On Sat, Nov 15, 2014 at 8:38 PM, Xiangrui Meng wrote
How many features and how many partitions? You set kmeans_clusters to
1. If the feature dimension is large, it would be really
expensive. You can check the WebUI and see task failures there. The
stack trace you posted is from the driver. Btw, the total memory you
have is 64GB * 10, so you can c
The data is in LIBSVM format. So this line won't work:
values = [float(s) for s in line.split(' ')]
Please use the util function in MLUtils to load it as an RDD of LabeledPoint.
http://spark.apache.org/docs/latest/mllib-data-types.html#labeled-point
from pyspark.mllib.util import MLUtils
examples = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
KMeansModel is serializable. So you can use Java serialization, try
sc.parallelize(Seq(model)).saveAsObjectFile(outputDir)
sc.objectFile[KMeansModel](outputDir).first()
We will try to address model export/import more formally in 1.3, e.g.,
https://www.github.com/apache/spark/pull/3062
-Xiangrui
Try building Spark with -Pnetlib-lgpl, which includes the JNI library
in the Spark assembly jar. This is the simplest approach. If you want
to include it as part of your project, make sure the library is inside
the assembly jar or you specify it via `--jars` with spark-submit.
-Xiangrui
On Mon, No
In 1.2, we added streaming k-means:
https://github.com/apache/spark/pull/2942 . -Xiangrui
On Mon, Nov 24, 2014 at 5:25 PM, Joanne Contact wrote:
> Thank you Tobias!
>
> On Mon, Nov 24, 2014 at 5:13 PM, Tobias Pfeiffer wrote:
>>
>> Hi,
>>
>> On Tue, Nov 25, 2014 at 9:40 AM, Joanne Contact
>> wro
There is a simple example here:
https://github.com/apache/spark/blob/master/examples/src/main/python/kmeans.py
. You can take advantage of sparsity by computing the distance via
inner products:
http://spark-summit.org/2014/talk/sparse-data-support-in-mllib-2
-Xiangrui
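To make the inner-product trick concrete, here is a small Scala sketch (not taken
from the linked talk); it assumes the squared norms of points and centers are
precomputed:

import org.apache.spark.mllib.linalg.{DenseVector, SparseVector}

// ||x - c||^2 = ||x||^2 + ||c||^2 - 2 * <x, c>; only the dot product touches the sparse entries.
def sparseDot(x: SparseVector, c: DenseVector): Double = {
  var s = 0.0
  var i = 0
  while (i < x.indices.length) {
    s += x.values(i) * c(x.indices(i))
    i += 1
  }
  s
}

def squaredDistance(x: SparseVector, xNormSq: Double, c: DenseVector, cNormSq: Double): Double =
  xNormSq + cNormSq - 2.0 * sparseDot(x, c)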
On Tue, Nov 25, 2014 at 2:39
It is data-dependent, and hence needs hyper-parameter tuning, e.g.,
grid search. The first batch is certainly expensive. But after you
figure out a small range for each parameter that fits your data,
following batches should be not that expensive. There is an example
from AMPCamp:
http://ampcamp.b
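As an illustration of such a grid search (a sketch only; it is written against the
ALS recommendation API rather than the AMPCamp code, and `training`, `validation`,
and the parameter grid are placeholders):

import org.apache.spark.SparkContext._
import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}
import org.apache.spark.rdd.RDD

def rmse(model: MatrixFactorizationModel, data: RDD[Rating]): Double = {
  val predictions = model.predict(data.map(r => (r.user, r.product)))
    .map(p => ((p.user, p.product), p.rating))
  val joined = data.map(r => ((r.user, r.product), r.rating)).join(predictions)
  math.sqrt(joined.values.map { case (r, p) => (r - p) * (r - p) }.mean())
}

// Coarse grid first; refine around the best (rank, lambda) found.
val candidates = for {
  rank <- Seq(10, 20, 50)
  lambda <- Seq(0.01, 0.1, 1.0)
} yield (rank, lambda, rmse(ALS.train(training, rank, 20, lambda), validation))

val (bestRank, bestLambda, bestRmse) = candidates.minBy(_._3)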
Besides API stability concerns, models constructed directly by users
rather than returned by ALS may not work well. The userFeatures and
productFeatures RDDs both carry partitioners so we can perform quick
lookups for prediction. If you save userFeatures and productFeatures
and load them back, it is
Sent out a PR to make the constructor public and leave a note in the
doc: https://github.com/apache/spark/pull/3459 . If you load
userFeatures and productFeatures back and want to make predictions on
individual records, it might be useful to call
partitionBy(...).cache() on the userFeatures and pro
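For illustration, the lookup pattern could look like this (a sketch;
`loadedUserFeatures`/`loadedProductFeatures` stand for the factor RDDs you loaded
back, and the partition count is arbitrary):

import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._

val userFactors = loadedUserFeatures        // RDD[(Int, Array[Double])]
  .partitionBy(new HashPartitioner(100))
  .cache()
val productFactors = loadedProductFeatures  // RDD[(Int, Array[Double])]
  .partitionBy(new HashPartitioner(100))
  .cache()

// Predict a single (user, product) pair with a dot product of the two factor vectors.
def predict(user: Int, product: Int): Double = {
  val x = userFactors.lookup(user).head
  val y = productFactors.lookup(product).head
  x.zip(y).map { case (a, b) => a * b }.sum
}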
The training RMSE may increase due to regularization. Squared loss
only represents part of the global loss. If you watch the sum of the
squared loss and the regularization, it should be non-increasing.
-Xiangrui
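In symbols, the non-increasing quantity is (schematically, ignoring the exact
per-user/per-item weighting of lambda used by MLlib)

  L = \sum_{(u,i) \in \Omega} (r_{ui} - x_u^T y_i)^2
      + \lambda ( \sum_u ||x_u||^2 + \sum_i ||y_i||^2 ),

while the first term alone, and hence the training RMSE, can go up as lambda
pulls the factors toward zero.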
On Wed, Nov 26, 2014 at 9:53 AM, Sean Owen wrote:
> I also modified the example to tr
Hi Bharath,
You can try setting a smaller number of item blocks in this case. 1200 is
definitely too large for ALS. Please try 30 or even smaller. I'm not
sure whether this could solve the problem because you have 100 items
connected with 10^8 users. There is a JIRA for this issue:
https://issues.apache.org/
You can use the default toString method to get the string
representation. If you want to customize it, check the indices/values
fields. -Xiangrui
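For example (hypothetical values), in Scala:

import org.apache.spark.mllib.linalg.{SparseVector, Vectors}

val v = Vectors.sparse(5, Array(1, 3), Array(0.5, 2.0))
println(v)  // default toString prints (5,[1,3],[0.5,2.0])

v match {
  case sv: SparseVector =>
    sv.indices.zip(sv.values).foreach { case (i, x) => println(s"$i -> $x") }
  case dv => println(dv.toArray.mkString(", "))
}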
On Fri, Dec 5, 2014 at 7:32 AM, debbie wrote:
> Basic question:
>
> What is the best way to loop through one of these and print their
> components? Conve
If you want to train offline and predict online, you can use the
current LR implementation to train a model and then apply
model.predict on the dstream. -Xiangrui
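A rough sketch of that train-offline / predict-online pattern (the socket source,
the parsing, and `trainingData` are placeholders):

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Offline: train once on a static RDD[LabeledPoint].
val model = LogisticRegressionWithSGD.train(trainingData, 100)

// Online: score every record that arrives on the stream.
val ssc = new StreamingContext(sc, Seconds(10))
val predictions = ssc.socketTextStream("localhost", 9999)
  .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
  .map(features => model.predict(features))
predictions.print()
ssc.start()
ssc.awaitTermination()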
On Sun, Dec 7, 2014 at 6:30 PM, Nasir Khan wrote:
> I am new to spark.
> Lets say i want to develop a machine learning model. which tr
Is it possible that after filtering the feature dimension changed?
This may happen if you use LIBSVM format but didn't specify the number
of features. -Xiangrui
On Tue, Dec 9, 2014 at 4:54 AM, Sameer Tilak wrote:
> Hi All,
>
>
> I was able to run LinearRegressionwithSGD for a largeer dataset (> 2
Please check the number of partitions after sc.textFile. Use
sc.textFile('...', 8) to have at least 8 partitions. -Xiangrui
On Tue, Dec 9, 2014 at 4:58 AM, DB Tsai wrote:
> You just need to use the latest master code without any configuration
> to get performance improvement from my PR.
>
> Since
Could you post the full stacktrace? It seems to be some recursive call
in parsing. -Xiangrui
On Tue, Dec 9, 2014 at 7:44 PM, wrote:
> Hi
>
>
>
> I am getting Stack overflow Error
>
> Exception in main java.lang.stackoverflowerror
>
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.sc
On Sun, Dec 14, 2014 at 3:06 AM, Saurabh Agrawal
wrote:
>
>
> Hi,
>
>
>
> I am a new bee in spark and scala world
>
>
>
> I have been trying to implement Collaborative filtering using MlLib supplied
> out of the box with Spark and Scala
>
>
>
> I have 2 problems
>
>
>
> 1. The best model was
I'll try out setting a smaller number of item blocks. And
>> yes, I've been following the JIRA for the new ALS implementation. I'll try
>> it out when it's ready for testing. .
>>
>> On Wed, Dec 3, 2014 at 4:24 AM, Xiangrui Meng wrote:
>>>
>>
Hi Jay,
Please try increasing executor memory (if the available memory is more
than 2GB) and reducing numBlocks in ALS. The current implementation
stores all subproblems in memory and hence the memory requirement is
significant when k is large. You can also try reducing k and see
whether the problem
Dear Spark users and developers,
I’m happy to announce Spark Packages (http://spark-packages.org), a
community package index to track the growing number of open source
packages and libraries that work with Apache Spark. Spark Packages
makes it easy for users to find, discuss, rate, and install pac
Did you check the indices in the LIBSVM data and the master file? Do
they match? -Xiangrui
On Sat, Dec 20, 2014 at 8:13 AM, Sameer Tilak wrote:
> Hi All,
> I use LIBSVM format to specify my input feature vector, which used 1-based
> index. When I run regression the o/p is 0-indexed based. I have
How big is the dataset you want to use in prediction? -Xiangrui
On Mon, Dec 22, 2014 at 1:47 PM, boci wrote:
> Hi!
>
> I want to try out spark mllib in my spark project, but I got a little
> problem. I have training data (external file), but the real data com from
> another rdd. How can I do that
We have streaming linear regression (since v1.1) and k-means (v1.2) in
MLlib. You can check the user guide:
http://spark.apache.org/docs/latest/mllib-linear-methods.html#streaming-linear-regression
http://spark.apache.org/docs/latest/mllib-clustering.html#streaming-clustering
-Xiangrui
On Tue, D
Sean's PR may be relevant to this issue
(https://github.com/apache/spark/pull/3702). As a workaround, you can
try to truncate the raw scores to 4 digits (e.g., 0.5643215 -> 0.5643)
before sending them to BinaryClassificationMetrics. This may not work
well if the score distribution is very skewed. See
Hopefully the new pipeline API addresses this problem. We have a code
example here:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/SimpleTextClassificationPipeline.scala
-Xiangrui
On Mon, Dec 29, 2014 at 5:22 AM, andy petrella wrote:
> Here is w
Could you post your code? It sounds like a bug. One thing to check is
whether you set regType, which is None by default. -Xiangrui
On Tue, Dec 23, 2014 at 3:36 PM, Thomas Kwan wrote:
> Hi there
>
> We are on mllib 1.1.1, and trying different regularization parameters. We
> noticed that the re
vice?)
>
> b0c1
>
> --
> Skype: boci13, Hangout: boci.b...@gmail.com
>
> On Tue, Dec 23, 2014 at 1:35 AM, Xiangrui Meng wrote:
>>
>> How big is the dataset you want to use in prediction? -Xiangrui
>>
>
There is an SVD++ implementation in GraphX. It would be nice if you
can compare its performance vs. Mahout. -Xiangrui
On Wed, Dec 24, 2014 at 6:46 AM, Prafulla Wani wrote:
> hi ,
>
> Is there any plan to add SVDPlusPlus based recommender to MLLib ? It is
> implemented in Mahout from this paper -
It might be hard to do that with spark-submit, because the executor
JVMs may be already up and running before a user runs spark-submit.
You can try to use `System.setProperty` to change the property at
runtime, though it doesn't seem to be a good solution. -Xiangrui
On Fri, Jan 2, 2015 at 6:28 AM,
I created a JIRA for it:
https://issues.apache.org/jira/browse/SPARK-5094. Hopefully someone
would work on it and make it available in the 1.3 release. -Xiangrui
On Sun, Jan 4, 2015 at 6:58 PM, Christopher Thom
wrote:
> Hi,
>
>
>
> I wonder if anyone knows when a python API will be added for Grad
How big is your dataset, and what is the vocabulary size? -Xiangrui
On Sun, Jan 4, 2015 at 11:18 PM, Eric Zhen wrote:
> Hi,
>
> When we run mllib word2vec (spark-1.1.0), the driver gets stuck with 100% CPU
> usage. Here is the jstack output:
>
> "main" prio=10 tid=0x40112800 nid=0x46f2 runnable
rk-submit script
> Am I correct?
>
>
>
> On Mon, Jan 5, 2015 at 10:35 PM, Xiangrui Meng wrote:
>>
>> It might be hard to do that with spark-submit, because the executor
>> JVMs may be already up and running before a user runs spark-submit.
>> You can try to
Which Spark version are you using? We made this configurable in 1.1:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala#L202
-Xiangrui
On Tue, Jan 6, 2015 at 12:57 PM, Fernando O. wrote:
> Hi,
>I was doing a tests with ALS and I
This is addressed in https://issues.apache.org/jira/browse/SPARK-4789.
In the new pipeline API, we can simply output two columns, one for the
best predicted class, and the other for probabilities or confidence
scores for each class. -Xiangrui
On Tue, Jan 6, 2015 at 11:43 AM, Jianguo Li wrote:
> H
Could you attach the executor log? That may help identify the root
cause. -Xiangrui
On Mon, Jan 5, 2015 at 11:12 PM, Priya Ch wrote:
> Hi All,
>
> Word2Vec and TF-IDF algorithms in spark mllib-1.1.0 are working only in
> local mode and not on distributed mode. Null pointer exception has been
> th
There is some serialization overhead. You can try
https://github.com/apache/spark/blob/master/python/pyspark/mllib/stat.py#L107
. -Xiangrui
On Wed, Jan 7, 2015 at 9:42 AM, rok wrote:
> I have an RDD of SparseVectors and I'd like to calculate the means returning
> a dense vector. I've tried doing
The Julia code is computing the SVD of the Gram matrix. PCA should be
applied to the covariance matrix. -Xiangrui
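Concretely, with a data matrix X whose rows are the n samples and column mean
vector mu, the Gram matrix is (1/n) X^T X while the covariance matrix is
(1/n) X^T X - mu mu^T (up to the n vs. n-1 convention), so the data has to be
mean-centered before taking the SVD/eigendecomposition if you want PCA.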
On Thu, Jan 8, 2015 at 8:27 AM, Upul Bandara wrote:
> Hi All,
>
> I tried to do PCA for the Iris dataset
> [https://archive.ics.uci.edu/ml/datasets/Iris] using MLLib
> [http://spark.a
On Wed, Jan 7, 2015 at 10:51 AM, Xiangrui Meng wrote:
>>
>> Could you attach the executor log? That may help identify the root
>> cause. -Xiangrui
>>
>> On Mon, Jan 5, 2015 at 11:12 PM, Priya Ch
>> wrote:
>> > Hi All,
>> >
>>
*X ;
>
> Thanks,
> Upul
>
> On Fri, Jan 9, 2015 at 2:11 AM, Xiangrui Meng wrote:
>>
>> The Julia code is computing the SVD of the Gram matrix. PCA should be
>> applied to the covariance matrix. -Xiangrui
>>
>> On Thu, Jan 8, 2015 at 8:27 AM, Upul Band
Do you have more than 2 billion users/products? If not, you can pair
each user/product id with an integer (check RDD.zipWithUniqueId), use
them in ALS, and then join the original bigInt IDs back after
training. -Xiangrui
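A sketch of that id-mapping approach, assuming `raw: RDD[(Long, Long, Double)]`
holds the original (userId, productId, rating) triples; the ALS parameters are
arbitrary:

import org.apache.spark.SparkContext._
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Assign a compact Int id to every distinct user and product.
val userIdMap = raw.map(_._1).distinct().zipWithUniqueId()
  .map { case (bigId, idx) => (bigId, idx.toInt) }
val productIdMap = raw.map(_._2).distinct().zipWithUniqueId()
  .map { case (bigId, idx) => (bigId, idx.toInt) }

// Translate the ratings to the Int ids and train as usual.
val ratings = raw.map { case (u, p, r) => (u, (p, r)) }.join(userIdMap)
  .map { case (_, ((p, r), uInt)) => (p, (uInt, r)) }.join(productIdMap)
  .map { case (_, ((uInt, r), pInt)) => Rating(uInt, pInt, r) }
val model = ALS.train(ratings, 10, 10, 0.01)

// Join the original ids back onto the learned user factors.
val userFactorsWithOriginalIds = userIdMap.map(_.swap).join(model.userFeatures)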
On Fri, Jan 9, 2015 at 5:12 PM, nishanthps wrote:
> Hi,
>
> The userId's and
"sample 2 * n tuples, split them into two parts, balance the sizes of
these parts by filtering some tuples out"
How do you guarantee that the two RDDs have the same size?
-Xiangrui
On Fri, Jan 9, 2015 at 3:40 AM, Niklas Wilcke
<1wil...@informatik.uni-hamburg.de> wrote:
> Hi Spark community,
>
>
How big is your data? Did you see other error messages from executors?
It seems to me like a shuffle communication error. This thread may be
relevant:
http://mail-archives.apache.org/mod_mbox/spark-user/201402.mbox/%3ccalrnvjuvtgae_ag1rqey_cod1nmrlfpesxgsb7g8r21h0bm...@mail.gmail.com%3E
-Xiangru
> required: 8
>
> is there an easy/obvious fix?
>
>
> On Wed, Jan 7, 2015 at 7:30 PM, Xiangrui Meng wrote:
>>
>> There is some serialization overhead. You can try
>>
>> https://github.com/apache/spark/blob/master/python/pyspark/mllib/stat.py#L107
>> .
>
> Now PCA calculated using Julia and R is identical, but still I can see a
> small
> difference between PCA values given by Spark and other two.
>
> Thanks,
> Upul
>
> On Sat, Jan 10, 2015 at 11:17 AM, Xiangrui Meng wrote:
>>
>> You need to subtract mean values t
ryoException: Buffer overflow. Available: 5,
> required: 8
>
> Just calling colStats doesn't actually compute those statistics, does it? It
> looks like the computation is only carried out once you call the .mean()
> method.
>
>
>
> On Sat, Jan 10, 2015 at 7:04 AM, Xiangru
I don't know the root cause. Could you try including only
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.1.1"
It should be sufficient because mllib depends on core.
-Xiangrui
On Mon, Jan 12, 2015 at 2:27 PM, Jianguo Li wrote:
> Hi,
>
> I am trying to build my own scala project
.
Best,
Xiangrui
On Wed, Jan 14, 2015 at 1:04 PM, Nishanth P S wrote:
> Yes, we are close to having more 2 billion users. In this case what is the
> best way to handle this.
>
> Thanks,
> Nishanth
>
> On Fri, Jan 9, 2015 at 9:50 PM, Xiangrui Meng wrote:
>>
>> Do y
Yes, you can only use RowMatrix.multiply() within the driver. We are
working on distributed block matrices and linear algebra operations on
top of it, which would fit your use cases well. It may take several
PRs to finish. You can find the first one here:
https://github.com/apache/spark/pull/3200 -
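For reference, a tiny example of the driver-side multiply that exists today (toy
data; run from spark-shell so `sc` is available):

import org.apache.spark.mllib.linalg.{Matrices, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0),
  Vectors.dense(3.0, 4.0)))
val mat = new RowMatrix(rows)

// The right-hand side must be a local Matrix that fits on the driver.
val local = Matrices.dense(2, 2, Array(1.0, 0.0, 0.0, 1.0))  // 2x2 identity, column-major
val product: RowMatrix = mat.multiply(local)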
Hi Sameer,
Lin (cc'ed) could also give you some updates about Pig on Spark
development on her side.
Best,
Xiangrui
On Mon, Mar 10, 2014 at 12:52 PM, Sameer Tilak wrote:
> Hi Mayur,
> We are planning to upgrade our distribution MR1> MR2 (YARN) and the goal is
> to get SPROK set up next month. I
Hi Michael,
I can help check the current implementation. Would you please go to
https://spark-project.atlassian.net/browse/SPARK and create a ticket
about this issue with component "MLlib"? Thanks!
Best,
Xiangrui
On Tue, Mar 11, 2014 at 3:18 PM, Michael Allman wrote:
> Hi,
>
> I'm implementing
Line 376 should be correct as it is computing \sum_i (c_i - 1) x_i
x_i^T = \sum_i (alpha * r_i) x_i x_i^T. Are you computing some
metrics to tell which recommendation is better? -Xiangrui
On Tue, Mar 11, 2014 at 6:38 PM, Xiangrui Meng wrote:
> Hi Michael,
>
> I can help check th
Hi Michael,
Thanks for looking into the details! Computing X first and computing Y
first can deliver different results, because the initial objective
values could differ by a lot. But the algorithm should converge after
a few iterations. It is hard to tell which should go first. After all,
the def
The factor matrix Y is used twice in the implicit ALS computation: once to
compute the global Y^T Y, and again to compute the local Y_i^T C_i Y_i.
-Xiangrui
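In the notation of the implicit-feedback ALS paper (Hu, Koren, Volinsky), each
user factor solves

  x_u = (Y^T Y + Y^T (C_u - I) Y + lambda * I)^{-1} Y^T C_u p(u),

where Y^T Y is shared across all users and Y^T (C_u - I) Y only involves the
items user u interacted with, which is the "local" part mentioned above.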
On Sun, Mar 16, 2014 at 1:18 PM, Matei Zaharia wrote:
> On Mar 14, 2014, at 5:52 PM, Michael Allman wrote:
>
> I also found that the product and user
Hi Michael,
I made a couple of changes to implicit ALS. One gives faster construction
of YtY (https://github.com/apache/spark/pull/161), which was merged
into master. The other caches intermediate matrix factors properly
(https://github.com/apache/spark/pull/165). They should give you the
same result a
Sorry, the link was wrong. Should be
https://github.com/apache/spark/pull/131 -Xiangrui
On Tue, Mar 18, 2014 at 10:20 AM, Michael Allman wrote:
> Hi Xiangrui,
>
> I don't see how https://github.com/apache/spark/pull/161 relates to ALS. Can
> you explain?
>
> Also, thanks for addressing the issue
Hi Jaonary,
With the current implementation, you need to call Array.slice to make
each row an Array[Double] and cache the result RDD. There is a plan to
support block-wise input data and I will keep you informed.
Best,
Xiangrui
On Tue, Mar 18, 2014 at 2:46 AM, Jaonary Rabarisoa wrote:
> Dear Al
Glad to hear about the speed-up. I hope we can improve the implementation
further in the future. -Xiangrui
On Tue, Mar 18, 2014 at 1:55 PM, Michael Allman wrote:
> I just ran a runtime performance comparison between 0.9.0-incubating and your
> als branch. I saw a 1.5x improvement in performance.
>
>
>
>
Hi Tsai,
Could you share more information about the machine you used and the
training parameters (runs, k, and iterations)? It can help solve your
issues. Thanks!
Best,
Xiangrui
On Sun, Mar 23, 2014 at 3:15 AM, Tsai Li Ming wrote:
> Hi,
>
> At the reduceByKey stage, it takes a few minutes befo
e first iteration. K=50. Here's the code I use:
> http://pastebin.com/2yXL3y8i , which is a copy-and-paste of the example.
>
> Thanks!
>
>
>
> On 24 Mar, 2014, at 2:46 pm, Xiangrui Meng wrote:
>
>> Hi Tsai,
>>
>> Could you share more informati
> Does the size of the input data matters for the example? Currently I have 50M
> rows. What is a reasonable size to demonstrate the capability of Spark?
>
>
>
>
>
> On 24 Mar, 2014, at 3:38 pm, Xiangrui Meng wrote:
>
>> K = 50 is certainly a large number for k-m
" here is the app/driver/spark-shell?
>
> Thanks!
>
> On 25 Mar, 2014, at 1:03 am, Xiangrui Meng wrote:
>
>> Number of rows doesn't matter much as long as you have enough workers
>> to distribute the work. K-means has complexity O(n * d * k), where n
>> is
Hi bearrito,
This is a known issue
(https://spark-project.atlassian.net/browse/SPARK-1281) and it should
be easy to fix by switching to a hash partitioner.
CC'ed dev list in case someone volunteers to work on it.
Best,
Xiangrui
On Thu, Mar 27, 2014 at 8:38 PM, bearrito wrote:
> Usage of negati
From API docs: "Zips this RDD with another one, returning key-value
pairs with the first element in each RDD, second element in each RDD,
etc. Assumes that the two RDDs have the *same number of partitions*
and the *same number of elements in each partition* (e.g. one was made
through a map on the
This is fixed in https://github.com/apache/spark/pull/281. Please try
again with the latest master. -Xiangrui
On Mon, Apr 7, 2014 at 1:06 PM, Koert Kuipers wrote:
> i noticed that for spark 1.0.0-SNAPSHOT which i checked out a few days ago
> (apr 5) that the "application detail ui" no longer show
Fix storage UI bug
>
>
>
> On Mon, Apr 7, 2014 at 4:21 PM, Koert Kuipers wrote:
>>
>> got it thanks
>>
>>
>> On Mon, Apr 7, 2014 at 4:08 PM, Xiangrui Meng wrote:
>>>
>>> This is fixed in https://github.com/apache/spark/pull/281. Ple
This is because the tasks associated with re-using the RDD do not
>>> report the RDD's blocks as updated (which is correct). On stage submit,
>>> however, we overwrite any existing
>>>
>>> Author: Andrew Or
>>>
>>> Closes #281 from
After sbt/sbt gen-idea, do not import it as an SBT project but choose
"open project" and point it to the spark folder. -Xiangrui
On Tue, Apr 8, 2014 at 10:45 PM, Sean Owen wrote:
> I let IntelliJ read the Maven build directly and that works fine.
> --
> Sean Owen | Director, Data Science | London
>
It was moved to mllib.linalg.distributed.RowMatrix. With RowMatrix,
you can compute column summary statistics, gram matrix, covariance,
SVD, and PCA. We will provide multiplication for distributed matrices,
but not in v1.0. -Xiangrui
On Fri, Apr 11, 2014 at 9:12 PM, wxhsdp wrote:
> Hi, all
> the
Checkpoint clears dependencies. You might need checkpoint to cut a
long lineage in iterative algorithms. -Xiangrui
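A minimal sketch of the checkpoint-every-k-iterations pattern (the directory,
the data, and the interval are arbitrary):

sc.setCheckpointDir("hdfs:///tmp/checkpoints")

var rdd = sc.parallelize(1 to 1000000).map(_.toDouble)
for (i <- 1 to 100) {
  rdd = rdd.map(_ * 1.001).persist()
  if (i % 25 == 0) {
    rdd.checkpoint()  // marks the RDD; lineage is truncated once it is materialized
    rdd.count()       // forces materialization so the checkpoint is actually written
  }
}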
On Mon, Apr 21, 2014 at 11:34 AM, Diana Carroll wrote:
> I'm trying to understand when I would want to checkpoint an RDD rather than
> just persist to disk.
>
> Every reference I can
sist and
> replicate my RDD to avoid re-computation, if that's my goal. What advantage
> does checkpointing provide over disk persistence with replication?
>
>
> On Mon, Apr 21, 2014 at 2:42 PM, Xiangrui Meng wrote:
>>
>> Checkpoint clears dependencies. You might nee
The doc is for 0.9.1. You are running a later snapshot, which added
sparse vectors. Try LabeledPoint(parts(0).toDouble,
Vectors.dense(parts(1).split(' ').map(x => x.toDouble))). The examples
are updated in the master branch. You can also check the examples
there. -Xiangrui
On Wed, Apr 23, 2014 at 9
If the first partition doesn't have enough records, then it may not
drop enough lines. Try
rddData.zipWithIndex().filter(_._2 >= 10L).map(_._1)
It might trigger a job.
Best,
Xiangrui
On Wed, Apr 23, 2014 at 9:46 AM, DB Tsai wrote:
> Hi Chengi,
>
> If you just want to skip first n lines in RDD,
How big is each entry, and how much memory do you have on each
executor? You generated all data on driver and
sc.parallelize(bytesList) will send the entire dataset to a single
executor. You may run into I/O or memory issues. If the entries are
generated, you should create a simple RDD sc.paralleli
;
>> What happens, if I have to drop so many records that the number exceeds
>> partition 0.. ??
>> How do i handle that case?
>>
>>
>>
>>
>> On Wed, Apr 23, 2014 at 9:51 AM, Xiangrui Meng wrote:
>>>
>>> If the first partition
PipedRDD is an RDD[String]. If you know how to parse each result line
into (key, value) pairs, then you can call reduce after.
piped.map(x => (key, value)).reduceByKey((v1, v2) => v)
-Xiangrui
On Wed, Apr 23, 2014 at 2:09 AM, zhxfl <291221...@qq.com> wrote:
> Hello,we know Hadoop-streaming is us
Hi bearrito, this issue was fixed by Tor in
https://github.com/apache/spark/pull/407. You can either try the
master branch or wait for the 1.0 release. -Xiangrui
On Fri, Mar 28, 2014 at 12:19 AM, Xiangrui Meng wrote:
> Hi bearrito,
>
> This is a known issue
> (https://spark-project.a
Which spark version are you using? Could you also include the worker
logs? -Xiangrui
On Wed, Apr 23, 2014 at 3:19 PM, Ian Ferreira wrote:
> I am getting this cryptic error running LinearRegressionwithSGD
>
> Data sample
> LabeledPoint(39.0, [144.0, 1521.0, 20736.0, 59319.0, 2985984.0])
>
> 14/04
Could you share the command you used and more of the error message?
Also, is it an MLlib specific problem? -Xiangrui
On Thu, Apr 24, 2014 at 11:49 AM, John King
wrote:
> ./spark-shell: line 153: 17654 Killed
> $FWDIR/bin/spark-class org.apache.spark.repl.Main "$@"
>
>
> Any ideas?
Is your Spark cluster running? Try to start with generating simple
RDDs and counting. -Xiangrui
On Thu, Apr 24, 2014 at 11:38 AM, John King
wrote:
> I receive this error:
>
> Traceback (most recent call last):
>
> File "", line 1, in
>
> File
> "/home/ubuntu/spark-1.0.0-rc2/python/pyspark/ml
The data array in the RDD is passed by reference to jblas, so there is no data
copying at this stage. However, if jblas uses the native interface, there is a
copying overhead. I think jblas uses the Java implementation for at least
Level 1 BLAS, and calls the native interface for Levels 2 & 3. -Xiangrui
On Thu, Apr 24,
and mapping. Just
> received this error when trying to classify.
>
>
> On Thu, Apr 24, 2014 at 4:32 PM, Xiangrui Meng wrote:
>>
>> Is your Spark cluster running? Try to start with generating simple
>> RDDs and counting. -Xiangrui
>>
>> On Thu, Apr 24,
Do you mind sharing more code and error messages? The information you
provided is too little to identify the problem. -Xiangrui
On Thu, Apr 24, 2014 at 1:55 PM, John King wrote:
> Last command was:
>
> val model = new NaiveBayes().run(points)
>
>
>
> On Thu, Apr 24, 2014
}
>
>val vector = new SparseVector(2357815, indices.toArray,
> featValues.toArray)
>
>return LabeledPoint(values(0).toDouble, vector)
>
> }
>
>
> val data = sc.textFile("data.txt")
>
> val empty = data
entioned in the error have anything to do with it?
>
>
> On Thu, Apr 24, 2014 at 7:54 PM, Xiangrui Meng wrote:
>>
>> I don't see anything wrong with your code. Could you do points.count()
>> to see how many training examples you have? Also, make sure you don
How many labels does your dataset have? -Xiangrui
On Sat, Apr 26, 2014 at 6:03 PM, DB Tsai wrote:
> Which version of mllib are you using? For Spark 1.0, mllib will
> support sparse feature vector which will improve performance a lot
> when computing the distance between points and centroid.
>
> S
partitions and giving driver more ram and see
whether it can help? -Xiangrui
On Sun, Apr 27, 2014 at 3:33 PM, John King wrote:
> I'm already using the SparseVector class.
>
> ~200 labels
>
>
> On Sun, Apr 27, 2014 at 12:26 AM, Xiangrui Meng wrote:
>>
>> How many label
; cache in HDFS.
>
>
> Sincerely,
>
> DB Tsai
> ---
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
>
>
> On Sun, Apr 27, 2014 at 7:34 PM, Xiangrui Meng wrote:
>>
>> Eve
Hi Diana,
SparkALS is an example implementation of ALS. It doesn't call the ALS
algorithm implemented in MLlib. M, U, and F are used to generate
synthetic data.
I'm updating the examples. In the meantime, you can take a look at the
updated MLlib guide:
http://50.17.120.186:4000/mllib-collaborativ
Hi Chieh-Yen,
Great to see the Spark implementation of LIBLINEAR! We will definitely
consider adding a wrapper in MLlib to support it. Is the source code
on github?
Deb, Spark LIBLINEAR uses BSD license, which is compatible with Apache.
Best,
Xiangrui
On Sun, May 11, 2014 at 10:29 AM, Debasish
Those are warning messages instead of errors. You need to add
netlib-java:all to use native BLAS/LAPACK. But it won't work if you
include netlib-java:all in an assembly jar. It has to be a separate
jar when you submit your job. For SGD, we only use level-1 BLAS, so I
don't think native code is call
Hi Deb, feel free to add accuracy along with precision and recall. -Xiangrui
On Mon, May 12, 2014 at 1:26 PM, Debasish Das wrote:
> Hi,
>
> I see precision and recall but no accuracy in mllib.evaluation.binary.
>
> Is it already under development or it needs to be added ?
>
> Thanks.
> Deb
>
You have a long lineage that causes the StackOverflow error. Try
rdd.checkpoint() and rdd.count() every 20~30 iterations.
checkpoint() can cut the lineage. -Xiangrui
On Mon, May 12, 2014 at 3:42 PM, Guanhua Yan wrote:
> Dear Sparkers:
>
> I am using Python spark of version 0.9.0 to implement so
e RDD when it is materialized & it only
> materializes in the end, then it runs out of stack.
>
> Regards
> Mayur
>
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoidanalytics.com
> @mayur_rustagi
>
>
>
> On Tue, May 13, 2014 at 11:40 AM, Xiangru
Which hadoop version did you use? I'm not sure whether Hadoop v2 fixes
the problem you described, but it does contain several fixes to bzip2
format. -Xiangrui
On Wed, May 7, 2014 at 9:19 PM, Andrew Ash wrote:
> Hi all,
>
> Is anyone reading and writing to .bz2 files stored in HDFS from Spark with