Re: Apply Kmeans in partitions

2019-01-30 Thread Apostolos N. Papadopoulos
ues from 0-4 depending on which group each row belongs to. I am trying to split my dataframe into 5 partitions and apply Kmeans to every partition. I have tried rdd = mydataframe.rdd.mapPartitions(function, True) test = Kmeans.train(rdd, num_of_centers, "random") but i get an err

Apply Kmeans in partitions

2019-01-30 Thread dimitris plakas
Hello everyone, I have a dataframe which has 5040 rows, and these rows are split into 5 groups. So I have a column called "Group_Id" which marks every row with values from 0-4 depending on which group each row belongs to. I am trying to split my dataframe into 5 partitions and ap
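
One likely fix is to train per group rather than per physical partition, since partition boundaries are not guaranteed to line up with the 5 groups and MLlib's KMeans.train launches Spark jobs itself, so it cannot run inside mapPartitions. A minimal Scala sketch under those assumptions (the feature columns "f1"/"f2" and numOfCenters stand in for the poster's real schema and num_of_centers):

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// One model per Group_Id value, each trained from the driver.
val models = (0 to 4).map { g =>
  val groupRdd = mydataframe
    .filter(mydataframe("Group_Id") === g)
    .rdd
    .map(r => Vectors.dense(r.getAs[Double]("f1"), r.getAs[Double]("f2")))
    .cache()
  g -> KMeans.train(groupRdd, numOfCenters, 20) // 20 = max iterations
}.toMap
```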

[Spark MLib]: RDD caching behavior of KMeans

2018-07-10 Thread mkhan37
Hi All, I was varying the storage levels of RDD caching in the KMeans program implemented using the MLlib library and got some very confusing and interesting results. The base code of the application is from a benchmark suite named SparkBench <https://github.com/CODAIT/spark-bench> . I c

Understanding the results from Spark's KMeans clustering object

2018-05-18 Thread shubham
if I get similar results. The code I used for spark and sklearn is in the appendix section towards the end of the post. I have tried to use the same values for the parameters in the spark and sklearn KMeans models. The following are the results from sklearn and they are as I expected them
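
For reference, a hedged sketch of aligning the DataFrame-API parameters with sklearn's defaults (the concrete values, seed, and column name are assumptions); note that Spark's "k-means||" is a distributed variant of sklearn's k-means++ seeding, so identical results are not guaranteed even with matching parameters:

```scala
import org.apache.spark.ml.clustering.KMeans

// Roughly mirror sklearn's KMeans(n_clusters=3, max_iter=300, tol=1e-4).
val kmeans = new KMeans()
  .setK(3)
  .setMaxIter(300)
  .setTol(1e-4)
  .setInitMode("k-means||") // Spark's distributed take on k-means++
  .setSeed(42L)
  .setFeaturesCol("features")
val model = kmeans.fit(dataset)
model.clusterCenters.foreach(println)
```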

Bisecting Kmeans Linkage Matrix Output (Cluster Indices)

2018-03-14 Thread GabeChurch
I have been working on a project to return a Linkage Matrix output from the Spark Bisecting Kmeans Algorithm output so that it is possible to plot the selection steps in a dendrogram. I am having trouble returning valid Indices when I use more than 3-4 clusters in the algorithm and am hoping

Re: Apache Spark documentation on mllib's Kmeans doesn't jibe.

2017-12-13 Thread Scott Reynolds
Segel <msegel_had...@hotmail.com> wrote: > Hi, > > Just came across this while looking at the docs on how to use Spark’s > Kmeans clustering. > > Note: This appears to be true in both 2.1 and 2.2 documentation. > > The overview page: > https://spark.apache.org/docs/2.

Apache Spark documentation on mllib's Kmeans doesn't jibe.

2017-12-13 Thread Michael Segel
Hi, Just came across this while looking at the docs on how to use Spark’s Kmeans clustering. Note: This appears to be true in both the 2.1 and 2.2 documentation. The overview page: https://spark.apache.org/docs/2.1.0/mllib-clustering.html#k-means Here the example contains the following line

Re: KMeans Clustering is not Reproducible

2017-05-24 Thread Christoph Brücke
Hi Ankur, thank you for answering. But my problem is not that I'm stuck in a local extremum, but rather the reproducibility of kmeans. What I'm trying to achieve is: when the input data and all the parameters stay the same, especially the seed, I want to get the exact same results. Even though

Re: KMeans Clustering is not Reproducible

2017-05-24 Thread Yu Zhang
I agree with what Ankur said. The kmeans seeding program (the 'takeSample' method) runs in parallel, so each partition draws its sampling points from its local data, which is why the result is not partition-agnostic. The seeding method is based on Bahmani et al.'s kmeans|| algorithm, which gives approximation

Re: KMeans Clustering is not Reproducible

2017-05-24 Thread Ankur Srivastava
Hi Christoph, I am not an expert in ML and have not used Spark KMeans but your problem seems to be an issue of local minimum vs global minimum. You should run K-means multiple times with random starting point and also try with multiple values of K (unless you are already sure). Hope this helps

Re: KMeans Clustering is not Reproducible

2017-05-24 Thread Christoph Bruecke
// generate random data for clustering val randomData = spark.range(1, 1000).withColumn("a", rand(123)).withColumn("b", rand(321)) val vecAssembler = new VectorAssembler().setInputCols(Array("a", "b")).setOutputCol("features") val data = vecAsse
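
The snippet above is cut off; a runnable completion of the reproducibility test it describes (the k value, seed, and partition counts are assumptions):

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.functions.rand

// generate random data for clustering
val randomData = spark.range(1, 1000)
  .withColumn("a", rand(123))
  .withColumn("b", rand(321))
val vecAssembler = new VectorAssembler()
  .setInputCols(Array("a", "b"))
  .setOutputCol("features")
val data = vecAssembler.transform(randomData)

// identical KMeans, identical seed, different partitioning
val kmeans = new KMeans().setK(10).setSeed(9876L)
val costA = kmeans.fit(data.repartition(1)).computeCost(data)
val costB = kmeans.fit(data.repartition(4)).computeCost(data)
println(s"cost with 1 partition: $costA, with 4 partitions: $costB")
```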

Re: KMeans Clustering is not Reproducible

2017-05-22 Thread Anastasios Zouzias
Hi Christoph, Take a look at this, you might end up having a similar case: http://www.spark.tc/using-sparks-cache-for-correctness-not-just-performance/ If this is not the case, then I agree with you the kmeans should be partitioning agnostic (although I haven't check the code yet). Best

KMeans Clustering is not Reproducible

2017-05-22 Thread Christoph Bruecke
Hi, I’m trying to figure out how to use KMeans in order to achieve reproducible results. I have found that running the same kmeans instance on the same data, but with different partitioning, will produce different clusterings. Given a simple KMeans run with fixed seed returns different results

Re: [MLlib] kmeans random initialization, same seed every time

2017-03-15 Thread Yuhao Yang
> in Scala 2.11.8. > > 2017-03-14 13:44 GMT+01:00 Julian Keppel <juliankeppel1...@gmail.com>: > >> Hi everybody, >> >> I make some experiments with the Spark kmeans implementation of the new >> DataFrame-API. I compare clustering results of differ

Re: [MLlib] kmeans random initialization, same seed every time

2017-03-14 Thread Julian Keppel
I'm sorry, I missed some important information. I use Spark version 2.0.2 in Scala 2.11.8. 2017-03-14 13:44 GMT+01:00 Julian Keppel <juliankeppel1...@gmail.com>: > Hi everybody, > > I make some experiments with the Spark kmeans implementation of the new > DataFrame-API. I

[MLlib] kmeans random initialization, same seed every time

2017-03-14 Thread Julian Keppel
Hi everybody, I make some experiments with the Spark kmeans implementation of the new DataFrame-API. I compare clustering results of different runs with different parameters. I recognized that for random initialization mode, the seed value is the same every time. How is it calculated? In my
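
A likely explanation (an assumption based on the shared HasSeed param of ML estimators): when no seed is set, the default appears to be derived from a hash of the class name, which is constant across runs. Setting the seed explicitly restores run-to-run variation, as in this sketch (k is an assumption):

```scala
import org.apache.spark.ml.clustering.KMeans

// The default seed is the same on every run; pass your own for
// varying random initializations.
val kmeans = new KMeans()
  .setK(10)
  .setInitMode("random")
  .setSeed(scala.util.Random.nextLong())
```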

Re: ML version of Kmeans

2017-01-31 Thread Hollin Wilkins
< >> mrajaf...@gmail.com> wrote: >> >>> Hi, >>> >>> I am not able to find predict method on "ML" version of Kmeans. >>> >>> Mllib version has a predict method. KMeansModel.predict(point: Vector) >>> . >>> How to

Re: ML version of Kmeans

2017-01-31 Thread Aseem Bansal
not able to find predict method on "ML" version of Kmeans. >> >> Mllib version has a predict method. KMeansModel.predict(point: Vector) >> . >> How to predict the cluster point for new vectors in ML version of kmeans ? >> >> Regards, >> Rajesh >> > > > > -- > Cell : 425-233-8271 > Twitter: https://twitter.com/holdenkarau >

Re: ML version of Kmeans

2017-01-31 Thread Holden Karau
on "ML" version of Kmeans. > > Mllib version has a predict method. KMeansModel.predict(point: Vector) > . > How to predict the cluster point for new vectors in ML version of kmeans ? > > Regards, > Rajesh > -- Cell : 425-233-8271 Twitter: https://twitter.com/holdenkarau

ML version of Kmeans

2017-01-31 Thread Madabhattula Rajesh Kumar
Hi, I am not able to find a predict method on the "ML" version of Kmeans. The Mllib version has a predict method: KMeansModel.predict(point: Vector). How do I predict the cluster for new vectors in the ML version of kmeans? Regards, Rajesh
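
In the DataFrame-based API the fitted model is a Transformer, so the counterpart of MLlib's predict is transform, which appends a prediction column. A minimal sketch (the DataFrame names are assumptions; both frames need a "features" vector column):

```scala
import org.apache.spark.ml.clustering.KMeans

val model = new KMeans().setK(3).fit(trainingDf)
val clustered = model.transform(newVectorsDf) // adds a "prediction" column
clustered.select("features", "prediction").show()
```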

PySpark 2: Kmeans The input data is not directly cached

2016-11-03 Thread Zakaria Hili
Hi, I don't know why I receive the message WARN KMeans: The input data is not directly cached, which may hurt performance if its parent RDDs are also uncached. when I try to use Spark Kmeans df_Part = assembler.transform(df_Part) df_Part.cache() while (k <= max_cluster) and (wssse > seu
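
The warning refers to the internal RDD[Vector] that KMeans actually iterates over, not the DataFrame, so caching df_Part alone does not silence it. A Scala sketch of caching the vector RDD itself before training (variable names, k, and the column layout are assumptions):

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// cache the RDD[Vector] that training actually touches
val vectors = dfPart.select("features").rdd
  .map(_.getAs[org.apache.spark.ml.linalg.Vector](0))
  .map(v => Vectors.dense(v.toArray)) // ml -> mllib vector conversion
  .cache()
val model = KMeans.train(vectors, k, 20)
```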

Re: Why training data in Kmeans Spark streaming clustering

2016-08-11 Thread Bryan Cutler
about that, you could just use a single stream for both steps. On Thu, Aug 11, 2016 at 9:14 AM, Ahmed Sadek <don1...@gmail.com> wrote: > Dear All, > > I was wondering why there is training data and testing data in kmeans ? > Shouldn't it be unsupervised learning with just acc

Why training data in Kmeans Spark streaming clustering

2016-08-11 Thread Ahmed Sadek
Dear All, I was wondering why there is training data and testing data in kmeans? Shouldn't it be unsupervised learning with just access to stream data? I found a similar question but couldn't understand the answer. http://stackoverflow.com/questions/30972057/is-the-streaming-k-means-clustering

RE: bisecting kmeans model tree

2016-08-09 Thread Huang, Qian
There seems to be an existing JIRA for this. https://issues.apache.org/jira/browse/SPARK-11664 From: Yanbo Liang [mailto:yblia...@gmail.com] Sent: Saturday, July 16, 2016 6:18 PM To: roni <roni.epi...@gmail.com> Cc: user@spark.apache.org Subject: Re: bisecting kmeans model tree Currently

Re: Kmeans dataset initialization

2016-08-06 Thread Tony Lane
Can anyone suggest how I can initialize kmeans structure directly from a dataset of Row On Sat, Aug 6, 2016 at 1:03 AM, Tony Lane <tonylane@gmail.com> wrote: > I have all the data required for KMeans in a dataset in memory > > Standard approach to load this data from a file

Kmeans dataset initialization

2016-08-05 Thread Tony Lane
I have all the data required for KMeans in a dataset in memory. The standard approach to load this data from a file is spark.read().format("libsvm").load(filename) where the file has data in the format 0 1:0.0 2:0.0 3:0.0 How do I do this from an in-memory dataset already present? Any s
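
A sketch of going from an in-memory Dataset straight to the RDD[Vector] that mllib's KMeans.train expects, skipping the libsvm round-trip (rowDataset and its three-double schema are assumptions):

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Row -> Vector, assuming each row carries three numeric columns
val vectors = rowDataset.rdd
  .map(r => Vectors.dense(r.getDouble(0), r.getDouble(1), r.getDouble(2)))
  .cache()
val model = KMeans.train(vectors, 3, 20) // k = 3, maxIterations = 20
```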

Re: bisecting kmeans model tree

2016-07-16 Thread Yanbo Liang
Currently we do not expose the APIs to get the Bisecting KMeans tree structure, they are private in the ml.clustering package scope. But I think we should make a plan to expose these APIs like what we did for Decision Tree. Thanks Yanbo 2016-07-12 11:45 GMT-07:00 roni <roni.epi...@gmail.

Re: bisecting kmeans model tree

2016-07-12 Thread roni
Hi Spark/MLlib experts, Can anyone shine light on this? Thanks _R On Thu, Apr 21, 2016 at 12:46 PM, roni <roni.epi...@gmail.com> wrote: > Hi , > I want to get the bisecting kmeans tree structure to show a dendrogram on > the heatmap I am generating based on the hierarch

Working of Streaming Kmeans

2016-07-05 Thread Holden Karau
Hi Biplob, The current Streaming KMeans code only updates the model with data which comes in through training (e.g. trainOn); predictOn does not update the model. Cheers, Holden :) P.S. Traffic on the list might have been a bit slower right now because of Canada Day and the 4th of July weekend respectively
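
A minimal StreamingKMeans sketch of that split (the paths, k, dimension, and the keying of the test stream are assumptions): only trainOn moves the centers; predictOnValues just scores against whatever the current centers are.

```scala
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))
val training = ssc.textFileStream("/data/train").map(Vectors.parse)
val test = ssc.textFileStream("/data/test").map(s => (s, Vectors.parse(s)))

val model = new StreamingKMeans()
  .setK(3)
  .setDecayFactor(1.0)
  .setRandomCenters(2, 0.0) // dim = 2, initial weight = 0.0

model.trainOn(training)             // updates the model
model.predictOnValues(test).print() // reads the model, never updates it
ssc.start()
ssc.awaitTermination()
```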

Re: Working of Streaming Kmeans

2016-07-03 Thread Biplob Biswas
Hi, Can anyone please explain this? Thanks & Regards Biplob Biswas On Sat, Jul 2, 2016 at 4:48 PM, Biplob Biswas <revolutioni...@gmail.com> wrote: > Hi, > > I wanted to ask a very basic question about the working of Streaming > Kmeans. > > Does the model up

Working of Streaming Kmeans

2016-07-02 Thread Biplob Biswas
Hi, I wanted to ask a very basic question about the working of Streaming Kmeans. Does the model update only when training (i.e. the training dataset is used) or does it update on the PredictOnValues function as well for the test dataset? Thanks and Regards Biplob

bisecting kmeans model tree

2016-04-21 Thread roni
Hi, I want to get the bisecting kmeans tree structure to show a dendrogram on the heatmap I am generating based on the hierarchical clustering of data. How do I get that using MLlib? Thanks -Roni

bisecting kmeans tree

2016-04-20 Thread roni
Hi, I want to get the bisecting kmeans tree structure to show on the heatmap I am generating based on the hierarchical clustering of data. How do I get that using MLlib? Thanks -R

Re: Why KMeans with mllib is so slow ?

2016-03-14 Thread Priya Ch
Hi Xi Shen, Changing the initialization step from "kmeans||" to "random" decreased the execution time from 2 hrs to 6 min. However, by default the no. of runs is 1. If I try to set the number of runs to 10, then I again see an increase in job execution time. How to proceed on thi
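
For reference, a sketch of the two settings being discussed (k and the input RDD are assumptions). Each extra run trains a complete additional model, so runs = 10 costs roughly ten times runs = 1:

```scala
import org.apache.spark.mllib.clustering.KMeans

val kmeans = new KMeans()
  .setK(k)
  .setMaxIterations(20)
  .setInitializationMode(KMeans.RANDOM) // skips the costly k-means|| seeding
  .setRuns(1) // each additional run is a full extra training pass
              // (deprecated and ignored in later Spark versions)
val model = kmeans.run(vectorRdd)
```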

Re: Why KMeans with mllib is so slow ?

2016-03-12 Thread Xi Shen
Hi Chitturi, Please check out https://spark.apache.org/docs/1.0.1/api/java/org/apache/spark/mllib/clustering/KMeans.html#setInitializationSteps(int). I think it is caused by the initialization step. The "kmeans||" method does not initialize the dataset in parallel. If your dataset is large

Re: Why KMeans with mllib is so slow ?

2016-03-12 Thread Chitturi Padma

Re: Spark Mllib kmeans execution

2016-03-02 Thread Sonal Goyal
It will run distributed On Mar 2, 2016 3:00 PM, "Priya Ch" <learnings.chitt...@gmail.com> wrote: > Hi All, > > I am running k-means clustering algorithm. Now, when I am running the > algorithm as - > > val conf = new SparkConf > val sc = new SparkContext(co

Spark Mllib kmeans execution

2016-03-02 Thread Priya Ch
Hi All, I am running the k-means clustering algorithm. Now, when I am running the algorithm as - val conf = new SparkConf val sc = new SparkContext(conf) . . val kmeans = new KMeans() val model = kmeans.run(RDD[Vector]) . . . The 'kmeans' object gets created on the driver. Now does *kmeans.run()* get

Re: Slowness in Kmeans calculating fastSquaredDistance

2016-02-09 Thread Li Ming Tsai
Hi, It looks like Kmeans++ is slow (SPARK-3424 <https://issues.apache.org/jira/browse/SPARK-3424>) in the initialisation phase and is local to the driver, using 1 core only. If I use random, the job completed in 1.5 mins compared to 1 hr+. Should I move this to the dev list? Regards,

Re: Slowness in Kmeans calculating fastSquaredDistance

2016-02-06 Thread Li Ming Tsai
Friday, February 5, 2016 10:56 AM To: user@spark.apache.org Subject: Slowness in Kmeans calculating fastSquaredDistance Hi, I'm using INTEL MKL on Spark 1.6.0 which I built myself with the -Pnetlib-lgpl flag. I am using spark local[4] mode and I run it like this: # export LD_LIBRARY_PATH=/opt/int

Slowness in Kmeans calculating fastSquaredDistance

2016-02-04 Thread Li Ming Tsai
Hi, I'm using INTEL MKL on Spark 1.6.0 which I built myself with the -Pnetlib-lgpl flag. I am using spark local[4] mode and I run it like this: # export LD_LIBRARY_PATH=/opt/intel/lib/intel64:/opt/intel/mkl/lib/intel64 # bin/spark-shell ... I have also added the following to

Visualization of KMeans cluster in Spark

2016-01-28 Thread Yogesh Vyas
Hi, Is there any way of visualizing the KMeans clusters in Spark? Can we connect Plotly with Apache Spark in Java? Thanks, Yogesh

Re: Spark MLLib KMeans Performance on Amazon EC2 M3.2xlarge

2016-01-01 Thread Yanbo Liang
Hi Jia, I think the examples you provided are not very suitable to illustrate what the driver and executors do, because they do not show the internal implementation of the KMeans algorithm. You can refer to the source code of MLlib KMeans ( https://github.com/apache/spark/blob/master/mllib/src/main/scala

Re: Spark MLLib KMeans Performance on Amazon EC2 M3.2xlarge

2015-12-31 Thread Jia Zou
store the partitions that don't fit on disk and read them from there when > they are needed. > Actually, it's not necessary to set so large driver memory in your case, > because KMeans use low memory for driver if your k is not very large. > > Cheers > Yanbo > > 2015-12-30 22:20

Re: Spark MLLib KMeans Performance on Amazon EC2 M3.2xlarge

2015-12-30 Thread Yanbo Liang
driver memory in your case, because KMeans uses low memory for the driver if your k is not very large. Cheers Yanbo 2015-12-30 22:20 GMT+08:00 Jia Zou <jacqueline...@gmail.com>: > I am running Spark MLLib KMeans in one EC2 M3.2xlarge instance with 8 CPU > cores and 30GB memory. Executor m

Spark MLLib KMeans Performance on Amazon EC2 M3.2xlarge

2015-12-30 Thread Jia Zou
I am running Spark MLLib KMeans in one EC2 M3.2xlarge instance with 8 CPU cores and 30GB memory. Executor memory is set to 15GB, and driver memory is set to 15GB. The observation is that, when input data size is smaller than 15GB, the performance is quite stable. However, when input data becomes

Clustering KMeans error in 1.5.1

2015-10-16 Thread robin_up
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:745) -- Robin Li

Re: Distance metrics in KMeans

2015-09-26 Thread Robineast
There is a Spark Package that gives some alternative distance metrics, http://spark-packages.org/package/derrickburns/generalized-kmeans-clustering. Not used it myself. - Robin East Spark GraphX in Action Michael Malak and Robin East Manning Publications Co. http://www.manning.com/books

Re: Distance metrics in KMeans

2015-09-25 Thread sethah

Distance metrics in KMeans

2015-09-25 Thread bobtreacy
Is it possible to use other distance metrics than Euclidean (e.g. Tanimoto, Manhattan) with MLlib KMeans?

KMeans Model fails to run

2015-09-23 Thread Soong, Eddie
Hi, Why am I getting this error which prevents my KMeans clustering algorithm from working inside of Spark? I'm trying to run a sample Scala model found on the Databricks website on my Cloudera-Spark 1-node local VM. For completeness, the Scala program is as follows: Thx import

Using ML KMeans without hardcoded feature vector creation

2015-09-15 Thread Tóth Zoltán
Hi, I'm wondering if there is a concise way to run ML KMeans on a DataFrame if I have the features in multiple numeric columns. I.e. as in the Iris dataset: (a1=5.1, a2=3.5, a3=1.4, a4=0.2, id=u'id_1', label=u'Iris-setosa', binomial_label=1) I'd like to use KMeans without recreating the DataSet
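VectorAssembler is the usual answer here: it builds the features column from the numeric columns without hand-writing any Vectors.dense calls. A sketch using the Iris column names from the post (irisDf stands for the source DataFrame):

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler

val assembler = new VectorAssembler()
  .setInputCols(Array("a1", "a2", "a3", "a4"))
  .setOutputCol("features")
val assembled = assembler.transform(irisDf)
val model = new KMeans().setK(3).fit(assembled)
```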

Kmeans issues and hierarchical clustering

2015-08-28 Thread Robust_spark
Dear All, I am trying to cluster 350k english text phrases (each with 4-20 words) into 50k clusters with KMeans on a standalone system (8 cores, 16 GB). I am using the Kryo serializer with MEMORY_AND_DISK_SER set. Although I get clustering results with a lower number of features in HashingTF
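
A hedged sketch of the kind of text-clustering pipeline being described, in the DataFrame API (the column names, feature count, and phrasesDf are assumptions); setNumFeatures is the main memory-versus-collision knob at this scale:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

val tokenizer = new Tokenizer().setInputCol("phrase").setOutputCol("words")
val tf = new HashingTF()
  .setInputCol("words").setOutputCol("rawFeatures")
  .setNumFeatures(1 << 18) // fewer features = less memory, more collisions
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val kmeans = new KMeans().setK(50000).setMaxIter(20)

val model = new Pipeline()
  .setStages(Array(tokenizer, tf, idf, kmeans))
  .fit(phrasesDf) // phrasesDf: DataFrame with a "phrase" string column
```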

Re: mllib kmeans produce 1 large and many extremely small clusters

2015-08-11 Thread sooraj
Hi, The issue is very likely to be in the data or the transformations you apply, rather than anything to do with the Spark Kmeans API as such. I'd start debugging by doing a bit of exploratory analysis of the TFIDF vectors. That is, for instance, plot the distribution (histogram) of the TFIDF

mllib kmeans produce 1 large and many extremely small clusters

2015-08-09 Thread farhan
, 9: 2, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 10: 1, 12: 1, 13: 1, 14: 1, 15: 1, 16: 1, 17: 1, 18: 1, 19: 1}) Please Help ! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/mllib-kmeans-produce-1-large-and-many-extremely-small-clusters-tp24189.html

Re: Kmeans Labeled Point RDD

2015-07-20 Thread plazaster
Has there been any progress on this? I am in the same boat. I posted a similar question to Stack Exchange. http://stackoverflow.com/questions/31447141/spark-mllib-kmeans-from-dataframe-and-back-again

RE: Kmeans Labeled Point RDD

2015-07-20 Thread Mohammed Guller
I responded to your question on SO. Let me know if this what you wanted. http://stackoverflow.com/a/31528274/2336943 Mohammed -Original Message- From: plazaster [mailto:michaelplaz...@gmail.com] Sent: Sunday, July 19, 2015 11:38 PM To: user@spark.apache.org Subject: Re: Kmeans

[MLLib][Kmeans] KMeansModel.computeCost takes lot of time

2015-07-13 Thread Nirmal Fernando
Hi, For a fairly large dataset, 30MB, KMeansModel.computeCost takes a lot of time (16+ minutes). It spends most of the time in this task: org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:33) org.apache.spark.mllib.clustering.KMeansModel.computeCost(KMeansModel.scala:70) Can this be

Re: [MLLib][Kmeans] KMeansModel.computeCost takes lot of time

2015-07-13 Thread Nirmal Fernando
) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) Unknown Unknown 0/8 On Mon, Jul 13, 2015 at 11:44 PM, Burak Yavuz brk...@gmail.com wrote: Can you call repartition(8) or 16 on data.rdd(), before KMeans, and also, .cache()? something like, (I'm assuming you

Re: [MLLib][Kmeans] KMeansModel.computeCost takes lot of time

2015-07-13 Thread Nirmal Fernando
, Burak Yavuz brk...@gmail.com wrote: Can you call repartition(8) or 16 on data.rdd(), before KMeans, and also, .cache()? something like, (I'm assuming you are using Java): ``` JavaRDD<Vector> input = data.repartition(8).cache(); org.apache.spark.mllib.clustering.KMeans.train(input.rdd(), 3, 20

Re: [MLLib][Kmeans] KMeansModel.computeCost takes lot of time

2015-07-13 Thread Burak Yavuz
What are the other parameters? Are you just setting k=3? What about # of runs? How many partitions do you have? How many cores does your machine have? Thanks, Burak On Mon, Jul 13, 2015 at 10:57 AM, Nirmal Fernando nir...@wso2.com wrote: Hi Burak, k = 3 dimension = 785 features Spark 1.4

Re: [MLLib][Kmeans] KMeansModel.computeCost takes lot of time

2015-07-13 Thread Nirmal Fernando
I'm using: org.apache.spark.mllib.clustering.KMeans.train(data.rdd(), 3, 20); Cpu cores: 8 (using the default Spark conf, though). On partitions, I'm not sure how to find that. On Mon, Jul 13, 2015 at 11:30 PM, Burak Yavuz brk...@gmail.com wrote: What are the other parameters? Are you just
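
On finding the partition count: it can be read straight off the underlying RDD (a Scala sketch; data stands for the DataFrame being trained on):

```scala
// number of partitions of the RDD backing the DataFrame
println(data.rdd.partitions.length)
// Spark 1.6+ also offers: data.rdd.getNumPartitions
```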

Re: [MLLib][Kmeans] KMeansModel.computeCost takes lot of time

2015-07-13 Thread Burak Yavuz
Can you call repartition(8) or 16 on data.rdd(), before KMeans, and also, .cache()? something like, (I'm assuming you are using Java): ``` JavaRDD<Vector> input = data.repartition(8).cache(); org.apache.spark.mllib.clustering.KMeans.train(input.rdd(), 3, 20); ``` On Mon, Jul 13, 2015 at 11:10 AM

Re: [MLLib][Kmeans] KMeansModel.computeCost takes lot of time

2015-07-13 Thread Nirmal Fernando
Hi Burak, k = 3 dimension = 785 features Spark 1.4 On Mon, Jul 13, 2015 at 10:28 PM, Burak Yavuz brk...@gmail.com wrote: Hi, How are you running K-Means? What is your k? What is the dimension of your dataset (columns)? Which Spark version are you using? Thanks, Burak On Mon, Jul 13,

Re: [MLLib][Kmeans] KMeansModel.computeCost takes lot of time

2015-07-13 Thread Burak Yavuz
Hi, How are you running K-Means? What is your k? What is the dimension of your dataset (columns)? Which Spark version are you using? Thanks, Burak On Mon, Jul 13, 2015 at 2:53 AM, Nirmal Fernando nir...@wso2.com wrote: Hi, For a fairly large dataset, 30MB, KMeansModel.computeCost takes lot

Re: KMeans questions

2015-07-02 Thread Feynman Liang
SPARK-7879 https://issues.apache.org/jira/browse/SPARK-7879 seems to address your use case (running KMeans on a dataframe and having the results added as an additional column) On Wed, Jul 1, 2015 at 5:53 PM, Eric Friedman eric.d.fried...@gmail.com wrote: In preparing a DataFrame (spark 1.4

KMeans questions

2015-07-01 Thread Eric Friedman
In preparing a DataFrame (spark 1.4) to use with MLlib's kmeans.train method, is there a cleaner way to create the Vectors than this? data.map{r => Vectors.dense(r.getDouble(0), r.getDouble(3), r.getDouble(4), r.getDouble(5), r.getDouble(6))} Second, once I train the model and call predict on my

Re: kmeans broadcast

2015-06-29 Thread Himanshu Mehra
Hi Haviv, have you tried sc.broadcast(model)? The broadcast method is a member of the SparkContext class. Thanks Himanshu
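
A small sketch of that suggestion (model, points, and sc are assumed to be in scope): broadcasting the fitted KMeansModel so each executor deserializes one copy instead of one per task:

```scala
// model: a trained org.apache.spark.mllib.clustering.KMeansModel
val bcModel = sc.broadcast(model)
// points: RDD[Vector]; predict against the broadcast copy inside the closure
val predictions = points.map(v => (v, bcModel.value.predict(v)))
```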

MLIB-KMEANS: Py4JNetworkError: An error occurred while trying to connect to the Java server , on a huge data set

2015-06-18 Thread rogersjeffreyl
Hi All, I am trying to run KMeans clustering on a large data set with 12,000 points and 80,000 dimensions. I have a Spark cluster in EC2 standalone mode with 8 workers running on 2 slaves with 160 GB RAM and 40 vCPUs. *My Code is as Follows:* def convert_into_sparse_vector

Re: Restricting the number of iterations in Mllib Kmeans

2015-06-01 Thread Joseph Bradley
Hi Suman Meethu, Apologies---I was wrong about KMeans supporting an initial set of centroids! JIRA created: https://issues.apache.org/jira/browse/SPARK-8018 If you're interested in submitting a PR, please do! Thanks, Joseph On Mon, Jun 1, 2015 at 2:25 AM, MEETHU MATHEW meethu2...@yahoo.co.in

Re: spark mllib kmeans

2015-05-21 Thread Pa Rö
Rabarisoa jaon...@gmail.com wrote: take a look at this https://github.com/derrickburns/generalized-kmeans-clustering Best, Jao On Mon, May 11, 2015 at 3:55 PM, Driesprong, Fokko fo...@driesprong.frl wrote: Hi Paul, I would say that it should be possible, but you'll need

Kmeans Labeled Point RDD

2015-05-21 Thread anneywarlord

Re: Kmeans Labeled Point RDD

2015-05-21 Thread Krishna Sankar
After I cluster my data I would like to be able to identify which observations were grouped with each centroid. Thanks

Re: spark mllib kmeans

2015-05-19 Thread Xiangrui Meng
Just curious, what distance measure do you need? -Xiangrui On Mon, May 11, 2015 at 8:28 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: take a look at this https://github.com/derrickburns/generalized-kmeans-clustering Best, Jao On Mon, May 11, 2015 at 3:55 PM, Driesprong, Fokko fo

Re: question about customize kmeans distance measure

2015-05-19 Thread Xiangrui Meng
MLlib only supports Euclidean distance for k-means. You can find Bregman divergence support in Derrick's package: http://spark-packages.org/package/derrickburns/generalized-kmeans-clustering. Which distance measure do you want to use? -Xiangrui On Tue, May 12, 2015 at 7:23 PM, June zhuman.priv

Re: Restricting the number of iterations in Mllib Kmeans

2015-05-18 Thread Joseph Bradley
instead of maxIterations, which is sort of a bug in the example). If that does not cap the max iterations, then please report it as a bug. To specify the initial centroids, you will need to modify the DenseKMeans example code. Please see the KMeans API docs for details. Good luck, Joseph On Mon

Re: Restricting the number of iterations in Mllib Kmeans

2015-05-18 Thread MEETHU MATHEW
Hi, I think you can't supply an initial set of centroids to kmeans. Thanks & Regards, Meethu M On Friday, 15 May 2015 12:37 AM, Suman Somasundar suman.somasun...@oracle.com wrote:

Restricting the number of iterations in Mllib Kmeans

2015-05-14 Thread Suman Somasundar
Hi, I want to run a definite number of iterations in Kmeans. There is a command line argument to set maxIterations, but even if I set it to a number, Kmeans runs until the centroids converge. Is there a specific way to specify it on the command line? Also, I wanted to know if we can supply
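
Two relevant knobs, sketched below with assumed values: epsilon controls the early-convergence test, and newer releases grew a setInitialModel for supplying centroids (the SPARK-8018 work mentioned elsewhere in this archive, so treat its availability as version-dependent):

```scala
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors

val kmeans = new KMeans()
  .setK(3)
  .setMaxIterations(10)
  .setEpsilon(0.0) // with epsilon 0, it stops early only if centers stop moving entirely

// supplying initial centroids via an initial model
val seedCenters = new KMeansModel(Array(
  Vectors.dense(0.0, 0.0),
  Vectors.dense(1.0, 1.0),
  Vectors.dense(2.0, 2.0)))
val model = kmeans.setInitialModel(seedCenters).run(vectorRdd)
```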

question about customize kmeans distance measure

2015-05-12 Thread June
Dear list, I am new to spark, and I want to use the kmeans algorithm in mllib package. I am wondering whether it is possible to customize the distance measure used by kmeans, and how? Many thanks! June

spark mllib kmeans

2015-05-11 Thread Pa Rö
hi, is it possible to use a custom distance measure and another data type as vector? i want to cluster temporal geo data. best regards paul

Re: spark mllib kmeans

2015-05-11 Thread Driesprong, Fokko
Hi Paul, I would say that it should be possible, but you'll need a different distance measure which conforms to your coordinate system. 2015-05-11 14:59 GMT+02:00 Pa Rö paul.roewer1...@googlemail.com: hi, is it possible to use a custom distance measure and another data type as vector? i

MLib KMeans on large dataset issues

2015-04-29 Thread Sam Stoelinga
Hi Sparkers, I am trying to run MLib kmeans on a large dataset (50+ GB of data) and a large K, but I've encountered the following issues: - The Spark driver gets out of memory and dies because collect gets called as part of KMeans, which loads all data back to the driver's memory

Re: MLib KMeans on large dataset issues

2015-04-29 Thread Jeetendra Gangele
How are you passing the feature vector to K-means? Is it in 2-D space or a 1-D array? Did you try using Streaming Kmeans? Will you be able to paste code here? On 29 April 2015 at 17:23, Sam Stoelinga sammiest...@gmail.com wrote: Hi Sparkers, I am trying to run MLib kmeans on a large dataset (50+Gb

Re: MLib KMeans on large dataset issues

2015-04-29 Thread Sam Stoelinga
PM, Jeetendra Gangele gangele...@gmail.com wrote: How you are passing feature vector to K means? its in 2-D space of 1-D array? Did you try using Streaming Kmeans? will you be able to paste code here? On 29 April 2015 at 17:23, Sam Stoelinga sammiest...@gmail.com wrote: Hi Sparkers, I

Re: MLib KMeans on large dataset issues

2015-04-29 Thread Sam Stoelinga
...@gmail.com wrote: How you are passing feature vector to K means? its in 2-D space of 1-D array? Did you try using Streaming Kmeans? will you be able to paste code here? On 29 April 2015 at 17:23, Sam Stoelinga sammiest...@gmail.com wrote: Hi Sparkers, I am trying to run MLib kmeans

Re: KMeans takeSample jobs and RDD cached

2015-04-25 Thread Joseph Bradley

KMeans takeSample jobs and RDD cached

2015-04-25 Thread podioss
the web ui doesn't display all the RDDs involved in the computation. Thank you

Streaming Kmeans usage in java

2015-04-23 Thread Jeetendra Gangele
Does anyone have a sample example of how to use streaming k-means clustering with Java? I have seen some example usage in Scala. Can anybody point me to a Java example? regards jeetendra

Re: Mllib kmeans #iteration

2015-04-03 Thread amoners
Have you referred to the official kmeans documentation at https://spark.apache.org/docs/1.1.1/mllib-clustering.html ?

Re: kmeans|| in Spark is not real paralleled?

2015-04-03 Thread Xi Shen
, maxIterations). It uses the kmeans|| initialization algorithm, which is supposed to be a faster version of kmeans++ and give better results in general. But I observed that if the k is very large, the initialization step takes a long time. From the CPU utilization chart, it looks like only one thread

Re: Mllib kmeans #iteration

2015-04-02 Thread Joseph Bradley
Check out the Spark docs for that parameter: *maxIterations* http://spark.apache.org/docs/latest/mllib-clustering.html#k-means On Thu, Apr 2, 2015 at 4:42 AM, podioss grega...@hotmail.com wrote: Hello, i am running the Kmeans algorithm in cluster mode from Mllib and i was wondering if i could

Mllib kmeans #iteration

2015-04-02 Thread podioss
Hello, i am running the Kmeans algorithm in cluster mode from Mllib and i was wondering if i could run the algorithm with a fixed number of iterations in some way. Thanks

Re: kmeans|| in Spark is not real paralleled?

2015-03-30 Thread Xiangrui Meng
). It uses the kmeans|| initialization algorithm, which is supposed to be a faster version of kmeans++ and give better results in general. But I observed that if the k is very large, the initialization step takes a long time. From the CPU utilization chart, it looks like only one thread is working

Re: Why KMeans with mllib is so slow ?

2015-03-29 Thread Xi Shen
Hi Burak, Unfortunately, I am expected to do my work in an HDInsight environment which only supports Spark 1.2.0 with Microsoft's flavor. I cannot simply replace it with Spark 1.3. I think the problem I am observing is caused by the kmeans|| initialization step. I will open another thread to discuss

kmeans|| in Spark is not real paralleled?

2015-03-29 Thread Xi Shen
Hi, I have opened a couple of threads asking about a k-means performance problem in Spark. I think I made a little progress. Previously I used the simplest form, KMeans.train(rdd, k, maxIterations). It uses the kmeans|| initialization algorithm, which is supposed to be a faster version of kmeans

Re: Why KMeans with mllib is so slow ?

2015-03-28 Thread davidshen84
cluster. The cluster has 7 executors, each has 8 cores... If I set k=5000, which is the required value for my task, the job goes on forever... Thanks, David

Re: Why KMeans with mllib is so slow ?

2015-03-28 Thread Burak Yavuz
is the required value for my task, the job goes on forever... Thanks, David

Re: KMeans with large clusters Java Heap Space

2015-03-19 Thread mvsundaresan

Spark MLLib KMeans Top Terms

2015-03-19 Thread mvsundaresan
I'm trying to cluster short text messages using KMeans; after training the kmeans I want to get the top terms (5 - 10). How do I get those using clusterCenters? Full code is here: http://apache-spark-user-list.1001560.n3.nabble.com/KMeans-with-large-clusters-Java-Heap-Space-td21432.html
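
A sketch of one way to do it (assumes a vocab: Array[String] indexed the same way the term vectors were built; that mapping is lost with HashingTF, so this only works with a dictionary-based vectorizer):

```scala
// model: a trained KMeansModel; vocab(i) is the term at vector index i
val topTerms = model.clusterCenters.map { center =>
  center.toArray.zipWithIndex
    .sortBy { case (weight, _) => -weight } // largest weights first
    .take(10)
    .map { case (_, idx) => vocab(idx) }
}
topTerms.zipWithIndex.foreach { case (terms, i) =>
  println(s"cluster $i: ${terms.mkString(", ")}")
}
```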

Re: MLlib/kmeans newbie question(s)

2015-03-09 Thread Xiangrui Meng
correct my thinking if it's wrong): This code turns each tweet into a vector, randomly picks some clusters, then runs kmeans to group the tweets (at a really high level, the clusters, i assume, would be common topics). As such, when it checks each tweet to see if models.predict == 1, different
