Re: how can I make the sliding window in Spark Streaming driven by data timestamp instead of absolute time
I believe I have a similar question to this. I would like to process an offline event stream for testing/debugging. The stream is stored in a CSV file, where each row has a timestamp. I would like to feed this file into Spark Streaming and have the notion of time be driven by the timestamp column rather than the wall clock. Has anyone done this before? I haven't seen anything in the docs, and I'd like to know whether this is possible in Spark Streaming. Thanks!
-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/how-can-I-make-the-sliding-window-in-Spark-Streaming-driven-by-data-timestamp-instead-of-absolute-tie-tp1755p16704.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
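One common workaround (a sketch only, assuming Spark 1.x DStreams; the file name events.csv and the timestamp-first column layout are hypothetical) is to pre-bucket the CSV rows by their timestamp column and replay one bucket per batch through queueStream, so batch contents follow the data's own clock instead of arrival time:

```scala
import scala.collection.mutable
import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sc = new SparkContext("local[2]", "replay")
val ssc = new StreamingContext(sc, Seconds(1))

// Parse the CSV and bucket rows by their timestamp column (assumed first).
val rows = sc.textFile("events.csv").map(_.split(","))
val byTime = rows.groupBy(_(0)).sortByKey().collect()

// One queued RDD per distinct timestamp; with oneAtATime = true each
// streaming batch drains exactly one bucket, in timestamp order.
val queue = mutable.Queue(byTime.map { case (_, rs) => sc.parallelize(rs.toSeq) }: _*)
val stream = ssc.queueStream(queue, oneAtATime = true)

stream.count().print()
ssc.start()
```

Note the caveat: window and slide durations are still measured in wall-clock batch intervals, so this only approximates data-driven time by making each batch correspond to one logical timestamp.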
Re: Spark Streaming
Have you solved this issue? I'm also wondering how to stream an existing file.
-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-tp14306p16406.html
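For streaming an existing file, one approach (a minimal sketch, assuming Spark 1.x; the paths are hypothetical) is to use textFileStream. The catch is that it only picks up files that appear in the watched directory after the context starts, so you start the context first and then move the file in:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sc = new SparkContext("local[2]", "fileStream")
val ssc = new StreamingContext(sc, Seconds(5))

// Watch a directory; new files landing here become batch input.
val lines = ssc.textFileStream("/tmp/watched")
lines.print()
ssc.start()
// Then, from another shell, move the existing file into the directory:
//   mv existing.csv /tmp/watched/
// (an atomic move on the same filesystem avoids picking up a half-copied file)
```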
Re: How could I start new spark cluster with hadoop2.0.2
Hi, were you able to figure out how to choose a specific version? I'm having the same issue. Thanks.
-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-could-I-start-new-spark-cluster-with-hadoop2-0-2-tp10450p15939.html
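For what it's worth, there are two separate places a Hadoop version gets chosen; the commands below are a sketch rather than a verified recipe (the cluster name and version string are examples, and the usual spark-ec2 key/identity options are omitted):

```
# 1. Building Spark from source against a specific Hadoop version,
#    via the hadoop.version Maven property from Spark's build guide:
mvn -Dhadoop.version=2.0.2-alpha -DskipTests clean package

# 2. Launching an EC2 cluster with the spark-ec2 script, which only
#    exposes a coarse --hadoop-major-version flag (1, 2, or yarn):
./spark-ec2 --hadoop-major-version=2 launch my-cluster
```

So if you need an exact minor version like 2.0.2 on the cluster, building Spark yourself against it is likely the more reliable route than the spark-ec2 flag.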
Re: How to run kmeans after pca?
Thanks for your response, Burak; it was very helpful. I am noticing that if I run PCA before KMeans, KMeans actually takes longer to run than if I had just run it without PCA. I was hoping that using PCA first would speed KMeans up. I have followed the steps you've outlined, but I'm wondering if I need to cache/persist the RDD[Vector] rows of the RowMatrix returned after multiplying. Something like:

val newData: RowMatrix = data.multiply(bcPrincipalComponents.value)
val cachedRows = newData.rows.persist()
KMeans.train(cachedRows, k, numIterations) // k and numIterations chosen elsewhere
cachedRows.unpersist()

It doesn't seem intuitive to me that a lower-dimensional version of my data set would take longer for KMeans... unless I'm missing something? Thanks!
-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-run-kmeans-after-pca-tp14473p15409.html
Re: spark1.0 principal component analysis
sowen wrote: "it seems that the singular values from the SVD aren't returned, so I don't know that you can access this directly". It's not clear to me why these aren't returned; the S matrix would be useful for determining a reasonable value for k.
-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark1-0-principal-component-analysis-tp9249p14919.html
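One way around this (a sketch, assuming Spark 1.x MLlib, where `mat` is an existing RowMatrix) is to call RowMatrix.computeSVD directly instead of computePrincipalComponents: its result does expose the singular values, and those alone are enough to pick k by cumulative explained variance:

```scala
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val svd = mat.computeSVD(mat.numCols().toInt, computeU = false)
val s = svd.s.toArray                          // singular values, descending
val total = s.map(x => x * x).sum              // variance is proportional to s^2
val explained = s.map(x => x * x / total).scanLeft(0.0)(_ + _).tail
// smallest k whose cumulative explained variance crosses a threshold:
val k = explained.indexWhere(_ >= 0.95) + 1
```

The 0.95 threshold is just an illustrative choice, not anything prescribed by MLlib.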
Re: OutOfMemoryError with basic kmeans
Not sure if you resolved this, but I had a similar issue and fixed it. In my case, the problem was that the ids of my items were of type Long and could be very large (even though there are only a small number of distinct ids... maybe a few hundred of them). KMeans will create a dense vector for the cluster centers, so it's important that the dimensionality not be huge. I had to map my ids to a smaller space and it worked fine. The mapping was something like:

1001223412 -> 1
1006591779 -> 2
1011232423 -> 3
...

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/OutOfMemoryError-with-basic-kmeans-tp1651p14472.html
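The remapping described above can be sketched like this (assuming the distinct ids fit in driver memory; `rawIds` is a hypothetical RDD[Long] of the original ids):

```scala
// Collapse a sparse set of large Long ids into a dense 0-based index space.
val idToIndex: Map[Long, Int] =
  rawIds.distinct().collect().sorted.zipWithIndex.toMap

// e.g. Map(1001223412L -> 0, 1006591779L -> 1, 1011232423L -> 2)
// Use idToIndex(id) as the vector dimension instead of the raw id, so the
// dense center vectors KMeans allocates stay a few hundred wide instead of
// a billion wide.
```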
How to run kmeans after pca?
I would like to reduce the dimensionality of my data before running KMeans. The problem I'm having is that both RowMatrix.computePrincipalComponents() and RowMatrix.computeSVD() return a DenseMatrix, whereas KMeans.train() requires an RDD[Vector]. Does MLlib provide a way to do this conversion?
-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-run-kmeans-after-pca-tp14473.html
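A sketch of the conversion, assuming Spark 1.x MLlib: the trick is that RowMatrix.multiply(Matrix) returns another RowMatrix, and its `rows` field is exactly the RDD[Vector] that KMeans.train wants. Here `data` is assumed to be an existing RowMatrix, and the numbers of components, clusters, and iterations are placeholders:

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val pc = data.computePrincipalComponents(10)   // DenseMatrix of top components
val projected: RowMatrix = data.multiply(pc)   // rows now live in PCA space
val model = KMeans.train(projected.rows.cache(), 5, 20)
```

Caching `projected.rows` matters because KMeans iterates over the data many times, and without it the projection would be recomputed on each pass.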
Re: Categorical Features for K-Means Clustering
Does MLlib provide utility functions to do this kind of encoding?
-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Categorical-Features-for-K-Means-Clustering-tp9416p14394.html
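To my knowledge MLlib of this era has no built-in encoder for this (spark.ml's OneHotEncoder arrives in a later release), but a one-hot encoding is easy to build by hand; a sketch, with hypothetical category names:

```scala
val categories = Seq("red", "green", "blue")
val index = categories.zipWithIndex.toMap

// Map a categorical value to a one-hot Array[Double].
def oneHot(value: String): Array[Double] = {
  val v = Array.fill(categories.size)(0.0)
  v(index(value)) = 1.0
  v
}

// oneHot("green") gives Array(0.0, 1.0, 0.0). Concatenate these with the
// numeric features before building Vectors for KMeans: Euclidean distance
// on raw category indices would impose a bogus ordering on the categories.
```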