Re: how can I make the sliding window in Spark Streaming driven by data timestamp instead of absolute time

2014-10-17 Thread st553
I believe I have a similar question to this. I would like to process an offline event stream for testing/debugging. The stream is stored in a CSV file where each row in the file has a timestamp. I would like to feed this file into Spark Streaming and have the concept of time be driven by the
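The idea of windows driven by each record's own timestamp rather than the wall clock can be sketched without Spark at all. Below is a minimal pure-Python illustration (not Spark Streaming code); the event tuples, window length, and slide interval are made-up assumptions for the example:

```python
from collections import defaultdict

# Hypothetical event rows: (timestamp_in_seconds, value). In the real use
# case these would be parsed from the CSV file, one row per event.
events = [(0, "a"), (2, "b"), (5, "c"), (7, "d"), (11, "e")]

def sliding_windows(events, window=6, slide=3):
    """Assign each event to every window it falls in, keyed by window start.

    Window starts are multiples of `slide`; a window covers
    [start, start + window). Time comes from the event's own timestamp,
    never from the wall clock, so a stored file replays deterministically.
    """
    windows = defaultdict(list)
    for ts, value in events:
        # earliest window start whose range [start, start + window) covers ts
        first = ((ts - window) // slide + 1) * slide
        start = max(0, first)
        while start <= ts:
            windows[start].append(value)
            start += slide
    return dict(windows)
```

Replaying the file then just means iterating the rows in timestamp order; no sleeping or batch-interval clock is involved.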

Re: Spark Streaming

2014-10-14 Thread st553
Have you solved this issue? I'm also wondering how to stream an existing file. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-tp14306p16406.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: How could I start new spark cluster with hadoop2.0.2

2014-10-08 Thread st553
Hi, were you able to figure out how to choose a specific version? I'm having the same issue. Thanks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-could-I-start-new-spark-cluster-with-hadoop2-0-2-tp10450p15939.html

Re: How to run kmeans after pca?

2014-09-30 Thread st553
Thanks for your response Burak, it was very helpful. I am noticing that if I run PCA before KMeans, the KMeans step actually takes longer than if I had just run KMeans without PCA. I was hoping that running PCA first would speed up KMeans. I have
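One way to see why this can happen: PCA pays off only when its own cost (dominated by building the covariance/Gram matrix) is smaller than what it saves in the Lloyd iterations. Below is a crude operation-count model, an assumption for illustration only, not a measurement of MLlib's actual implementation:

```python
def kmeans_cost(n, dims, clusters, iters):
    # rough op count for Lloyd's algorithm: every iteration compares
    # every point against every center in `dims` dimensions
    return n * dims * clusters * iters

def pca_then_kmeans_cost(n, d, k, clusters, iters):
    covariance = n * d * d   # building the d x d covariance matrix
    project = n * d * k      # projecting n points from d down to k dims
    return covariance + project + kmeans_cost(n, k, clusters, iters)
```

With few clusters and few iterations the d*d covariance term dominates and PCA-first is slower, which matches the behavior described above; with many clusters and iterations the reduced-dimension iterations win.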

Re: spark1.0 principal component analysis

2014-09-23 Thread st553
sowen wrote: "it seems that the singular values from the SVD aren't returned, so I don't know that you can access this directly". It's not clear to me why these aren't returned. The S matrix would be useful to determine a reasonable value for K.
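To illustrate why the singular values matter for choosing K: the fraction of variance captured by the top components is a ratio of squared singular values. A tiny pure-Python sketch (no Spark; restricted to a 2x2 matrix so the eigenvalues of A^T A have a closed form):

```python
import math

def singular_values_2x2(a):
    """Singular values of a 2x2 matrix, via the eigenvalues of A^T A."""
    (a11, a12), (a21, a22) = a
    # entries of the symmetric Gram matrix G = A^T A
    g11 = a11 * a11 + a21 * a21
    g12 = a11 * a12 + a21 * a22
    g22 = a12 * a12 + a22 * a22
    tr, det = g11 + g22, g11 * g22 - g12 * g12
    disc = math.sqrt(max(tr * tr - 4 * det, 0.0))
    eig_hi, eig_lo = (tr + disc) / 2, (tr - disc) / 2
    # singular values are the square roots of G's eigenvalues
    return math.sqrt(eig_hi), math.sqrt(max(eig_lo, 0.0))
```

For example, for diag(3, 1) the singular values are 3 and 1, so the first component explains 9 / (9 + 1) = 90% of the variance; one would keep adding components until that cumulative ratio is acceptable.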

Re: OutOfMemoryError with basic kmeans

2014-09-17 Thread st553
Not sure if you resolved this, but I had a similar issue and resolved it. In my case, the problem was that the ids of my items were of type Long and could be very large (even though there are only a small number of distinct ids... maybe a few hundred of them). KMeans will create a dense vector for the
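The fix described above amounts to remapping the sparse, huge ids onto a small contiguous index range before building vectors, so a dense vector needs only as many slots as there are distinct ids rather than max(id) + 1. A minimal sketch of that remapping (plain Python, names are my own, not MLlib API):

```python
def remap_ids(rows):
    """Map arbitrary (possibly huge) ids to contiguous indices 0..n-1.

    rows: iterable of (item_id, value) pairs. Returns (index_of, remapped)
    where index_of maps each distinct original id to its new small index,
    and remapped carries the same values keyed by the new indices.
    """
    index_of = {}
    remapped = []
    for item_id, value in rows:
        # assign the next free index the first time an id is seen
        idx = index_of.setdefault(item_id, len(index_of))
        remapped.append((idx, value))
    return index_of, remapped
```

With a few hundred distinct ids, the dense vectors handed to KMeans stay a few hundred elements long no matter how large the raw Long ids are.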

How to run kmeans after pca?

2014-09-17 Thread st553
I would like to reduce the dimensionality of my data before running kmeans. The problem I'm having is that both RowMatrix.computePrincipalComponents() and RowMatrix.computeSVD() return a DenseMatrix whereas KMeans.train() requires an RDD[Vector]. Does MLlib provide a way to do this conversion?
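The conversion being asked about is just a projection: multiply each data row by the d x k principal-components matrix to get a k-dimensional vector per row. A pure-Python sketch of that step (no Spark; in MLlib one would map this over the RDD's rows):

```python
def project(rows, components):
    """Project each data row onto the principal components.

    rows: list of length-d vectors. components: d x k matrix given as a
    list of d rows of length k, where column j is the j-th principal
    component. Returns the rows expressed in the reduced k-dim space.
    """
    d, k = len(components), len(components[0])
    out = []
    for row in rows:
        # reduced[j] = dot(row, column j of the components matrix)
        out.append([sum(row[i] * components[i][j] for i in range(d))
                    for j in range(k)])
    return out
```

The projected rows are ordinary length-k vectors, which is exactly the shape a clustering step like KMeans expects.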

Re: Categorical Features for K-Means Clustering

2014-09-16 Thread st553
Does MLlib provide utility functions to do this kind of encoding? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Categorical-Features-for-K-Means-Clustering-tp9416p14394.html
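The encoding in question (one-hot / 1-of-k, so that Euclidean distance treats all categories as equidistant) is simple to write by hand if no utility exists. A minimal pure-Python sketch, with my own function name:

```python
def one_hot(values):
    """One-hot encode a categorical column for use with KMeans.

    Returns (categories, encoded): categories is the sorted list of
    distinct values, and each encoded row is a 0/1 vector with exactly
    one 1, marking that row's category.
    """
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    encoded = [[1.0 if index[v] == i else 0.0
                for i in range(len(categories))]
               for v in values]
    return categories, encoded
```

Each encoded column can then be concatenated with the numeric features before clustering; every pair of distinct categories ends up the same Euclidean distance apart.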