Re: how can I make the sliding window in Spark Streaming driven by data timestamp instead of absolute time

2014-10-17 Thread st553
I believe I have a similar question to this. I would like to process an
offline event stream for testing/debugging. The stream is stored in a CSV
file where each row has a timestamp. I would like to feed this file into
Spark Streaming and have the concept of time be driven by the timestamp
column rather than the wall clock. Has anyone done this before? I haven't
seen anything in the docs and would like to know whether this is possible
in Spark Streaming; a rough sketch of what I mean is below. Thanks!
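
For concreteness, here is a minimal sketch of the kind of thing I'd like to
express (the file name, column layout, and window size are made-up
assumptions, and it uses tumbling rather than sliding windows for
simplicity). It sidesteps Spark Streaming entirely and buckets rows by
their own timestamps using plain RDD operations:

import org.apache.spark.SparkContext._

// assumes each CSV row starts with an epoch-millisecond timestamp
// and that an existing SparkContext `sc` is in scope
val windowMs = 60 * 1000L                                   // assumed 1-minute windows
val events = sc.textFile("events.csv").map { line =>        // hypothetical path
  val fields = line.split(",")
  (fields(0).toLong, fields.drop(1).mkString(","))          // (event timestamp, payload)
}
val byWindow = events
  .map { case (ts, payload) => (ts / windowMs, payload) }   // bucket by event time
  .groupByKey()                                             // one group per time window
byWindow.sortByKey().collect().foreach { case (window, batch) =>
  println(s"window $window has ${batch.size} events")       // stand-in for real processing
}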






Re: Spark Streaming

2014-10-14 Thread st553
Have you solved this issue? I'm also wondering how to stream an existing
file; a sketch of what I've been trying is below.
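
This is roughly what I've been experimenting with (the path and batch
interval are assumptions). As far as I can tell, textFileStream only picks
up files that appear in the monitored directory after the context starts,
so the existing file has to be copied in once streaming has begun:

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))              // assumed 10-second batches
val lines = ssc.textFileStream("hdfs:///tmp/stream-input")   // hypothetical directory to monitor
lines.foreachRDD(rdd => println(s"got a batch of ${rdd.count()} lines"))
ssc.start()
// then, from outside the job, copy the existing file into /tmp/stream-input
ssc.awaitTermination()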






Re: How could I start new spark cluster with hadoop2.0.2

2014-10-08 Thread st553
Hi,

Were you able to figure out how to choose a specific version? I'm having
the same issue.

Thanks.






Re: How to run kmeans after pca?

2014-09-30 Thread st553
Thanks for your response Burak it was very helpful.

I am noticing that if I run PCA before KMeans, the KMeans algorithm
actually takes longer to run than if I had just run KMeans without PCA. I
was hoping that running PCA first would speed up the KMeans algorithm.

I have followed the steps you've outlined, but I'm wondering if I need to
cache/persist the RDD[Vector] rows of the RowMatrix returned after
multiplying. Something like:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val newData: RowMatrix = data.multiply(bcPrincipalComponents.value) // project onto the principal components
val cachedRows = newData.rows.persist()                             // cache before the iterative KMeans
val model = KMeans.train(cachedRows, k, numIterations)              // k and numIterations defined elsewhere
cachedRows.unpersist()

It doesn't seem intuitive to me that a lower-dimensional version of my data
set would take longer for KMeans... unless I'm missing something?

Thanks!







Re: spark1.0 principal component analysis

2014-09-23 Thread st553
sowen wrote
> it seems that the singular values from the SVD aren't returned, so I don't
> know that you can access this directly

It's not clear to me why these aren't returned. The S matrix would be
useful for determining a reasonable value for k; a sketch of what I mean is
below.
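
As far as I can tell, RowMatrix.computeSVD does expose the singular values
even though computePrincipalComponents alone does not, so something like
this sketch (the input RDD and the upper bound on k are assumptions) could
be used to eyeball a reasonable k:

import org.apache.spark.mllib.linalg.distributed.RowMatrix

val mat = new RowMatrix(data)                     // data: RDD[Vector], assumed to exist
val svd = mat.computeSVD(20, computeU = false)    // 20 is an arbitrary upper bound on k
val singularValues = svd.s                        // diagonal of S, in descending order
singularValues.toArray.zipWithIndex.foreach { case (sv, i) =>
  println(s"singular value $i = $sv")
}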






Re: OutOfMemoryError with basic kmeans

2014-09-17 Thread st553
Not sure if you resolved this, but I had a similar issue and resolved it.
In my case, the problem was that the ids of my items were of type Long and
could be very large (even though there are only a small number of distinct
ids... maybe a few hundred of them). KMeans will create a dense vector for
the cluster centers, so it's important that the dimensionality not be huge.
I had to map my ids to a smaller space and it worked fine. The mapping was
something like...
1001223412 -> 1
1006591779 -> 2
1011232423 -> 3
...
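
Roughly, the remapping was along these lines (the RDD name and pair layout
are made up for the example):

import org.apache.spark.SparkContext._

// items: RDD[(Long, Double)] of (itemId, value) pairs -- hypothetical shape
val idToIndex = items.map(_._1).distinct().zipWithIndex().collectAsMap()
val bcIdToIndex = sc.broadcast(idToIndex)
val remapped = items.map { case (id, value) =>
  (bcIdToIndex.value(id).toInt, value)            // small dense index instead of the raw id
}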






How to run kmeans after pca?

2014-09-17 Thread st553
I would like to reduce the dimensionality of my data before running kmeans.
The problem I'm having is that both RowMatrix.computePrincipalComponents()
and RowMatrix.computeSVD() return local matrices, whereas KMeans.train()
requires an RDD[Vector]. Does MLlib provide a way to do this conversion?
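
The closest thing I can see in the API (a sketch; k and the KMeans
parameters are assumptions) is to keep the principal components as a local
matrix and project the original RowMatrix with multiply(), whose rows field
is the RDD[Vector] that KMeans wants:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val mat = new RowMatrix(data)                         // data: RDD[Vector], assumed
val pc = mat.computePrincipalComponents(k)            // local matrix of the top k components
val projected = mat.multiply(pc)                      // RowMatrix in the reduced k-dimensional space
val model = KMeans.train(projected.rows, numClusters, numIterations)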






Re: Categorical Features for K-Means Clustering

2014-09-16 Thread st553
Does MLlib provide utility functions to do this kind of encoding?
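
If there isn't one, this hand-rolled one-hot encoding is roughly what I had
in mind (the RDD name is made up, and I'm assuming a single categorical
column):

import org.apache.spark.SparkContext._
import org.apache.spark.mllib.linalg.Vectors

// categories: RDD[String] of raw categorical values -- hypothetical
val categoryIndex = categories.distinct().zipWithIndex().collectAsMap()
val bcIndex = sc.broadcast(categoryIndex)
val encoded = categories.map { c =>
  Vectors.sparse(bcIndex.value.size, Seq((bcIndex.value(c).toInt, 1.0)))
}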


