Re: [MLlib] kmeans random initialization, same seed every time

2017-03-14 Thread Julian Keppel
I'm sorry, I left out some important information. I use Spark version 2.0.2 with Scala 2.11.8. 2017-03-14 13:44 GMT+01:00 Julian Keppel <juliankeppel1...@gmail.com>: > Hi everybody, > > I make some experiments with the Spark kmeans implementation of the new > DataFrame-API. I

[MLlib] kmeans random initialization, same seed every time

2017-03-14 Thread Julian Keppel
Hi everybody, I am running some experiments with the Spark k-means implementation of the new DataFrame API. I compare clustering results of different runs with different parameters. I noticed that for random initialization mode, the seed value is the same every time. How is it calculated? In my

[Spark DataFrames/Streaming]: Bad performance with window function in streaming job

2017-01-16 Thread Julian Keppel
Hi, I use Spark 2.0.2 and want to do the following: I extract features in a streaming job and then apply the records to a k-means model. Some of the features are simple ones that are calculated directly from the record. But I also have more complex features which depend on records from a

Re: Kafka direct approach, App UI shows wrong input rate

2016-11-22 Thread Julian Keppel
On Fri, Nov 18, 2016 at 4:38 AM, Julian Keppel > <juliankeppel1...@gmail.com> wrote: > > Hello, > > > > I use Spark 2.0.2 with Kafka integration 0-8. The Kafka version is > 0.10.0.1 > > (Scala 2.11). I read data from Kafka with the direct approach. The > compl

Re: using StreamingKMeans

2016-11-21 Thread Julian Keppel
I am currently doing research on anomaly detection with machine learning methods, and I am also doing k-means clustering in an offline learning setting. In further work we want to compare the two paradigms of offline and online learning. I would like to share some thoughts on this discussion.

Kafka direct approach, App UI shows wrong input rate

2016-11-18 Thread Julian Keppel
Hello, I use Spark 2.0.2 with Kafka integration 0-8. The Kafka version is 0.10.0.1 (Scala 2.11). I read data from Kafka with the direct approach. The complete infrastructure runs on Google Container Engine. I wonder why the corresponding application UI says the input rate is zero records per

Re: Want to test spark-sql-kafka but get unresolved dependency error

2016-10-14 Thread Julian Keppel
Okay, thank you! Can you say when this feature will be released? 2016-10-13 16:29 GMT+02:00 Cody Koeninger : > As Sean said, it's unreleased. If you want to try it out, build spark > > http://spark.apache.org/docs/latest/building-spark.html > > The easiest way to include

Re: Sharing object/state across transformations

2015-12-06 Thread Julian Keppel
Yes, but what they do is only add new elements to a state which is passed as a parameter. But my problem is that my "counter" (the HyperLogLog object) comes from outside and is not passed to the function. So I have to track the state of this "external" HLL object across the whole lifecycle of