Re: [Dev] [ML] Spark K-means clustering on KDD cup 99 dataset

2015-08-25 Thread Ashen Weerathunga
@Nirmal, okay, I'll arrange it today. @Mahesan Thanks for the suggestion. Yes, 100 must be too high for some cases. I thought that within 100 iterations it would most probably converge to stable clusters. That's why I put 100. Yes, for cases like k = 100 it might not be enough. Thanks, and I'll try with d

Re: [Dev] [ML] Spark K-means clustering on KDD cup 99 dataset

2015-08-25 Thread Nirmal Fernando
@Ashen let's have a code review today, if it's possible. @Srinath Forgot to mention that I've already given some feedback to Ashen, on how he could use Spark transformations effectively in his code. On Tue, Aug 25, 2015 at 4:33 PM, Ashen Weerathunga wrote: > Okay sure. > > On Tue, Aug 25, 2015

Re: [Dev] [ML] Spark K-means clustering on KDD cup 99 dataset

2015-08-25 Thread Sinnathamby Mahesan
Hi Ashen, Thank you for sharing the results. When I looked at the last column (anomaly data %), the best value, 99.04%, results for 3 clusters with 100 iterations and the worst case (28.12%) for 100 clusters with 100 iterations. This would happen as k increases (with a fixed number of iterations)

Re: [Dev] [ML] Spark K-means clustering on KDD cup 99 dataset

2015-08-25 Thread Ashen Weerathunga
Okay sure. On Tue, Aug 25, 2015 at 3:55 PM, Nirmal Fernando wrote: > Sure. @Ashen, can you please arrange one? > > On Tue, Aug 25, 2015 at 2:35 PM, Srinath Perera wrote: > >> Nirmal, Seshika, shall we do a code review? This code should go into ML >> after UI part is done. >> >> Thanks >> Srinat

Re: [Dev] [ML] Spark K-means clustering on KDD cup 99 dataset

2015-08-25 Thread Nirmal Fernando
Sure. @Ashen, can you please arrange one? On Tue, Aug 25, 2015 at 2:35 PM, Srinath Perera wrote: > Nirmal, Seshika, shall we do a code review? This code should go into ML > after UI part is done. > > Thanks > Srinath > > On Tue, Aug 25, 2015 at 2:20 PM, Ashen Weerathunga wrote: > >> Hi all, >>

Re: [Dev] [ML] Spark K-means clustering on KDD cup 99 dataset

2015-08-25 Thread Srinath Perera
Nirmal, Seshika, shall we do a code review? This code should go into ML after UI part is done. Thanks Srinath On Tue, Aug 25, 2015 at 2:20 PM, Ashen Weerathunga wrote: > Hi all, > > This is the source code of the project. > https://github.com/ashensw/Spark-KMeans-fraud-detection > > Best Regard

Re: [Dev] [ML] Spark K-means clustering on KDD cup 99 dataset

2015-08-25 Thread Ashen Weerathunga
Hi all, This is the source code of the project. https://github.com/ashensw/Spark-KMeans-fraud-detection Best Regards, Ashen On Tue, Aug 25, 2015 at 2:00 PM, Ashen Weerathunga wrote: > Thanks all for the suggestions, > > There are few assumptions I have made, > >- Clusters are uniform >

Re: [Dev] [ML] Spark K-means clustering on KDD cup 99 dataset

2015-08-25 Thread Ashen Weerathunga
Thanks all for the suggestions, There are a few assumptions I have made:
- Clusters are uniform
- Fraud data will always be outliers to the normal clusters
- Clusters do not intersect with each other
- I have given the number of iterations as 100. So I assume that 100 iterations wil
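The "fraud data will be outliers to the normal clusters" assumption above can be illustrated with a minimal sketch. This is not the code from the linked repository: the centroids, the threshold value, and the function names are hypothetical, and plain Python is used instead of Spark so that the idea stays self-contained. A point is flagged when its distance to the nearest cluster centre exceeds a chosen threshold.

```python
import math

def nearest_centroid_distance(point, centroids):
    """Return the Euclidean distance from `point` to its closest centroid."""
    return min(math.dist(point, c) for c in centroids)

def is_outlier(point, centroids, threshold):
    """Flag `point` as an outlier if it lies farther than `threshold`
    from every centroid (hypothetical rule, for illustration only)."""
    return nearest_centroid_distance(point, centroids) > threshold

# Hypothetical centroids of two "normal" clusters and a hand-picked threshold.
centroids = [(0.0, 0.0), (10.0, 10.0)]
threshold = 3.0

print(is_outlier((0.5, 1.0), centroids, threshold))  # near the first centroid
print(is_outlier((5.0, 5.0), centroids, threshold))  # far from both centroids
```

In practice the threshold would have to be derived from the training data (for example from the distribution of distances of normal points), which this sketch does not attempt.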

Re: [Dev] [ML] Spark K-means clustering on KDD cup 99 dataset

2015-08-25 Thread Maheshakya Wijewardena
Is there any particular reason why you are putting aside 65% of anomalous data at the evaluation? Since there is an obvious imbalance when the numbers of normal and abnormal cases are taken into account, you will get greater accuracy at the evaluation because a model tends to produce more accurate
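The imbalance concern raised above can be made concrete with a small sketch. The counts are invented for illustration and are not results from this thread: with 990 normal and 10 anomalous records, a model that labels everything as normal still scores high on accuracy while detecting no anomalies at all, which is why a per-class measure such as recall is worth reporting alongside accuracy.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all records that were labelled correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

def recall(tp, fn):
    """Fraction of actual anomalies that were detected."""
    return tp / (tp + fn)

# Hypothetical confusion counts: the model predicts "normal" for every record.
tp, fn = 0, 10    # anomalies caught / anomalies missed
tn, fp = 990, 0   # normals kept / normals wrongly flagged

print("accuracy:", accuracy(tp, tn, fp, fn))
print("anomaly recall:", recall(tp, fn))
```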

Re: [Dev] [ML] Spark K-means clustering on KDD cup 99 dataset

2015-08-25 Thread CD Athuraliya
Hi Ashen, It would be better if you can add the assumptions you make in this process (uniform clusters etc). It will make the process more clear IMO. Regards, CD On Tue, Aug 25, 2015 at 11:39 AM, Nirmal Fernando wrote: > Can we see the code too? > > On Tue, Aug 25, 2015 at 11:36 AM, Ashen Weer

Re: [Dev] [ML] Spark K-means clustering on KDD cup 99 dataset

2015-08-25 Thread Nirmal Fernando
Can we see the code too? On Tue, Aug 25, 2015 at 11:36 AM, Ashen Weerathunga wrote: > Hi all, > > I am currently working on fraud detection project. I was able to cluster > the KDD cup 99 network anomaly detection dataset using apache spark k means > algorithm. So far I was able to achieve 99% a

[Dev] [ML] Spark K-means clustering on KDD cup 99 dataset

2015-08-25 Thread Ashen Weerathunga
Hi all, I am currently working on a fraud detection project. I was able to cluster the KDD cup 99 network anomaly detection dataset using the Apache Spark K-means algorithm. So far I was able to achieve a 99% accuracy rate on this dataset. The steps I have followed during the process are mentioned below.