Re: [Dev] [ML] Update - Deeplearning Integration to WSO2-ml

2015-07-15 Thread Sinnathamby Mahesan
Hi Thushan
Thank you for sending the attachments.
I am just wondering why I see many red dots in the graphs:
for example, for the iris data set, according to the table only 3 points
were predicted incorrectly,
whereas the scatter diagram shows many reds as well as greens.
Enlighten me if the way I see it is wrong.
:-)
Regards
Mahesan

On 13 July 2015 at 07:14, Thushan Ganegedara  wrote:

> Hi all,
>
> I have successfully integrated H2O deep learning into WSO2 ML. Following
> are the stats from the 2 tests conducted (screenshots attached).
>
> Iris dataset - 93.62% Accuracy
> MNIST (Small) dataset - 94.94% Accuracy
>
> However, there were a few unusual issues that I had to spend a lot of time
> identifying.
>
> *FrameSplitter does not work for any split value other than 0.5; for any
> other value, the following error is returned.*
> (FrameSplitter is used to split trainingData into train and valid sets.)
>
> barrier onExCompletion for
> hex.deeplearning.DeepLearning$DeepLearningDriver@25e994ae
> java.lang.RuntimeException: java.lang.RuntimeException:
> java.lang.NullPointerException
> at
> hex.deeplearning.DeepLearning$DeepLearningDriver.trainModel(DeepLearning.java:382)
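>
> For reference, a minimal sketch of how the split is invoked (this assumes
> the H2O-3 hex.FrameSplitter API; the key names are illustrative, not the
> actual WSO2 ML wiring):
>
>     import hex.FrameSplitter;
>     import water.H2O;
>     import water.Key;
>     import water.fvec.Frame;
>
>     /** Splits trainingData into train/valid parts; only 0.5 works for now. */
>     static Frame[] split(Frame trainingData, double trainFraction) {
>         FrameSplitter splitter = new FrameSplitter(
>             trainingData,
>             new double[] { trainFraction },   // fraction going to the first split
>             new Key[] { Key.make("train"), Key.make("valid") },
>             null);                            // no parent job key
>         H2O.submitTask(splitter);             // runs on H2O's fork/join pool
>         return splitter.getResult();          // [0] = train, [1] = valid
>     }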
>
> *DeepLearningModel.score(double[] vec) does not work correctly.*
> The predictions obtained with score(Frame f) and score(double[] v) are
> shown below.
>
> *Actual, score(Frame f), score(double[] v)*
> 0.0, 0.0, 1.0
> 1.0, 1.0, 2.0
> 2.0, 2.0, 2.0
> 2.0, 1.0, 2.0
> 1.0, 1.0, 2.0
>
> As you can see, score(double[] v) gives quite poor predictions.
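>
> As a workaround, a single row can be scored by wrapping it in a one-row
> Frame and calling score(Frame f), which behaves correctly. A minimal
> sketch (this assumes the H2O-3 water.fvec API; the helper is illustrative):
>
>     import hex.deeplearning.DeepLearningModel;
>     import water.fvec.Frame;
>     import water.fvec.Vec;
>
>     /** Scores one row via score(Frame), avoiding the broken score(double[]). */
>     static double scoreSingleRow(DeepLearningModel model,
>                                  String[] featureNames, double[] row) {
>         Vec[] vecs = new Vec[row.length];
>         for (int i = 0; i < row.length; i++) {
>             vecs[i] = Vec.makeVec(new double[] { row[i] }, Vec.newKey());
>         }
>         Frame oneRow = new Frame(featureNames, vecs);
>         Frame preds = model.score(oneRow);    // first column is the predicted label
>         double prediction = preds.vec(0).at(0);
>         preds.delete();                       // clean up the prediction frame
>         return prediction;
>     }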
>
> After fixing the above issues, everything seems to be working fine at the
> moment.
>
> However, I have a concern regarding the following method in
> view-model.jag -> function
> drawPredictedVsActualChart(testResultDataPointsSample):
>
> var actual = testResultDataPointsSample[i].predictedVsActual.actual;
> var predicted = testResultDataPointsSample[i].predictedVsActual.predicted;
> // labelPredicted thresholds the raw prediction at 0.5, which only
> // makes sense for binary classification
> var labeledPredicted = labelPredicted(predicted, 0.5);
>
> if (actual == labeledPredicted) {
>     predictedVsActualPoint[2] = 'Correct';
> } else {
>     predictedVsActualPoint[2] = 'Incorrect';
> }
>
> Why does it compare *actual and labeledPredicted* when it should be
> comparing *actual and predicted*?
>
> Also, the *Actual vs Predicted graph for MNIST shows the axes in "Meters"*
> (mnist.png), which doesn't make sense. I'm still looking into this.
>
> Thank you
>
>
>
> --
> Regards,
>
> Thushan Ganegedara
> School of IT
> University of Sydney, Australia
>


-- 
~~
Sinnathamby Mahesan
~~


[Dev] Testing the model

2015-07-15 Thread Sinnathamby Mahesan
Thanks to Nirmal and Chathirike
for demonstrating ML.
I am just curious about the model testing:
we create a project for a defined data set,
design a model with a training set,
then test the model (actual vs predicted),
and download the model.

Now my question is this:
Is it possible to use the same (downloaded) model to do further testing?
Or do we have to create it again? (Of course, if we haven't deleted the
project, we can still create the model again, but that is a different
scenario.)
:-)
Regards
Mahesan
~~
Sinnathamby Mahesan
~~


Re: [Dev] WSO2 Committers += CD Athuraliya

2015-07-31 Thread Sinnathamby Mahesan
Congratulations CD!

Best Wishes
Mahesan

On 31 July 2015 at 14:04, Nirmal Fernando  wrote:

> Hi All,
>
> It's my pleasure to announce *CD Athuraliya* as a *WSO2 Committer*. He
> has been a key contributor to the *WSO2 Machine Learner* product, and in
> recognition of his excellent work, he has been voted in as a WSO2 Committer.
>
> Congratulations CD and keep up the good work!
>
> --
>
> Thanks & regards,
> Nirmal
>
> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
> Mobile: +94715779733
> Blog: http://nirmalfdo.blogspot.com/
>
>
>


-- 
~~
Sinnathamby Mahesan
~~


Re: [Dev] [ML] Spark K-means clustering on KDD cup 99 dataset

2015-08-25 Thread Sinnathamby Mahesan
Hi Ashen
Thank you for sharing the results.
When I looked at the last column (anomaly data %),
the best value, 99.04%, results from 3 clusters with 100 iterations,
and
the worst case (28.12%) from 100 clusters with 100 iterations.

This would happen as k increases (with a fixed number of iterations).

As I understand it,
for some k, 100 iterations may be too many, and
for some other k, 100 may not be enough
(with the number of data points fixed in all cases).

You could limit the number of iterations by adding a convergence condition,
so that it iterates only until there is no change in the centroids.
(You could try it and see whether it makes any difference to the results.)
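
A minimal sketch of the idea (this assumes Spark MLlib's KMeans in Java;
setEpsilon sets the centroid-movement threshold below which iteration stops):

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.mllib.clustering.KMeans;
    import org.apache.spark.mllib.clustering.KMeansModel;
    import org.apache.spark.mllib.linalg.Vector;

    /** Trains k-means with an upper bound on iterations plus a
        convergence threshold, so it stops once the centroids settle. */
    static KMeansModel trainUntilConverged(JavaRDD<Vector> points, int k) {
        KMeans kmeans = new KMeans()
            .setK(k)
            .setMaxIterations(100)   // upper bound only
            .setEpsilon(1e-4);       // stop early once centroids barely move
        return kmeans.run(points.rdd());
    }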

Perhaps 100 clusters may not be necessary.

What percentage of the data points falls in each cluster?
That may be helpful in deciding on the number of clusters.

Good Luck
=
Mahesan

On 25 August 2015 at 11:36, Ashen Weerathunga  wrote:

> Hi all,
>
> I am currently working on a fraud detection project. I was able to cluster
> the KDD Cup 99 network anomaly detection dataset using the Apache Spark
> k-means algorithm. So far I have been able to achieve a 99% accuracy rate on
> this dataset. The steps I followed during the process are mentioned below.
>
>    - Separate the dataset into two parts (normal data and anomaly data)
>      by filtering on the label
>    - Split each of the two parts as follows:
>       - normal data
>          - 65% - to train the model
>          - 15% - to optimize the model by adjusting hyperparameters
>          - 20% - to evaluate the model
>       - anomaly data
>          - 65% - not used
>          - 15% - to optimize the model by adjusting hyperparameters
>          - 20% - to evaluate the model
>    - Preprocess the dataset:
>       - Drop non-numerical features, since k-means can only handle
>         numerical values
>       - Normalize all the values to the 0-1 range (see the sketch after
>         this list)
>    - Cluster the 65% of normal data using Apache Spark k-means and build
>      the model (15% of both normal and anomaly data were used to tune
>      hyperparameters such as k, percentile, etc. to get an optimized model)
>    - Finally, evaluate the model using the 20% of both normal and anomaly
>      data.
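>
> A minimal sketch of the min-max normalization step (plain Java; the helper
> name is illustrative, not the actual project code):
>
>     /** Min-max scales each column of data into [0, 1] in place. */
>     static void minMaxNormalize(double[][] data) {
>         int cols = data[0].length;
>         for (int j = 0; j < cols; j++) {
>             double min = Double.POSITIVE_INFINITY;
>             double max = Double.NEGATIVE_INFINITY;
>             for (double[] row : data) {       // first pass: find the column range
>                 min = Math.min(min, row[j]);
>                 max = Math.max(max, row[j]);
>             }
>             double range = max - min;
>             for (double[] row : data) {       // second pass: scale to [0, 1]
>                 row[j] = range == 0 ? 0.0 : (row[j] - min) / range;
>             }
>         }
>     }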
>
> The method of identifying a fraud is as follows (see the sketch after this
> list):
>
>    - When a new data point comes in, get the closest cluster center using
>      the k-means predict function.
>    - I calculated the 98th percentile distance for each cluster. (98 was
>      the best value I got by tuning the model with different values.)
>    - Then I checked whether the distance of the new data point from the
>      given cluster center is less than or greater than the 98th percentile
>      of that cluster. If it is less than the percentile, it is considered
>      normal data. If it is greater than the percentile, it is considered a
>      fraud, since it lies outside the cluster.
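>
> A minimal sketch of that check (this assumes Spark MLlib's KMeansModel in
> Java; the per-cluster threshold array is illustrative):
>
>     import org.apache.spark.mllib.clustering.KMeansModel;
>     import org.apache.spark.mllib.linalg.Vector;
>     import org.apache.spark.mllib.linalg.Vectors;
>
>     /** Flags a point as fraud if it lies beyond its cluster's
>         98th-percentile distance threshold. */
>     static boolean isFraud(KMeansModel model, double[] percentile98,
>                            Vector point) {
>         int cluster = model.predict(point);             // closest cluster center
>         Vector center = model.clusterCenters()[cluster];
>         double distance = Math.sqrt(Vectors.sqdist(point, center));
>         return distance > percentile98[cluster];        // outside => anomaly
>     }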
>
> Our next step is to integrate this feature into the ML product and try it
> out with a more realistic dataset. A summary of the results I obtained
> using the 98th percentile during the process is linked below.
>
>
> https://docs.google.com/a/wso2.com/spreadsheets/d/1E5fXk9CM31QEkyFCIEongh8KAa6jPeoY7OM3HraGPd4/edit?usp=sharing
>
> Thanks and Regards,
> Ashen
> --
> *Ashen Weerathunga*
> Software Engineer - Intern
> WSO2 Inc.: http://wso2.com
> lean.enterprise.middleware
>
> Email: as...@wso2.com
> Mobile: +94 716042995
> LinkedIn: http://lk.linkedin.com/in/ashenweerathunga
>



-- 
~~
Sinnathamby Mahesan
~~


Re: [Dev] [ML] Accuracy Measure for Anomaly Detection?

2015-09-16 Thread Sinnathamby Mahesan
Dear Ashen
Sensitivity - with a view to reducing false negatives.
Precision - with a view to reducing false positives.

The F1 score combines both as the harmonic mean of precision and sensitivity.

That's why F1 is normally chosen, and it is simple: 2TP / (2TP + FN + FP).



By the way, which do you consider to be the true positive:
(a) Anomaly - Anomaly
or
(b) Normal - Normal?

I think case (a) is better suited with regard to your objective.

Or, if you have trouble choosing which way:

You could consider Accuracy (Acc), which is somewhat similar to F1 but
gives the same weight to TP and TN:
Acc = (TP + TN) / (TP + TN + FN + FP)
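
A minimal sketch of both measures computed from a confusion matrix
(plain Java; the method names are illustrative):

    /** F1 = 2TP / (2TP + FN + FP): harmonic mean of precision and sensitivity. */
    static double f1(long tp, long fp, long fn) {
        return 2.0 * tp / (2.0 * tp + fn + fp);
    }

    /** Acc = (TP + TN) / (TP + TN + FN + FP): weights TP and TN equally. */
    static double accuracy(long tp, long tn, long fp, long fn) {
        return (double) (tp + tn) / (tp + tn + fn + fp);
    }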



= Good Luck




On 16 September 2015 at 15:35, Ashen Weerathunga  wrote:

> Hi all,
>
> I am currently doing the integration of the anomaly detection feature for
> ML. I have the problem of choosing the best accuracy measure for the model.
> I can get the confusion matrix, which consists of true positives, true
> negatives, false positives and false negatives. There are a few different
> measures, such as sensitivity, accuracy, F1 score, etc. So what would be
> the best measure to give as the model accuracy for an anomaly detection
> model?
>
> Some details about those measures [1]
> <https://en.wikipedia.org/wiki/Sensitivity_and_specificity>:
>
> Terminology and derivations from a confusion matrix
> <https://en.wikipedia.org/wiki/Confusion_matrix>:
>
>    - true positive (TP): eqv. with hit
>    - true negative (TN): eqv. with correct rejection
>    - false positive (FP): eqv. with false alarm, Type I error
>    - false negative (FN): eqv. with miss, Type II error
>
>    - sensitivity, true positive rate (TPR), hit rate, or recall:
>      TPR = TP / P = TP / (TP + FN)
>    - specificity (SPC) or true negative rate:
>      SPC = TN / N = TN / (TN + FP)
>    - precision or positive predictive value (PPV):
>      PPV = TP / (TP + FP)
>    - negative predictive value (NPV):
>      NPV = TN / (TN + FN)
>    - fall-out or false positive rate (FPR):
>      FPR = FP / N = FP / (FP + TN) = 1 - SPC
>    - false negative rate (FNR):
>      FNR = FN / (TP + FN) = 1 - TPR
>    - false discovery rate (FDR):
>      FDR = FP / (TP + FP) = 1 - PPV
>
>    - accuracy (ACC):
>      ACC = (TP + TN) / (TP + FP + FN + TN)
>    - F1 score, the harmonic mean of precision and sensitivity:
>      F1 = 2TP / (2TP + FP + FN)
>    - Matthews correlation coefficient (MCC):
>      MCC = (TP * TN - FP * FN) /
>            sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))
>    - Informedness: TPR + SPC - 1
>    - Markedness: PPV + NPV - 1
>
> *Sources: Fawcett (2006) and Powers (2011).* [1]
> <https://en.wikipedia.org/wiki/Sensitivity_and_specificity#cite_note-Fawcett2006-1>
> [2]
> <https://en.wikipedia.org/wiki/Sensitivity_and_specificity#cite_note-Powers2011-2>
>
> Thanks