Re: [Dev] [ML] Predicted vs. actuals chart in model summary

2015-05-28 Thread CD Athuraliya
Hi all,

A residual plot has been added for numerical prediction algorithms. Using
standard chart types as much as possible is better IMO, since it reduces
user confusion when interpreting visualizations. I think we need to look
for standard chart types for classification algorithms (both binary and
multiclass) as well [1].

[1] http://oobaloo.co.uk/visualising-classifier-results-with-ggplot2

Thanks

On Wed, May 27, 2015 at 5:38 AM, Srinath Perera srin...@wso2.com wrote:

 +1 shall we try those?
 On 26 May 2015 22:52, Upul Bandara u...@wso2.com wrote:

 +1 for residual plots.

 Though I haven't used it myself, a residual plot is a useful diagnostic
 tool for regression models. In particular, non-linearity in a regression
 model can be identified easily with one.

 The book An Introduction to Statistical Learning [1] (pages 92-96)
 contains some useful information about residual plots.

 [1] http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Fourth%20Printing.pdf

 On Tue, May 26, 2015 at 8:47 PM, Supun Sethunga sup...@wso2.com wrote:

 Hi CD,

 As it came up in the offline discussion as well, IMHO this plot may not be
 the best option for classifications. For regression, though, we can use it
 with a slight modification: plot the difference between the predicted and
 actual values (rather than the values themselves) against a predictor
 variable, just as it's done at the moment. We can also add a third,
 categorical feature to colour the points. This is a standard plot (AKA a
 residual plot) which is commonly used to evaluate regression models.
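 As an illustration of the plot described above, here is a minimal sketch in
 Python with scikit-learn and matplotlib (the data and model are synthetic
 placeholders, not the product's own code):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))            # one predictor variable
y = 2.0 * X[:, 0] + rng.normal(0, 1.0, 200)      # actual values

model = LinearRegression().fit(X, y)
residuals = model.predict(X) - y                 # predicted minus actual

# Plot the residuals against the predictor; a categorical third
# variable could be passed via `c=` to colour the points.
plt.scatter(X[:, 0], residuals, s=10)
plt.axhline(0.0, color="red")
plt.xlabel("predictor")
plt.ylabel("residual (predicted - actual)")
plt.savefig("residuals.png")
```

 For a well-specified linear model the residuals scatter evenly around
 zero; visible curvature in the plot hints at non-linearity.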

 One other thing we can try is doing the same for classification, i.e.
 taking the difference between the actual probability (0 or 1) and the
 predicted probability, plotting that, and seeing whether it gives a better
 overall picture. Not sure how it will come out though :) If it works, any
 point lying above 0.5 (or whatever threshold we used) is wrongly
 classified, so we can get a rough idea of which values of the x-axis
 feature get wrongly classified. That is, we should be able to see a
 pattern, if one exists.
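 Sketched in the same spirit (synthetic data, hypothetical names), the
 classification variant would look roughly like this; at a 0.5 threshold, a
 point is misclassified exactly when this "residual" exceeds 0.5:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=4, random_state=0)
clf = LogisticRegression().fit(X, y)
p = clf.predict_proba(X)[:, 1]   # predicted probability of class 1

# "Residual": distance between the actual label (0 or 1) and the
# predicted probability. Plot this against a feature of interest.
resid = np.abs(y - p)
misclassified = resid > 0.5      # matches the 0.5 decision threshold
```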

 Thanks,
 Supun

 On Tue, May 26, 2015 at 6:08 PM, CD Athuraliya chathur...@wso2.com
 wrote:

 Hi,

 Plotting predicted and actual values against a feature doesn't look
 very intuitive, especially for non-probabilistic models. Please check the
 attachments. Any thoughts on making this visualization better?

 Thanks

 On Fri, May 22, 2015 at 3:27 PM, Srinath Perera srin...@wso2.com
 wrote:

 yes, rerun using a random sample from test data is OK.

 --Srinath

 On Fri, May 22, 2015 at 2:28 PM, CD Athuraliya chathur...@wso2.com
 wrote:

 Hi Srinath,

 Still, that random sample will not correspond to the predicted vs. actual
 values in the test results, given that there is no mapping between the
 random sample data points and the test result points. One thing we can do
 is run the test separately (using the same model) on the sampled data, for
 the sole purpose of visualization. Any other options?
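 That re-run could be sketched like this (a Python/scikit-learn stand-in
 for the actual Spark pipeline, with hypothetical names): predicting again
 on a small random sample of the test set keeps the feature values, actual
 values, and predictions aligned row by row.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(5000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 1.0, 5000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=42)
model = LinearRegression().fit(X_train, y_train)

# Sample the test set and predict again with the same model, so each
# sampled row carries its features, actual value, and prediction.
idx = rng.choice(len(X_test), size=min(1000, len(X_test)), replace=False)
X_sample, y_sample = X_test[idx], y_test[idx]
y_pred = model.predict(X_sample)
```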

 On Fri, May 22, 2015 at 2:06 PM, Srinath Perera srin...@wso2.com
 wrote:

 Hi CD,

 Can we take a random sample from the test data and use that for this
 process?

 --Srinath

 On Fri, May 22, 2015 at 12:00 PM, CD Athuraliya chathur...@wso2.com
  wrote:

 Hi all,

 To implement $subject in ML we need all feature values of the dataset
 against the predicted and actual values for the test data. But Spark only
 returns predicted and actual values as test results. Right now we use a
 random 10,000 data rows for other visualizations, and we cannot use the
 same data for this visualization since those random 10,000 rows do not
 correspond to the test data (the test data is subtracted from the dataset
 according to the train data fraction at the model building stage).

 One option is to persist the test data at the testing stage, but it can be
 too large for some datasets depending on the train data fraction.
 Appreciate if you can give your comments on this.

 Thanks,
 CD

 --
 *CD Athuraliya*
 Software Engineer
 WSO2, Inc.
 lean . enterprise . middleware
 Mobile: +94 716288847
 LinkedIn http://lk.linkedin.com/in/cdathuraliya | Twitter
 https://twitter.com/cdathuraliya | Blog
 http://cdathuraliya.tumblr.com/




 --
 
 Blog: http://srinathsview.blogspot.com twitter:@srinath_perera
 Site: http://people.apache.org/~hemapani/
 Photos: http://www.flickr.com/photos/hemapani/
 Phone: 0772360902


Re: [Dev] [ML] Predicted vs. actuals chart in model summary

2015-05-28 Thread Maheshakya Wijewardena
Nice.

Adding to the charts for classification, I think we need a visualization
method for clustering as well, since there's nothing to show after
clustering models are trained. Maybe a chart with respect to two selected
attributes.


Re: [Dev] [ML] Predicted vs. actuals chart in model summary

2015-05-28 Thread CD Athuraliya
Hi Maheshakya,

We'll be adding a cluster diagram to the model summary for clustering
algorithms. Please suggest any other useful evaluation metrics.

Thanks


Re: [Dev] [ML] Predicted vs. actuals chart in model summary

2015-05-28 Thread Maheshakya Wijewardena
Hi CD,

Two of the widely used evaluation metrics are the Rand index [1] and mutual
information [2]. In addition, there are homogeneity, completeness and
V-measure [3]. One issue with these external indices is that they require
the ground truth of the cluster assignments; without the true class labels,
these metrics are not usable. There are also internal indices, such as the
Silhouette Coefficient [4], which do not need ground truth. Some of these
methods are discussed in [5][6][7]. I think the more useful option is the
internal indices, since ground-truth cluster labels are not always
available.
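For example, with scikit-learn (used here only as an illustrative stand-in;
WSO2 ML would need its own implementation), the external indices compare a
clustering against known labels:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score

# Well-separated synthetic clusters with known ground-truth labels.
X, y_true = make_blobs(n_samples=300, centers=[[-5, -5], [0, 0], [5, 5]],
                       cluster_std=1.0, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Both scores are 1.0 for a perfect match and ~0.0 for random labels.
ari = adjusted_rand_score(y_true, labels)
ami = adjusted_mutual_info_score(y_true, labels)
```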

For visualization, only 2D (or maybe 3D) plots can be used even though
there may be a large number of features. So the available options are:

   1. Allowing the user to choose 2 or 3 features.
   2. Using data reduced to 2 or 3 components with PCA - here, PCA may
   need to be implemented separately, so this option can be quite tedious.
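Option 2 could be sketched as follows (scikit-learn again as a stand-in;
the point is that PCA is fitted purely for the 2D plot, while the
clustering itself uses all features):

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Six features, so the raw data cannot be plotted directly.
X, _ = make_blobs(n_samples=400, n_features=6, centers=3, random_state=1)
labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)

# Reduce to two principal components for plotting only; colour the
# projected points by `labels` to show the cluster assignments.
X_2d = PCA(n_components=2).fit_transform(X)
```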

It would be nice if the Voronoi diagram for the data spread could also be
shown in the same plot. See [8].
[1] http://en.wikipedia.org/wiki/Rand_index#Adjusted_Rand_index
[2] http://en.wikipedia.org/wiki/Adjusted_mutual_information
[3] http://aclweb.org/anthology/D/D07/D07-1043.pdf
[4] http://en.wikipedia.org/wiki/Silhouette_%28clustering%29
[5]
http://stats.stackexchange.com/questions/21807/evaluation-measure-of-clustering-without-having-truth-labels
[6] https://web.njit.edu/~yl473/papers/ICDM10CLU.pdf
[7] http://shonen.naun.org/multimedia/UPress/cc/20-463.pdf
[8] http://www.naftaliharris.com/blog/visualizing-k-means-clustering/
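Among the internal indices, the Silhouette Coefficient [4] is
straightforward to compute from the data and the cluster labels alone,
e.g. (illustrative scikit-learn sketch):

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=[[-5, -5], [0, 0], [5, 5]],
                  cluster_std=1.0, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Ranges from -1 to 1; higher means tighter, better-separated
# clusters. No ground-truth labels are needed.
score = silhouette_score(X, labels)
```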

Best regards.


Re: [Dev] [ML] Predicted vs. actuals chart in model summary

2015-05-28 Thread CD Athuraliya
Hi Maheshakya,

Thanks for the very detailed response. We'll be reusing the cluster diagram
from the data exploration view to visualize clusters. What we're mostly
missing are measures of the training process and the resulting model. I
will check the measures you have mentioned. :)

Regards,
CD


Re: [Dev] [ML] Predicted vs. actuals chart in model summary

2015-05-28 Thread Nirmal Fernando
Great work CD!


Re: [Dev] [ML] Predicted vs. actuals chart in model summary

2015-05-26 Thread Srinath Perera
+1 shall we try those?

Re: [Dev] [ML] Predicted vs. actuals chart in model summary

2015-05-26 Thread CD Athuraliya
Hi,

Plotting predicted and actual values against a feature doesn't look very
intuitive, especially for non-probabilistic models. Please check the
attachments. Any thoughts on making this visualization better?

Thanks

On Fri, May 22, 2015 at 3:27 PM, Srinath Perera srin...@wso2.com wrote:

 yes, rerun using a random sample from test data is OK.

 --Srinath


-- 
*CD Athuraliya*
Software Engineer
WSO2, Inc.
lean . enterprise . middleware
Mobile: +94 716288847
LinkedIn http://lk.linkedin.com/in/cdathuraliya | Twitter
https://twitter.com/cdathuraliya | Blog http://cdathuraliya.tumblr.com/
___
Dev mailing list
Dev@wso2.org
http://wso2.org/cgi-bin/mailman/listinfo/dev


Re: [Dev] [ML] Predicted vs. actuals chart in model summary

2015-05-26 Thread Supun Sethunga
Hi CD,

As came up in the offline discussion as well, IMHO, for classification this
plot may not be the best option. But for regression we can actually use this
plot with a slight modification: take the difference between the predicted
and actual values (rather than the values themselves) and plot that against
a predictor variable (just as it is done at the moment). We can also add a
third variable (a categorical feature) to color the points. This is a
standard plot (a.k.a. a residual plot) which is commonly used to evaluate
regression models.

One other thing we can try is doing the same for classification, i.e. taking
the difference between the actual class (0 or 1) and the predicted
probability, plotting that, and seeing whether it gives a better overall
picture. Not sure how it will come out though :) If it works, any point
whose absolute difference lies above 0.5 (or whatever threshold we used) is
wrongly classified, so we can get a rough idea of the values of the x-axis
feature for which points get wrongly classified. That is, we should be able
to see a pattern, if one exists.
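To make the idea concrete, here is a minimal sketch (plain Python, not WSO2
ML code; the function names are made up) of both residual flavors:

```python
# For regression: residual = actual - predicted.
# For classification: residual = actual class (0/1) - predicted probability;
# a point with |residual| above the threshold crossed the decision boundary,
# i.e. it was misclassified.

def regression_residuals(actual, predicted):
    return [a - p for a, p in zip(actual, predicted)]

def misclassified(actual_labels, predicted_probs, threshold=0.5):
    residuals = [a - p for a, p in zip(actual_labels, predicted_probs)]
    return [abs(r) > threshold for r in residuals]

# Example: the third point (label 1, predicted probability 0.2) is wrong.
print(regression_residuals([3.0, 5.0], [2.5, 5.5]))   # [0.5, -0.5]
print(misclassified([0, 1, 1], [0.1, 0.9, 0.2]))      # [False, False, True]
```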

Thanks,
Supun


-- 
*Supun Sethunga*
Software Engineer
WSO2, Inc.
http://wso2.com/
lean | enterprise | middleware
Mobile : +94 716546324


Re: [Dev] [ML] Predicted vs. actuals chart in model summary

2015-05-26 Thread Upul Bandara
+1 for residual plots.

Though I haven't used it myself, a residual plot is a useful diagnostic tool
for regression models. In particular, non-linearity in a regression model
can be easily identified from it.

An Introduction to Statistical Learning [1] (pages 92-96) contains some
useful information about residual plots.

[1]. http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Fourth%20Printing.pdf
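As a tiny illustration of that diagnostic (plain Python, not from the book
or from WSO2 ML): fitting a straight line to data with a quadratic
relationship leaves residuals with a clear U-shape rather than random
scatter, which is exactly how the plot exposes non-linearity.

```python
def fit_line(xs, ys):
    # Ordinary least squares for y = a + b*x.
    n = len(xs)
    xm, ym = sum(xs) / n, sum(ys) / n
    b = sum((x - xm) * (y - ym) for x, y in zip(xs, ys)) / \
        sum((x - xm) ** 2 for x in xs)
    a = ym - b * xm
    return a, b

xs = [0, 1, 2, 3, 4]
ys = [x * x for x in xs]          # truly quadratic relationship
a, b = fit_line(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
print(residuals)                  # [2.0, -1.0, -2.0, -1.0, 2.0] -- U-shaped
```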


-- 
Upul Bandara,
Associate Technical Lead, WSO2, Inc.,
Mob: +94 715 468 345.


Re: [Dev] [ML] Predicted vs. actuals chart in model summary

2015-05-22 Thread Srinath Perera
Hi CD,

Can we take a random sample from the test data and use that for this
process?

--Srinath

On Fri, May 22, 2015 at 12:00 PM, CD Athuraliya chathur...@wso2.com wrote:

 Hi all,

 To implement $subject in ML we need all feature values of the dataset
 against the predicted and actual values for the test data. But Spark only
 returns predicted and actual values as test results. Right now we use a
 random 10,000 data rows for other visualizations, and we cannot use the
 same data for this visualization, since those random 10,000 rows do not
 correspond to the test data (the test data is subtracted from the dataset
 according to the train data fraction at the model building stage).

 One option is to persist the test data at the testing stage, but it can be
 too large for some datasets, depending on the train data fraction.
 Appreciate it if you can give your comments on this.

 Thanks,
 CD





-- 

Blog: http://srinathsview.blogspot.com twitter:@srinath_perera
Site: http://people.apache.org/~hemapani/
Photos: http://www.flickr.com/photos/hemapani/
Phone: 0772360902


Re: [Dev] [ML] Predicted vs. actuals chart in model summary

2015-05-22 Thread Supun Sethunga

 Can we take a random sample from the test data and use that for this
 process?
 --Srinath


+1

AFAIK, we are doing a similar thing for the ROC curve points too.

Regards,
Supun
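For reference, ROC points can be computed from a sample of (label, score)
pairs along these lines — a hedged sketch in plain Python, not the actual
WSO2 ML implementation:

```python
# Sweep a threshold over the scores; at each threshold record the false
# positive rate and true positive rate, which are the ROC curve points.

def roc_points(labels, scores, thresholds):
    points = []
    pos = sum(labels)
    neg = len(labels) - pos
    for t in thresholds:
        tp = sum(1 for l, s in zip(labels, scores) if s >= t and l == 1)
        fp = sum(1 for l, s in zip(labels, scores) if s >= t and l == 0)
        points.append((fp / neg, tp / pos))   # (FPR, TPR)
    return points

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_points(labels, scores, [0.0, 0.5, 1.0]))
# [(1.0, 1.0), (0.0, 0.5), (0.0, 0.0)]
```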



Re: [Dev] [ML] Predicted vs. actuals chart in model summary

2015-05-22 Thread CD Athuraliya
Hi Srinath,

Still, that random sample will not correspond to the predicted vs. actual
values in the test results, given that there is no mapping between random
sample data points and test result points. One thing we can do is run the
test separately (using the same model) on the sampled data, for the sole
purpose of visualization. Any other options?
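A sketch of that option: run the model again on a sampled subset so the
features stay paired with the predictions. Illustrative plain Python only;
`sample_for_visualization` and the toy model are made up, not the ML/Spark
API.

```python
import random

def sample_for_visualization(test_rows, predict, k=3, seed=42):
    # test_rows: list of (features, actual) pairs; predict: model function.
    # Re-running predict on the sample keeps (features, actual, predicted)
    # triples together for plotting.
    rng = random.Random(seed)
    sample = rng.sample(test_rows, min(k, len(test_rows)))
    return [(features, actual, predict(features)) for features, actual in sample]

# Toy model: predict y = 2*x from a single feature.
rows = [([x], 2 * x) for x in range(10)]
triples = sample_for_visualization(rows, lambda f: 2 * f[0], k=3)
print(len(triples))   # 3
```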



Re: [Dev] [ML] Predicted vs. actuals chart in model summary

2015-05-22 Thread Lahiru Sandaruwan
Hi,

I'm not sure what kind of data set you are looking for. But we have a real
use case of predictions together with the actual data for the predicted
time, in Stratos. Load average, memory consumption, and requests in flight
are currently predicted in Stratos, and we use CEP to receive those data.

Thanks


-- 
--
Lahiru Sandaruwan
Committer and PMC member, Apache Stratos,
Senior Software Engineer,
WSO2 Inc., http://wso2.com
lean.enterprise.middleware

phone: +94773325954
email: lahi...@wso2.com blog: http://lahiruwrites.blogspot.com/
linked-in: http://lk.linkedin.com/pub/lahiru-sandaruwan/16/153/146