[ https://issues.apache.org/jira/browse/SPARK-21594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joseph Wang updated SPARK-21594:
--------------------------------
    Remaining Estimate: 168h
     Original Estimate: 168h
           Description:
Semi-supervised learning efforts have just started in the Spark machine learning library. This is a very important direction when labelled data is limited and costly to obtain. With this effort, the warm-up time for supervised learning can be minimized.

One of the key features needed is the ability to output probabilities from the existing machine learning algorithms, so that unlabelled data can be selected by probability, as in self-training. An algorithm that has a tendency to overfit is particularly useful here. The multilayer perceptron classifier (MLP) is one such case.

I found that this is not possible with MLP (or neural networks). This is an inconsistent offering that needs to be improved.

thanks
Joseph

  was:
My question is: is it possible to get not only the labels, but also (or only) the probability for each label? That is, not just 0 or 1 for every input, but something like 0.95 for 0 and 0.05 for 1. It seems this is not possible with MLP, although it is possible with other classifiers. I have only used MLP because I know it should be capable of returning the probability, but I can't find it in PySpark.

This is an inconsistent offering that needs to be fixed: probability output is provided by other algorithms in Spark MLlib with Spark DataFrames, but not by MLP, which is related to AI.

thanks
Joseph


> Missing probability output from MultilayerPerceptronClassifier
> --------------------------------------------------------------
>
>                 Key: SPARK-21594
>                 URL: https://issues.apache.org/jira/browse/SPARK-21594
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>    Affects Versions: 2.2.0
>         Environment: SPARK, PySpark, Scala, SparkR
>            Reporter: Joseph Wang
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Semi-supervised learning efforts have just started in the Spark machine
> learning library. This is a very important direction when labelled data is
> limited and costly to obtain. With this effort, the warm-up time for
> supervised learning can be minimized.
> One of the key features needed is the ability to output probabilities from
> the existing machine learning algorithms, so that unlabelled data can be
> selected by probability, as in self-training. An algorithm that has a
> tendency to overfit is particularly useful here. The multilayer perceptron
> classifier (MLP) is one such case.
> I found that this is not possible with MLP (or neural networks). This is an
> inconsistent offering that needs to be improved.
> thanks
> Joseph
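
For illustration, the selection step the issue describes (picking unlabelled rows by predicted probability for self-training) could be sketched roughly as follows. This is plain Python rather than Spark code: the helper `select_confident`, the 0.9 threshold, and the sample probability rows are hypothetical and only meant to show why a per-class probability output is needed, not how Spark implements or would implement it.

```python
from typing import List, Sequence, Tuple


def select_confident(
    probabilities: Sequence[Sequence[float]],
    threshold: float = 0.9,
) -> List[Tuple[int, int, float]]:
    """Return (row_index, pseudo_label, confidence) for confident rows.

    `probabilities` holds one per-class probability vector per unlabelled
    row, i.e. the kind of output the issue asks the classifier to expose.
    """
    selected = []
    for row_index, probs in enumerate(probabilities):
        # The pseudo-label is the class with the highest probability.
        pseudo_label = max(range(len(probs)), key=lambda k: probs[k])
        confidence = probs[pseudo_label]
        # Keep only rows the model is confident about (assumed threshold).
        if confidence >= threshold:
            selected.append((row_index, pseudo_label, confidence))
    return selected


# Hypothetical classifier output for three unlabelled rows.
unlabelled_probs = [
    [0.95, 0.05],
    [0.55, 0.45],
    [0.08, 0.92],
]
picked = select_confident(unlabelled_probs, threshold=0.9)
print(picked)
```

With only hard 0/1 predictions, the second row would be indistinguishable from the first; with probabilities, it can be left out of the pseudo-labelled set until the model is more certain.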