[ https://issues.apache.org/jira/browse/SOLR-9252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15360089#comment-15360089 ]
Cao Manh Dat edited comment on SOLR-9252 at 7/2/16 9:13 AM:
------------------------------------------------------------

Updated patch. I changed the feature selection formulation to the correct one (https://en.wikipedia.org/wiki/Information_gain_in_decision_trees). Here are the test results of the new formulation (https://docs.google.com/spreadsheets/d/1BRjFgZDiJPBT51kggcCznoK0ES1-N-RbOIJaoDT3qgM/edit?usp=sharing). I think the patch is ready now.

> Feature selection and logistic regression on text
> -------------------------------------------------
>
>                 Key: SOLR-9252
>                 URL: https://issues.apache.org/jira/browse/SOLR-9252
>             Project: Solr
>          Issue Type: Improvement
>   Security Level: Public (Default Security Level. Issues are Public)
>           Reporter: Cao Manh Dat
>      Attachments: SOLR-9252.patch, SOLR-9252.patch, enron1.zip
>
> SOLR-9186 came up with a challenge: in each iteration we have to rebuild the
> tf-idf vector for each document. This is a costly computation if we represent
> a document by a lot of terms. Feature selection can help reduce the
> computation.
> Due to its computational efficiency and simple interpretation, information
> gain is one of the most popular feature selection methods. It measures the
> dependence between features and labels by calculating the information gain
> between the i-th feature and the class labels
> (http://www.jiliang.xyz/publication/feature_selection_for_classification.pdf).
> I confirmed this by running logistic regression on the Enron mail dataset (in
> which each email is represented by the top 100 terms with the highest
> information gain) and got an accuracy of 92% and a precision of 82%.
> This ticket will create two new streaming expressions. Both of them use the
> same *parallel iterative framework* as SOLR-8492.
> {code}
> featuresSelection(collection1, q="*:*", field="tv_text", outcome="out_i", positiveLabel=1, numTerms=100)
> {code}
> featuresSelection will emit the top terms with the highest information gain
> scores. It can be combined with the new tlogit stream.
> {code}
> tlogit(collection1, q="*:*",
>        featuresSelection(collection1,
>                          q="*:*",
>                          field="tv_text",
>                          outcome="out_i",
>                          positiveLabel=1,
>                          numTerms=100),
>        field="tv_text",
>        outcome="out_i",
>        maxIterations=100)
> {code}
> In iteration n, the text logistic regression will emit the nth model and
> compute the error of the (n-1)th model, because the error would be wrong if
> we computed it dynamically within the same iteration.
> In each iteration tlogit will adjust the learning rate based on the error of
> the previous iteration: it will increase the learning rate by 5% if the error
> is going down, and decrease it by 50% if the error is going up.
> This will support use cases such as building models for spam detection,
> sentiment analysis and threat detection.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
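The information-gain criterion referenced above (IG between a term's presence and the class labels) can be sketched as follows. This is a minimal Python illustration of the formula from the linked survey, not the actual Solr patch code; the function names `entropy` and `information_gain` are mine.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature_present, labels):
    """IG(feature; label) = H(label) - sum over v of p(v) * H(label | feature=v).

    feature_present is a list of booleans: whether the term occurs in each doc.
    """
    n = len(labels)
    ig = entropy(labels)
    for value in (True, False):
        # Labels of the documents where the feature takes this value.
        subset = [y for x, y in zip(feature_present, labels) if x is value]
        if subset:
            ig -= (len(subset) / n) * entropy(subset)
    return ig
```

A term that perfectly separates the classes gets IG equal to the label entropy, while a term distributed independently of the labels gets IG 0; featuresSelection keeps the `numTerms` terms with the highest scores.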
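The adaptive learning-rate rule described for tlogit (grow 5% on improvement, halve on regression) can be sketched as a small helper. This is an illustrative Python sketch of the stated rule, not code from the patch; the function name and signature are assumptions.

```python
def adjust_learning_rate(rate, prev_error, curr_error):
    """Apply the tlogit-style schedule described in the ticket:
    increase the rate by 5% when the error went down,
    cut it by 50% when the error went up, keep it otherwise."""
    if curr_error < prev_error:
        return rate * 1.05
    if curr_error > prev_error:
        return rate * 0.5
    return rate
```

This kind of multiplicative backoff recovers quickly from an overshoot (one bad step halves the rate) while only cautiously accelerating when training is stable.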