[ https://issues.apache.org/jira/browse/SOLR-9252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15373522#comment-15373522 ]
Joel Bernstein edited comment on SOLR-9252 at 7/12/16 7:28 PM:
---------------------------------------------------------------
I just reviewed the latest patch.

One implementation detail: the terms component also returns numDocs now that SOLR-9193 has been committed, so you can retrieve numDocs along with the doc frequencies by adding the terms.stats param.

And one question about the use of tf-idf: you're using tf-idf for the doc vectors, which seems like a good idea. Is this a typical approach for text regression, or is it something you decided to do because we have access to these types of stats in the index?

> Feature selection and logistic regression on text
> -------------------------------------------------
>
>                 Key: SOLR-9252
>                 URL: https://issues.apache.org/jira/browse/SOLR-9252
>             Project: Solr
>          Issue Type: Improvement
>   Security Level: Public (Default Security Level. Issues are Public)
>            Reporter: Cao Manh Dat
>            Assignee: Joel Bernstein
>      Attachments: SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, enron1.zip
>
> SOLR-9186 raised a challenge: on each iteration we have to rebuild the tf-idf vector for every document, which is a costly computation when each document is represented by many terms. Feature selection can help reduce that computation.
> Due to its computational efficiency and simple interpretation, information gain is one of the most popular feature selection methods.
> It is used to measure the dependence between features and labels: it calculates the information gain between the i-th feature and the class labels (http://www.jiliang.xyz/publication/feature_selection_for_classification.pdf).
> I confirmed this by running logistic regression on the Enron mail dataset (in which each email is represented by the top 100 terms with the highest information gain) and got 92% accuracy and 82% precision.
> This ticket will create two new streaming expressions. Both of them use the same *parallel iterative framework* as SOLR-8492.
> {code}
> featuresSelection(collection1, q="*:*", field="tv_text", outcome="out_i", positiveLabel=1, numTerms=100)
> {code}
> featuresSelection will emit the top terms with the highest information gain scores. It can be combined with the new tlogit stream.
> {code}
> tlogit(collection1, q="*:*",
>        featuresSelection(collection1,
>                          q="*:*",
>                          field="tv_text",
>                          outcome="out_i",
>                          positiveLabel=1,
>                          numTerms=100),
>        field="tv_text",
>        outcome="out_i",
>        maxIterations=100)
> {code}
> In iteration n, the text logistic regression will emit the nth model and compute the error of the (n-1)th model, because the error would be wrong if we computed it dynamically within the same iteration.
> In each iteration tlogit will adjust the learning rate based on the error of the previous iteration: it will increase the learning rate by 5% if the error is going down, and decrease it by 50% if the error is going up.
> This will support use cases such as building models for spam detection, sentiment analysis and threat detection.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
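[Editor's note] The information-gain selection the ticket describes (score every term against the class labels, keep the top numTerms) can be sketched as follows. This is an illustrative sketch, not the Solr implementation; all function names here are my own, and it works on in-memory term sets rather than index statistics.

```python
import math

def entropy(labels):
    """Shannon entropy (bits) of a list of 0/1 class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def information_gain(docs, labels, term):
    """IG between a term's presence/absence and the class labels.

    docs   -- list of sets of terms, one per document
    labels -- parallel list of 0/1 outcomes
    """
    with_term = [l for d, l in zip(docs, labels) if term in d]
    without   = [l for d, l in zip(docs, labels) if term not in d]
    n = len(labels)
    # H(class) - H(class | term): how much knowing the term tells us.
    conditional = (len(with_term) / n) * entropy(with_term) \
                + (len(without) / n) * entropy(without)
    return entropy(labels) - conditional

def select_features(docs, labels, k):
    """Rank all terms by IG and keep the top k, as featuresSelection
    does with its numTerms parameter."""
    vocab = set().union(*docs)
    return sorted(vocab,
                  key=lambda t: information_gain(docs, labels, t),
                  reverse=True)[:k]
```

A term that appears only in positive documents (and never in negatives) gets the maximum gain, since its presence fully determines the label on the training set.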
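[Editor's note] Joel's comment above points out that, with SOLR-9193 committed, the terms component returns numDocs alongside the doc frequencies (via terms.stats) — exactly the inputs needed to build the tf-idf doc vectors the patch uses. A minimal sketch of that computation; the function names and the unsmoothed idf formula are my choices for illustration, not what the patch necessarily does:

```python
import math

def tf_idf(term_freq, doc_freq, num_docs):
    """tf-idf weight from the stats the terms component exposes:
    term frequency in the doc, doc frequency, and corpus size."""
    return term_freq * math.log(num_docs / doc_freq)

def doc_vector(term_freqs, doc_freqs, num_docs):
    """Build one document's tf-idf vector.

    term_freqs -- {term: tf in this document}
    doc_freqs  -- {term: df across the collection}
    """
    return {t: tf_idf(tf, doc_freqs[t], num_docs)
            for t, tf in term_freqs.items()}
```

Note that a term appearing in every document gets weight 0 (log of 1), which is why rare-but-discriminative terms dominate the vectors.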
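[Editor's note] The adaptive learning-rate rule the ticket describes for tlogit (increase by 5% when the previous iteration's error fell, decrease by 50% when it rose) can be sketched as below; the function name is hypothetical, not from the patch:

```python
def next_learning_rate(rate, prev_error, curr_error):
    """Adjust the rate per the rule in the ticket:
    +5% when error is going down, -50% when it is going up."""
    if curr_error < prev_error:
        return rate * 1.05   # error improved: speed up slightly
    if curr_error > prev_error:
        return rate * 0.5    # error worsened: back off hard
    return rate              # error unchanged: keep the rate
```

The asymmetry (gentle increase, aggressive decrease) is a common heuristic: overshooting with too large a rate can diverge, while a slightly conservative rate only slows convergence.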