[
https://issues.apache.org/jira/browse/SOLR-9252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15384990#comment-15384990
]
Joel Bernstein edited comment on SOLR-9252 at 7/19/16 10:32 PM:
----------------------------------------------------------------
One of the things I've been thinking about is the function names. I think we
can shorten the featuresSelection function to just *features*, and change the
tlogit function to *train*. The syntax would then look like this:
{code}
train(collection1, q="*:*",
      features(collection1,
               q="*:*",
               field="tv_text",
               outcome="out_i",
               positiveLabel=1,
               numTerms=100),
      field="tv_text",
      outcome="out_i",
      maxIterations=100)
{code}
In the future both the *features* and the *train* functions can take a
parameter for setting the algorithm. The default algorithm in the initial
release will be *information gain* for feature selection and *logistic
regression* for training.
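For illustration, that could look something like the sketch below. The
*algorithm* parameter and its value names here are hypothetical, not a
committed syntax:
{code}
train(collection1, q="*:*",
      algorithm="logisticRegression",
      features(collection1,
               q="*:*",
               algorithm="informationGain",
               field="tv_text",
               outcome="out_i",
               positiveLabel=1,
               numTerms=100),
      field="tv_text",
      outcome="out_i",
      maxIterations=100)
{code}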
> Feature selection and logistic regression on text
> -------------------------------------------------
>
> Key: SOLR-9252
> URL: https://issues.apache.org/jira/browse/SOLR-9252
> Project: Solr
> Issue Type: Improvement
> Security Level: Public (Default Security Level. Issues are Public)
> Reporter: Cao Manh Dat
> Assignee: Joel Bernstein
> Attachments: SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch,
> SOLR-9252.patch, SOLR-9252.patch, enron1.zip
>
>
> SOLR-9186 came up with the challenge that on each iteration we have to
> rebuild the tf-idf vector for every document. This is computationally costly
> when a document is represented by a large number of terms. Feature selection
> can help reduce the computation.
> Due to its computational efficiency and simple interpretation, information
> gain is one of the most popular feature selection methods. It measures the
> dependence between features and labels by calculating the information gain
> between the i-th feature and the class labels
> (http://www.jiliang.xyz/publication/feature_selection_for_classification.pdf).
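> As a minimal, self-contained sketch (not the patch code), information gain
> for a single term with a binary outcome can be computed from four document
> counts; the class and method names below are made up for this example:
> {code}
> public class InfoGainSketch {
>
>     // Shannon entropy in bits over a probability distribution
>     static double entropy(double... probs) {
>         double h = 0.0;
>         for (double p : probs) {
>             if (p > 0) h -= p * (Math.log(p) / Math.log(2));
>         }
>         return h;
>     }
>
>     // n11: docs with term and positive label, n10: with term, negative label,
>     // n01: without term, positive label, n00: without term, negative label.
>     // Assumes all four conditional sums are non-zero.
>     static double informationGain(double n11, double n10, double n01, double n00) {
>         double n = n11 + n10 + n01 + n00;
>         double pPos = (n11 + n01) / n;   // P(label = positive)
>         double pTerm = (n11 + n10) / n;  // P(term present)
>
>         // H(label): entropy of the class distribution
>         double hLabel = entropy(pPos, 1 - pPos);
>
>         // H(label | term): entropy after conditioning on term presence
>         double hPresent = entropy(n11 / (n11 + n10), n10 / (n11 + n10));
>         double hAbsent  = entropy(n01 / (n01 + n00), n00 / (n01 + n00));
>         double hCond = pTerm * hPresent + (1 - pTerm) * hAbsent;
>
>         return hLabel - hCond;  // IG = H(label) - H(label | term)
>     }
>
>     public static void main(String[] args) {
>         // e.g. a term in 90 of 100 positive docs and 5 of 900 negative docs
>         System.out.println(informationGain(90, 5, 10, 895));  // ~0.36 bits
>     }
> }
> {code}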
> I confirmed this by running logistic regression on the Enron mail dataset
> (in which each email is represented by the top 100 terms with the highest
> information gain) and got an accuracy of 92% and a precision of 82%.
> This ticket will create two new streaming expressions. Both of them use the
> same *parallel iterative framework* as SOLR-8492.
> {code}
> featuresSelection(collection1, q="*:*", field="tv_text", outcome="out_i",
> positiveLabel=1, numTerms=100)
> {code}
> featuresSelection will emit the top terms with the highest information gain
> scores. It can be combined with the new tlogit stream.
> {code}
> tlogit(collection1, q="*:*",
>        featuresSelection(collection1,
>                          q="*:*",
>                          field="tv_text",
>                          outcome="out_i",
>                          positiveLabel=1,
>                          numTerms=100),
>        field="tv_text",
>        outcome="out_i",
>        maxIterations=100)
> {code}
> In iteration n, the text logistic regression will emit the nth model and
> compute the error of the (n-1)th model, because the error would be wrong if
> we computed it dynamically against a model that is still being updated
> within the same iteration.
> In each iteration tlogit will adjust the learning rate based on the error of
> the previous iteration: it will increase the learning rate by 5% if the
> error is going down and decrease it by 50% if the error is going up, as in
> the sketch below.
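> The following is a toy, self-contained Java sketch (not the tlogit
> implementation) of that schedule on a 1-D logistic model with synthetic
> data; the starting rate and the data generation are assumptions for
> illustration:
> {code}
> import java.util.Random;
>
> public class AdaptiveRateSketch {
>     public static void main(String[] args) {
>         Random rnd = new Random(42);
>         int n = 200;
>         double[] x = new double[n];
>         int[] y = new int[n];
>         for (int i = 0; i < n; i++) {
>             x[i] = rnd.nextGaussian();
>             y[i] = x[i] + 0.3 * rnd.nextGaussian() > 0 ? 1 : 0;  // noisy binary labels
>         }
>
>         double w = 0.0, b = 0.0;       // current model
>         double rate = 0.1;             // assumed starting learning rate
>         double prevError = Double.MAX_VALUE;
>
>         for (int iter = 0; iter < 100; iter++) {
>             // Error and gradient of the model from the previous iteration
>             double error = 0, gw = 0, gb = 0;
>             for (int i = 0; i < n; i++) {
>                 double p = 1.0 / (1.0 + Math.exp(-(w * x[i] + b)));
>                 error += Math.abs(y[i] - p);
>                 gw += (p - y[i]) * x[i];
>                 gb += (p - y[i]);
>             }
>             // +5% when the previous model's error went down, -50% when it went up
>             rate = error < prevError ? rate * 1.05 : rate * 0.5;
>             prevError = error;
>
>             // Gradient step emits the next model
>             w -= rate * gw / n;
>             b -= rate * gb / n;
>         }
>         System.out.printf("w=%.3f b=%.3f finalRate=%.4f%n", w, b, rate);
>     }
> }
> {code}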
> This will support use cases such as building models for spam detection,
> sentiment analysis and threat detection.