[ 
https://issues.apache.org/jira/browse/SOLR-9252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15385955#comment-15385955
 ] 

Joel Bernstein edited comment on SOLR-9252 at 7/20/16 2:39 PM:
---------------------------------------------------------------

This is part of the larger ticket SOLR-9258, which will provide more context.

Here are some specifics about this ticket: 

Logistic regression is a machine learning classification algorithm.

It's a binary classifier, so it's used to determine whether something belongs to a class or not.

With logistic regression you train a model using a training data set, and then 
use that model to classify other documents. 

This ticket trains a logistic regression model on text, so it builds a model 
based on the terms in the documents. New documents can then be classified based 
on the terms they contain.
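In code, the classification step looks roughly like this (a minimal Python sketch, not the Solr implementation; the term weights and the 0.5 threshold below are hypothetical):

```python
import math

def sigmoid(z):
    # Squashes a raw score into a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def classify(doc_terms, weights, bias=0.0, threshold=0.5):
    # Score a document by summing the learned weights of the terms it
    # contains, then convert the score to a probability.
    z = bias + sum(weights.get(term, 0.0) for term in doc_terms)
    return sigmoid(z) >= threshold

# Hypothetical term weights that a trained model might contain.
weights = {"viagra": 2.1, "free": 1.4, "meeting": -1.3}
print(classify({"free", "viagra"}, weights))  # True  -- positive-class terms dominate
print(classify({"meeting"}, weights))         # False -- negative-class term dominates
```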

The terms in the document are known as *features*. 

The first step in the process is feature selection: selecting the important 
terms from the training set that will be used to build the model. This ticket 
uses an algorithm called Information Gain to select the features.
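For a single term, information gain is H(label) - H(label | term present or absent); a small Python sketch of that computation (illustrative only, not the patch's implementation):

```python
import math

def entropy(pos, neg):
    # Shannon entropy of a binary label distribution, given class counts.
    total = pos + neg
    if total == 0:
        return 0.0
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h

def information_gain(docs, term):
    # docs: list of (terms, label) pairs; terms is a set, label is 0 or 1.
    # IG(term) = H(label) - H(label | term present or absent)
    pos = sum(label for _, label in docs)
    h_before = entropy(pos, len(docs) - pos)
    h_after = 0.0
    for subset in ([d for d in docs if term in d[0]],
                   [d for d in docs if term not in d[0]]):
        if subset:
            sub_pos = sum(label for _, label in subset)
            h_after += (len(subset) / len(docs)
                        * entropy(sub_pos, len(subset) - sub_pos))
    return h_before - h_after

docs = [({"free", "offer"}, 1), ({"free"}, 1), ({"meeting"}, 0), ({"agenda"}, 0)]
print(information_gain(docs, "free"))  # 1.0 -- "free" perfectly separates the labels
```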

The next step is to train a model based on those features. This ticket uses 
Stochastic Gradient Descent to train a logistic regression model over the 
training set. Stochastic Gradient Descent is an iterative approach.
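A bare-bones sketch of SGD training over term features (illustrative Python only; the actual patch works over Solr term vectors in a parallel iterative framework):

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_sgd(docs, iterations=50, learning_rate=0.1):
    # docs: list of (terms, label) pairs; terms is a set, label is 0 or 1.
    # Keeps one weight per term; each pass visits the examples one at a
    # time and moves the weights along the gradient of the log-loss.
    weights, bias = {}, 0.0
    rng = random.Random(42)  # fixed seed so runs are repeatable
    for _ in range(iterations):
        shuffled = list(docs)
        rng.shuffle(shuffled)
        for terms, label in shuffled:
            z = bias + sum(weights.get(t, 0.0) for t in terms)
            error = sigmoid(z) - label  # derivative of log-loss w.r.t. z
            bias -= learning_rate * error
            for t in terms:
                weights[t] = weights.get(t, 0.0) - learning_rate * error
    return weights, bias

docs = [({"free", "offer"}, 1), ({"free"}, 1), ({"meeting"}, 0), ({"agenda"}, 0)]
weights, bias = train_sgd(docs)
print(weights["free"] > 0)     # True -- term seen only in positive examples
print(weights["meeting"] < 0)  # True -- term seen only in negative examples
```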

Both the features and the model can then be stored in a SolrCloud collection.













> Feature selection and logistic regression on text
> -------------------------------------------------
>
>                 Key: SOLR-9252
>                 URL: https://issues.apache.org/jira/browse/SOLR-9252
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Cao Manh Dat
>            Assignee: Joel Bernstein
>         Attachments: SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, 
> SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, enron1.zip
>
>
> SOLR-9186 came up with a challenge: for each iteration we have to rebuild 
> the tf-idf vector for each document. This is a costly computation if we 
> represent a document by a lot of terms. Feature selection can help reduce 
> the computation.
> Due to its computational efficiency and simple interpretation, information 
> gain is one of the most popular feature selection methods. It is used to 
> measure the dependence between features and labels and calculates the 
> information gain between the i-th feature and the class labels 
> (http://www.jiliang.xyz/publication/feature_selection_for_classification.pdf).
> I confirmed this by running logistic regression on the Enron mail dataset (in 
> which each email is represented by the top 100 terms with the highest 
> information gain) and got an accuracy of 92% and a precision of 82%.
> This ticket will create two new streaming expressions. Both of them use the 
> same *parallel iterative framework* as SOLR-8492.
> {code}
> featuresSelection(collection1, q="*:*",  field="tv_text", outcome="out_i", 
> positiveLabel=1, numTerms=100)
> {code}
> featuresSelection will emit the top terms with the highest information gain 
> scores. It can be combined with the new tlogit stream.
> {code}
> tlogit(collection1, q="*:*",
>          featuresSelection(collection1, 
>                                       q="*:*",  
>                                       field="tv_text", 
>                                       outcome="out_i", 
>                                       positiveLabel=1, 
>                                       numTerms=100),
>          field="tv_text",
>          outcome="out_i",
>          maxIterations=100)
> {code}
> In iteration n, the text logistic regression will emit the nth model and 
> compute the error of the (n-1)th model, because the error would be wrong if 
> we computed it dynamically within the same iteration. 
> In each iteration tlogit will adjust the learning rate based on the error of 
> the previous iteration: it will increase the learning rate by 5% if the error 
> is going down and decrease it by 50% if the error is going up.
> This will support use cases such as building models for spam detection, 
> sentiment analysis and threat detection. 
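The adaptive learning-rate rule described in the ticket (grow by 5% while the error falls, halve it when the error rises) amounts to something like this small sketch (illustrative only, not the patch's code):

```python
def adjust_learning_rate(rate, current_error, previous_error):
    # Grow the step size by 5% while the error keeps falling;
    # halve it as soon as the error goes back up.
    if current_error < previous_error:
        return rate * 1.05
    if current_error > previous_error:
        return rate * 0.5
    return rate

rate = 0.01
rate = adjust_learning_rate(rate, current_error=0.30, previous_error=0.40)
print(round(rate, 6))  # 0.0105  -- error fell, so the rate grew 5%
rate = adjust_learning_rate(rate, current_error=0.35, previous_error=0.30)
print(round(rate, 6))  # 0.00525 -- error rose, so the rate was halved
```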



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
