[ 
https://issues.apache.org/jira/browse/SOLR-9252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15396444#comment-15396444
 ] 

Joel Bernstein edited comment on SOLR-9252 at 7/27/16 9:44 PM:
---------------------------------------------------------------

OK, here is my thinking on *train* versus *tlogit*:

The *train* function will initially map directly to the TextLogitStream. We can 
document that *train* is a text logistic regression model trainer in the first 
release.

As we add more algorithms, the *train* function will map to the *TrainStream*. 
The TrainStream won't have any implementations of its own; it will simply be a 
facade for different training algorithms. The TrainStream will have a parameter 
called *algorithm*, which it will use to select the stream implementation, such 
as TextLogitStream. The underlying implementation will handle the parameters; 
all the TrainStream will do is instantiate the algorithm and run it.

Sample syntax:
{code}
train(collection, 
      features(...), 
      algorithm="tlogit", 
      q="*:*", ....)
{code}
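The facade dispatch described above can be sketched in Java. This is only an 
illustration of the idea, not actual Solr code; the class, interface, and 
registry names here (TrainFacadeSketch, TrainingStream, ALGORITHMS) are 
hypothetical:

```java
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of the facade idea: the facade holds no training
// logic itself, it only selects an implementation by the "algorithm"
// parameter, instantiates it, and runs it.
public class TrainFacadeSketch {

    // Minimal stand-in for a training stream implementation.
    interface TrainingStream {
        String run();
    }

    // Registry mapping algorithm names to implementation factories.
    // A real facade would construct TextLogitStream etc. here.
    static final Map<String, Function<Map<String, String>, TrainingStream>> ALGORITHMS =
        Map.of("tlogit", params -> () -> "tlogit trained on q=" + params.get("q"));

    // The facade: look up the implementation by the algorithm selector
    // and delegate; all other parameters are handled by the implementation.
    static String train(Map<String, String> params) {
        String algorithm = params.getOrDefault("algorithm", "tlogit");
        Function<Map<String, String>, TrainingStream> factory = ALGORITHMS.get(algorithm);
        if (factory == null) {
            throw new IllegalArgumentException("Unknown algorithm: " + algorithm);
        }
        return factory.apply(params).run();
    }

    public static void main(String[] args) {
        System.out.println(train(Map.of("algorithm", "tlogit", "q", "*:*")));
    }
}
```

Adding a new training algorithm would then mean registering one more factory, 
with no change to the *train* syntax users see.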

We can use the same facade approach for the *classify* and *features* functions. 

The documentation can describe how to call *train* with the different 
algorithms. 

I like this approach because it provides three very easy-to-understand 
functions: train, classify, and features.

It also stops the explosion of functions that would occur once we have multiple 
classify, train, and features algorithms.

> Feature selection and logistic regression on text
> -------------------------------------------------
>
>                 Key: SOLR-9252
>                 URL: https://issues.apache.org/jira/browse/SOLR-9252
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Cao Manh Dat
>            Assignee: Joel Bernstein
>         Attachments: SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, 
> SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, enron1.zip
>
>
> SOLR-9186 came up with a challenge: for each iteration we have to rebuild the 
> tf-idf vector for every document. This is a costly computation if we 
> represent a doc by a lot of terms. Feature selection can help reduce the 
> computation.
> Due to its computational efficiency and simple interpretation, information 
> gain is one of the most popular feature selection methods. It is used to 
> measure the dependence between features and labels, calculating the 
> information gain between the i-th feature and the class labels 
> (http://www.jiliang.xyz/publication/feature_selection_for_classification.pdf).
> I confirmed this by running logistic regression on the Enron mail dataset (in 
> which each email is represented by the top 100 terms with the highest 
> information gain) and got 92% accuracy and 82% precision.
> This ticket will create two new streaming expressions. Both of them use the 
> same *parallel iterative framework* as SOLR-8492.
> {code}
> featuresSelection(collection1, q="*:*",  field="tv_text", outcome="out_i", 
> positiveLabel=1, numTerms=100)
> {code}
> featuresSelection will emit the top terms with the highest information gain 
> scores. It can be combined with the new tlogit stream.
> {code}
> tlogit(collection1, q="*:*",
>          featuresSelection(collection1, 
>                                       q="*:*",  
>                                       field="tv_text", 
>                                       outcome="out_i", 
>                                       positiveLabel=1, 
>                                       numTerms=100),
>          field="tv_text",
>          outcome="out_i",
>          maxIterations=100)
> {code}
> In iteration n, the text logistic regression will emit the nth model and 
> compute the error of the (n-1)th model, because the error would be wrong if 
> we computed it dynamically within the same iteration. 
> In each iteration tlogit will adjust the learning rate based on the error of 
> the previous iteration: it will increase the learning rate by 5% if the error 
> is going down and decrease it by 50% if the error is going up.
> This will support use cases such as building models for spam detection, 
> sentiment analysis and threat detection. 
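The adaptive learning-rate rule quoted above (increase by 5% when the error 
decreases, halve when it increases) can be sketched as a small standalone 
helper. This is only an illustration of the stated rule, not tlogit's actual 
code; the class and method names are hypothetical:

```java
// Hypothetical sketch of the learning-rate schedule described in the ticket:
// +5% on decreasing error, -50% on increasing error, unchanged otherwise.
public class LearningRateSketch {

    static double adjust(double rate, double prevError, double currError) {
        if (currError < prevError) {
            return rate * 1.05;   // error going down: increase rate by 5%
        } else if (currError > prevError) {
            return rate * 0.5;    // error going up: halve the rate
        }
        return rate;              // error unchanged: keep the rate
    }

    public static void main(String[] args) {
        double rate = 0.01;
        rate = adjust(rate, 0.40, 0.35); // error dropped, so rate grows
        rate = adjust(rate, 0.35, 0.50); // error rose, so rate is halved
        System.out.println(rate);
    }
}
```

The asymmetry (small multiplicative growth, aggressive halving) keeps the rate 
from running away when the error starts climbing.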



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
