[ 
https://issues.apache.org/jira/browse/SOLR-9252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15394730#comment-15394730
 ] 

Joel Bernstein commented on SOLR-9252:
--------------------------------------

Ok, I have the patch running and it looks great.

I have the following expression running:

{code}
train(training, 
        features(training, q="*:*", featureSet="first", field="body", outcome="out_i", numTerms=200), 
        q="*:*", 
        name="model", 
        field="body", 
        outcome="out_i", 
        maxIterations=100)
{code}

In the patch, *train* is still the function name in the /stream handler, but 
we can make a final decision on the name before committing.
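
For anyone who wants to try the expression above, here is a minimal sketch of building the request URL for the /stream handler. The host, port and the choice of sending the expression as an `expr` query parameter to the `training` collection are assumptions for illustration; `build_stream_url` is a hypothetical helper, not part of the patch.

```python
from urllib.parse import urlencode


def build_stream_url(base_url, collection, expr):
    """Return a /stream URL with the expression URL-encoded as `expr`."""
    query = urlencode({"expr": expr})
    return f"{base_url}/{collection}/stream?{query}"


# The expression from this comment, collapsed onto one line.
expr = (
    'train(training, '
    'features(training, q="*:*", featureSet="first", field="body", '
    'outcome="out_i", numTerms=200), '
    'q="*:*", name="model", field="body", outcome="out_i", '
    'maxIterations=100)'
)

# Assumed local Solr address; adjust to your own installation.
url = build_stream_url("http://localhost:8983/solr", "training", expr)
print(url)
```

The URL can then be fetched with any HTTP client. For long expressions a POST with the same `expr` parameter may be the safer choice.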

The accuracy seems to be 98% on the Enron training data with this patch. Here 
is the final model:

{code}
{
                        "idfs_ds": [1.2627703388716238, 1.2043595767152093, 
1.3886172425360304, 1.5488587854881268, 1.6127302558747882, 2.1359177807201526, 
1.514866246141212, 1.7375701403808523, 1.6166175299631897, 1.756428159015249, 
1.7929202354640175, 1.2834893120635762, 1.899442866302021, 1.8639061320252337, 
1.7631697575821685, 1.6820002892260415, 1.4411352768194767, 2.103708877350535, 
1.2225773869965861, 2.208893321170597, 1.878981794430681, 2.043737027506736, 
2.2819184561854864, 2.3264563106163885, 1.9336117619172708, 2.0467265663551024, 
1.7386696457142692, 2.468795829515302, 2.069437610615317, 2.6294363202479327, 
3.7388303845193307, 2.5446615802900157, 1.7430797961918219, 3.0787440662202736, 
1.9579702057493114, 2.289523055570706, 1.5362003886162032, 2.7549569891263763, 
3.955894889757158, 2.587435396273302, 3.945844553903657, 1.003513057076781, 
3.0416264032637708, 2.248395764146843, 4.018415246738492, 2.2876164773001246, 
3.3636289340509933, 1.2438124251270097, 2.733903579928544, 3.439026951535205, 
0.6709665389201712, 0.9546224358275518, 2.8080115520822657, 2.477970205791343, 
2.2631561797299637, 3.2378087608499606, 0.36177021415584676, 
4.1083634834014315, 4.120197941048435, 2.471081544796158, 2.4241455557775633, 
2.923393626201111, 2.9269972337044097, 3.2987413118451183, 2.383498249003407, 
4.168988105217867, 2.877691472720256, 4.233526626355437, 3.8505343740993316, 
2.3264563106163885, 2.6429318017228174, 4.260555298743357, 3.0058372954121855, 
3.8688835127675283, 3.021585652380325, 3.0295538220295017, 1.9620882623582288, 
3.469610374907285, 3.945844553903657, 3.4821105376715167, 4.3169082352944885, 
2.520329479630485, 3.609372317282444, 3.070375816549757, 4.220281399605417, 
3.9866665484239117, 3.6165408067610563, 3.788840805093992, 4.392131656532076, 
4.392131656532076, 2.837281934382379, 3.698984475972131, 4.331507034715641, 
2.360699334038601, 2.7368842080666815, 3.730733174286711, 3.1991566064156816, 
4.4238803548466565, 2.4665153268165767, 3.175736332207583, 3.2378087608499606, 
4.376627469996111, 3.3525177086259226, 3.28315658082842, 4.156565585219309, 
1.6462639699299098, 2.673278958112109, 4.331507034715641, 3.955894889757158, 
2.7764631943473397, 3.0497565293470212, 1.79060004880832, 3.6237610547345436, 
1.6244377066690232, 2.948895919012047, 3.175736332207583, 2.850571166501062, 
4.073677925413541, 2.725014632511298, 3.1573871935393867, 4.562030693327474, 
3.5403794457954922, 4.580722826339627, 4.580722826339627, 3.189722574182323, 
3.1665196771026594, 3.3306589148134234, 1.9745451708435238, 3.3306589148134234, 
2.795272526304836, 3.3415285870503273, 4.407880013500216, 4.4238803548466565, 
2.6902285164258823, 3.668212817305377, 4.543681554659277, 2.559550192783766, 
1.5452257206382456, 2.2631561797299637, 4.659194441781121, 3.2678110111537597, 
3.878185905429842, 3.3525177086259226, 3.374865007317919, 3.780330115426083, 
4.376627469996111, 3.433020927474993, 3.6758174166905966, 4.288334862850433, 
3.2378087608499606, 4.490571729345329, 2.9269972337044097, 4.029226162842708, 
3.0538465145985465, 4.440140875718437, 3.533734903076824, 4.659194441781121, 
4.659194441781121, 4.525663049156599, 3.706827653433157, 3.1172927363375087, 
4.490571729345329, 2.552078177945065, 2.087985282971078, 4.83744267318744, 
4.562030693327474, 4.09666744363824, 4.659194441781121, 1.802255192400069, 
4.599771021310321, 3.788840805093992, 4.8621352857778115, 4.6798137289838575, 
4.376627469996111, 3.272900080661231, 3.8970543897342247, 4.638991734463602, 
4.638991734463602, 4.813345121608379, 4.813345121608379, 4.8621352857778115, 
4.83744267318744, 3.588170109631841, 4.13217413209515, 4.599771021310321, 
4.331507034715641, 3.134914337687328, 4.525663049156599, 4.722373343402653, 
3.955894889757158, 4.967495801435638, 4.580722826339627, 4.967495801435638, 
4.9134285801653625, 4.887453093762102, 4.407880013500216, 4.246949646687578, 
2.198385343572182, 1.5963758750107606, 4.007719957621744],
                        "alpha_d": 7.150861416624748E-4,
                        "terms_ss": ["enron", "2000", "cc", "hpl", "daren", 
"http", "gas", "forwarded", "pm", "ect", "hou", "thanks", "meter", "2001", 
"attached", "deal", "am", "farmer", "your", "nom", "corp", "more", "mmbtu", 
"xls", "here", "j", "let", "volumes", "questions", "www", "2004", "sitara", 
"no", "money", "01", "volume", "know", "best", "meds", "bob", "prescription", 
"please", "online", "file", "viagra", "02", "stop", "me", "nomination", "v", 
"on", "i", "click", "texas", "03", "prices", "for", "paliourg", "php", "09", 
"contract", "fyi", "actuals", "u", "04", "pain", "713", "drugs", "microsoft", 
"email", "robert", "cialis", "melissa", "investment", "teco", "pat", "11", 
"save", "professional", "world", "biz", "flow", "dollars", "noms", "2005", 
"act", "remove", "results", "soft", "xp", "mary", "80", "spam", "following", 
"06", "software", "n", "dealer", "08", "ena", "offer", "sex", "products", 
"special", "compliance", "see", "free", "cheap", "html", "07", "gary", "000", 
"low", "our", "houston", "many", "april", "size", "r", "tap", "lots", 
"product", "pills", "xanax", "vance", "ami", "chokshi", "12", "clynes", 
"ticket", "counterparty", "super", "thousand", "daily", "offers", "weight", 
"05", "all", "call", "photoshop", "julie", "stock", "lisa", "steve", "million", 
"health", "site", "quality", "stocks", "link", "featured", "net", 
"international", "most", "investing", "works", "readers", "uncertainties", 
"differ", "news", "david", "seek", "31", "only", "1933", "creative", "windows", 
"subscribers", "should", "adobe", "security", "1934", "valium", "brand", 
"visit", "action", "canon", "pharmacy", "sexual", "inherent", "construed", 
"assumptions", "internet", "mobile", "risks", "wide", "smith", "ex", "pill", 
"states", "projections", "medications", "predictions", "anticipates", 
"deciding", "events", "advice", "now", "com", "browser"],
                        "iteration_i": 100,
                        "weights_ds": [0.9524452699893067, -2.9257423290160225, 
-2.122240862520573, -0.40259380863176036, -1.242508927269482, 
-2.1933952666745924, 0.9119553386109202, -1.3359582128074137, 
-1.1717690853817335, -0.9029380383621088, -1.970576222154978, 
-0.9180539343040344, -2.031736167842155, -1.382820037232718, 
-1.4296530557007743, -1.5015080966872794, -0.852373483913152, 
-0.2883706803921614, -0.2366741375717678, 0.2966401203916763, 
-0.6792566685980972, -0.18912751254722837, 0.10265566994945839, 
-1.0065678789783332, -0.8967357570889625, 0.041722607774742765, 
-0.2832721589409925, -0.400560390908784, -0.6945385025086017, 
-0.8488391208665993, -0.31851465800191403, 1.570768257518063, 
-1.5144615060332418, 0.9411280928801138, 0.738478999511349, 
-0.6875177906594712, -0.47841730767672286, -0.20502227184813, 
0.4858041557455349, 1.389551367014946, -0.8886199496843126, 0.8029699876855549, 
-0.7760217032166719, 0.40175437931353053, -0.6231018791954438, 
1.0261571991645586, -0.44254206613371744, 0.31955072203529183, 
-0.24171600421157927, -0.632533557090375, 0.774533771979748, 
-1.1164595912116915, -0.2954704188664946, 0.27653823698423186, 
-1.157867306631878, -5.49332153268076E-5, 0.6916900118076985, 
-1.305726586870522, 1.370623007467874, 1.1100575515185573, 0.40953153124448194, 
-0.4273267120664356, -0.5536271317082946, -0.03575915648164506, 
0.20475308352558616, -0.2919021960690356, 1.1094392826383312, 
-1.24904822249928, 1.038764158800864, 0.10525284214114823, 0.1973739189626828, 
-0.33283870614700184, 1.0555375704790861, 0.25856879498650104, 
0.921918816504445, -0.15711181528461088, -0.3594966291171786, 
-0.6659758614594922, -0.3342439009175488, 0.3592708173532555, 
0.12872616265365205, 1.362140022970902, -0.2699930594417464, 
0.7449118829650243, -0.12665949567352622, 1.1289376146405283, 
0.1653713075673579, 0.7008424353370497, 0.47095485852014707, 1.021689093687625, 
1.0049928692400525, -0.18114402652386635, 0.4403400905532737, 
1.0570966104647033, -1.167541821576636, -0.4428853975686944, 
0.20694894484760668, 0.15472835818468766, 1.0009582999260647, 
0.013730849275970687, -0.3882888402977611, 0.14102499499877702, 
1.1560852477692065, -0.822855520787489, -0.1468595831916683, 
0.9069870716505091, -0.18884872126960675, -0.19213990843838719, 
-0.0032534107278622496, 0.2715800337813452, 0.0888346122807297, 
-0.37031213468904256, -0.07224227291981163, 0.08850381657180348, 
0.20501283264716516, -0.5852130122059844, 0.11807896760332989, 
-1.3196626232666966, 0.5324969558412787, 0.7667504164777665, 
0.11805357030082002, 1.0020954114301253, -0.10885082229805468, 
1.003094962524753, 1.0000914796917044, 0.0094959191513861, -0.5127276009526891, 
0.059129413669497796, -0.49311249434449955, 0.34652229330274653, 
-0.7618731785587705, -0.3514318991274448, 0.7742232232987654, 
0.7575763908124484, -0.25192129997930635, -0.24220187762559128, 
1.0014232005812307, -0.3453736248293833, -0.1121687186012911, 
-0.15547543099631278, 1.0840890597241875, -0.2879034857435273, 
-0.227656977034567, -0.3716602841157388, 0.18007113168986144, 
0.8297688092273079, 1.405797209837956, 0.3921445898278919, 1.079363745455813, 
-0.6253022693091732, 0.33155358331572704, 0.9644709831096733, 
-0.19686285814583682, 1.1069098903214452, -0.19597970694899214, 
-0.29329229099344734, -0.037185151648282316, 1.0010206696926418, 
1.0096586146138415, 0.9523090849946898, 0.34253175617551923, -0.41826608329006, 
0.7213729935258942, -0.47416007242000024, 0.3210039942978008, 1.0, 
0.9772041721907345, 0.2533596337281238, 0.9839657417973666, 
-0.7583308570783015, 0.9476391050914625, 0.2534925274818649, 1.0, 
1.0001125385832383, 0.37796474985487505, 0.3839828352290301, 
0.44224405246124543, 1.046072941713049, 1.1205405856642119, 0.9165436674154628, 
0.9586701268580604, 1.0000000000000968, 0.9860828147022696, 
-0.32499900116244823, 1.1624049652694368, 0.4966278258894532, 
-0.14840111822378488, 0.15131204240736265, 1.114787005544689, 
1.1782663102351227, 0.21291210471466848, 1.0000000000385034, 
0.9564718923455356, 1.0110628413440756, 1.000156375636503, 0.9763045864950046, 
0.2630059727829917, 0.24199402427272665, 0.2736018381908099, 
-0.7673296746900424, -0.1899398724099395],
                        "field_s": "body",
                        "trueNegative_i": 3570,
                        "falseNegative_i": 35,
                        "falsePositive_i": 75,
                        "error_d": 176.8112932306374,
                        "truePositive_i": 1381,
                        "id": "model_100"
                }
{code}
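
The 98% figure can be checked against the confusion counts stored in the model document. A small sketch, using only the `truePositive_i`, `trueNegative_i`, `falsePositive_i` and `falseNegative_i` values shown above (the helper name `confusion_metrics` is made up for this example):

```python
def confusion_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision and recall from confusion counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# Counts copied from the "model_100" document above.
metrics = confusion_metrics(tp=1381, tn=3570, fp=75, fn=35)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts the accuracy works out to roughly 0.978, which matches the 98% quoted above. Precision is about 0.948 and recall about 0.975.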

> Feature selection and logistic regression on text
> -------------------------------------------------
>
>                 Key: SOLR-9252
>                 URL: https://issues.apache.org/jira/browse/SOLR-9252
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Cao Manh Dat
>            Assignee: Joel Bernstein
>         Attachments: SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, 
> SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, enron1.zip
>
>
> SOLR-9186 raised a challenge: in each iteration we have to rebuild the 
> tf-idf vector for every document. This is a costly computation if each 
> document is represented by a large number of terms. Feature selection can 
> help reduce the computation.
> Due to its computational efficiency and simple interpretation, information 
> gain is one of the most popular feature selection methods. It measures the 
> dependence between features and labels by calculating the information gain 
> between the i-th feature and the class labels 
> (http://www.jiliang.xyz/publication/feature_selection_for_classification.pdf).
> I confirmed this by running logistic regression on the Enron mail dataset (in 
> which each email is represented by the top 100 terms with the highest 
> information gain), which gave 92% accuracy and 82% precision.
> This ticket will create two new streaming expressions. Both of them use the 
> same *parallel iterative framework* as SOLR-8492.
> {code}
> featuresSelection(collection1, q="*:*",  field="tv_text", outcome="out_i", 
> positiveLabel=1, numTerms=100)
> {code}
> featuresSelection will emit the top terms that have the highest information 
> gain scores. It can be combined with the new tlogit stream.
> {code}
> tlogit(collection1, q="*:*",
>          featuresSelection(collection1, 
>                                       q="*:*",  
>                                       field="tv_text", 
>                                       outcome="out_i", 
>                                       positiveLabel=1, 
>                                       numTerms=100),
>          field="tv_text",
>          outcome="out_i",
>          maxIterations=100)
> {code}
> In iteration n, the text logistic regression emits the nth model and 
> computes the error of the (n-1)th model, because the error would be wrong if 
> it were computed dynamically within the same iteration. 
> In each iteration, tlogit adjusts the learning rate based on the error of the 
> previous iteration: it increases the learning rate by 5% if the error is 
> going down and decreases it by 50% if the error is going up.
> This will support use cases such as building models for spam detection, 
> sentiment analysis and threat detection. 
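
The information gain scoring mentioned in the quoted description can be illustrated with a short sketch. It assumes a binary feature (the term is present in the document or not) and a binary label, and uses the standard definition IG = H(label) - H(label | term). This is not the patch's implementation, and the function names are made up for this example.

```python
import math


def entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    h = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            h -= p * math.log2(p)
    return h


def information_gain(pos_with, pos_without, neg_with, neg_without):
    """Information gain of a term with respect to a binary label.

    pos_with / neg_with: positive / negative docs containing the term.
    pos_without / neg_without: positive / negative docs lacking it.
    """
    total = pos_with + pos_without + neg_with + neg_without
    if total == 0:
        return 0.0
    # Entropy of the labels before looking at the term.
    h_label = entropy([pos_with + pos_without, neg_with + neg_without])
    # Entropy of the labels after splitting on term presence.
    n_with = pos_with + neg_with
    n_without = pos_without + neg_without
    h_cond = (n_with / total) * entropy([pos_with, neg_with]) \
        + (n_without / total) * entropy([pos_without, neg_without])
    return h_label - h_cond


# A term that perfectly separates the classes vs. one that tells us nothing.
perfect = information_gain(50, 0, 0, 50)
useless = information_gain(25, 25, 25, 25)
print(perfect, useless)
```

Ranking all candidate terms by this score and keeping the top `numTerms` is the general idea behind the selection step.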

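The learning rate rule in the quoted description (up 5% when the error falls, down 50% when it rises) can be sketched as follows. What happens when the error is unchanged is not stated in the description, so leaving the rate as it is in that case is an assumption, and `next_learning_rate` is a hypothetical name.

```python
def next_learning_rate(alpha, previous_error, current_error):
    """Adjust the learning rate from the change in error between iterations."""
    if current_error < previous_error:
        return alpha * 1.05  # error going down: increase by 5%
    if current_error > previous_error:
        return alpha * 0.5   # error going up: decrease by 50%
    return alpha             # assumption: no change when the error is flat


# Made-up error values, only to show how the rate evolves.
alpha = 0.01
errors = [300.0, 250.0, 260.0, 240.0]
for prev, curr in zip(errors, errors[1:]):
    alpha = next_learning_rate(alpha, prev, curr)
    print(f"error {prev} -> {curr}: alpha = {alpha:.6f}")
```

The asymmetry means a single bad step undoes many good ones, which keeps the step size conservative once the error starts to oscillate.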


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
