Re: alerting system with Solr's Streaming Expressions
Hello Joel,

I took a bigger trainingSet of around 200K documents (Amazon reviews) and it worked out well. I verified the extracted feature terms, and the classify function was able to output the correct probability of reviews being negative or positive. Big thanks for adding this.

I wonder what you plan to implement next towards NLU in Solr, where queries like "average revenue in last quarter" etc. can be converted to streaming functions that return appropriate results.

Thanks,
Susheel

On Thu, Feb 9, 2017 at 11:23 AM, Susheel Kumar wrote:
> Got it. Thanks, Joel.
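For readers who want to reproduce the scoring step mentioned above, here is a minimal Python sketch that only assembles a classify expression as a string. It assumes the classify(model(...), search(...), field=...) shape described in the Solr Reference Guide; the collection name `reviews`, the `rows` value, and the reuse of the model name `threatModel` are placeholders, not details confirmed in this thread.

```python
def classify_expr(model_collection, model_id, doc_collection, field, rows=50):
    """Build a classify() expression that scores documents with a stored model.

    The overall shape is an assumption taken from the Solr Reference Guide;
    check it against the Solr version in use.
    """
    return (
        f'classify(model({model_collection}, id="{model_id}", cacheMillis=5000), '
        f'search({doc_collection}, q="*:*", fl="id,{field}", '
        f'sort="id asc", rows="{rows}"), '
        f'field="{field}")'
    )


# "models" and "body_txt" appear in the thread; "reviews" is hypothetical.
expr = classify_expr("models", "threatModel", "reviews", "body_txt")
print(expr)
```

The resulting string would be sent to the collection's /stream handler; each returned tuple carries the score_d and probability_d fields seen elsewhere in this thread.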
Re: alerting system with Solr's Streaming Expressions
Got it. Thanks, Joel.

On Thu, Feb 9, 2017 at 11:17 AM, Susheel Kumar wrote:
> I increased numTerms from 250 to 2500 and maxIterations from 100 to 1000
> when I didn't get the expected result. Let me add more examples.
>
> Thanks,
> Susheel
Re: alerting system with Solr's Streaming Expressions
I increased numTerms from 250 to 2500 and maxIterations from 100 to 1000 when I didn't get the expected result. Let me add more examples.

Thanks,
Susheel

On Thu, Feb 9, 2017 at 11:03 AM, Joel Bernstein wrote:
> A few things that I see right off:
>
> 1) 2500 terms is too many. I was testing with 100-250 terms.
> 2) 1000 iterations is too high. If the model hasn't converged by 100
> iterations, it's likely not going to converge.
> 3) You're going to need more examples. You may want to run features first
> and see what it selects. Then you need multiple examples for each feature.
> I was testing with the Enron ham/spam data set. It would be good to
> download that dataset and see what it looks like.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
Re: alerting system with Solr's Streaming Expressions
Also, you can see in the final iteration of the model that there are 8 true positives and 8 false positives. So this model classifies everything as positive. At that point you know that it's not a good model.

Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Feb 9, 2017 at 11:03 AM, Joel Bernstein wrote:
> A few things that I see right off:
>
> 1) 2500 terms is too many. I was testing with 100-250 terms.
> 2) 1000 iterations is too high. If the model hasn't converged by 100
> iterations, it's likely not going to converge.
> 3) You're going to need more examples. You may want to run features first
> and see what it selects. Then you need multiple examples for each feature.
> I was testing with the Enron ham/spam data set. It would be good to
> download that dataset and see what it looks like.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
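The observation about 8 true positives and 8 false positives can be checked with a little arithmetic. This is a minimal Python sketch, assuming the other two confusion-matrix cells (true negatives and false negatives) are 0, which is what "classifies everything as positive" implies for 8 positive and 8 negative training documents. The function name is hypothetical.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Return accuracy, precision and recall from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# tp=8 and fp=8 come from the thread; tn=0 and fn=0 are assumed,
# because every document was labelled positive.
metrics = confusion_metrics(tp=8, fp=8, tn=0, fn=0)
print(metrics)
```

Perfect recall combined with accuracy and precision of one half is the signature of a model that always answers "positive" on a balanced set, so it has learned nothing about the negative class.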
Re: alerting system with Solr's Streaming Expressions
A few things that I see right off:

1) 2500 terms is too many. I was testing with 100-250 terms.
2) 1000 iterations is too high. If the model hasn't converged by 100 iterations, it's likely not going to converge.
3) You're going to need more examples. You may want to run features first and see what it selects. Then you need multiple examples for each feature.

I was testing with the Enron ham/spam data set. It would be good to download that dataset and see what it looks like.

Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Feb 9, 2017 at 10:15 AM, Susheel Kumar wrote:
> Hello Joel,
>
> Here is the final iteration in JSON format.
>
> https://www.dropbox.com/s/g3a3606ms6cu8q4/final_iteration.json?dl=0
>
> Below is the expression used:
>
> update(models,
>        batchSize="50",
>        train(trainingSet,
>              features(trainingSet,
>                       q="*:*",
>                       featureSet="threatFeatures",
>                       field="body_txt",
>                       outcome="out_i",
>                       numTerms=2500),
>              q="*:*",
>              name="threatModel",
>              field="body_txt",
>              outcome="out_i",
>              maxIterations="1000"))
>
> I have just 16 documents, 8 positive and 8 negative. The field that
> contains the feedback is body_txt (text_general type).
>
> Thanks for looking.
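The three suggestions above can be sketched as a pair of expressions: a standalone features expression to inspect the selected terms first, and a training expression with smaller settings. This Python sketch only builds the strings. The parameter names are copied from the expression in the thread; the values 250 and 100 are simply picked from the ranges Joel mentions, and the helper names are hypothetical.

```python
def features_expr(collection, feature_set, field, outcome, num_terms):
    """Build a features() expression; run it alone to see the selected terms."""
    return (
        f'features({collection}, q="*:*", featureSet="{feature_set}", '
        f'field="{field}", outcome="{outcome}", numTerms={num_terms})'
    )


def train_expr(collection, features, model_name, field, outcome, max_iterations):
    """Wrap a features() expression in train(), as in the thread's expression."""
    return (
        f'train({collection}, {features}, q="*:*", name="{model_name}", '
        f'field="{field}", outcome="{outcome}", maxIterations="{max_iterations}")'
    )


def update_expr(target, batch_size, inner):
    """Store the output of the inner expression in the target collection."""
    return f'update({target}, batchSize="{batch_size}", {inner})'


# Step 1: inspect which terms get selected before training anything.
feats = features_expr("trainingSet", "threatFeatures", "body_txt", "out_i", num_terms=250)
print(feats)

# Step 2: train with the smaller settings and store each iteration in "models".
expr = update_expr(
    "models", 50,
    train_expr("trainingSet", feats, "threatModel", "body_txt", "out_i", max_iterations=100),
)
print(expr)
```

Building the features expression separately keeps the inspection step and the training step in sync, since both use exactly the same term selection.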
Re: alerting system with Solr's Streaming Expressions
Hello Joel,

Here is the final iteration in JSON format.

https://www.dropbox.com/s/g3a3606ms6cu8q4/final_iteration.json?dl=0

Below is the expression used:

update(models,
       batchSize="50",
       train(trainingSet,
             features(trainingSet,
                      q="*:*",
                      featureSet="threatFeatures",
                      field="body_txt",
                      outcome="out_i",
                      numTerms=2500),
             q="*:*",
             name="threatModel",
             field="body_txt",
             outcome="out_i",
             maxIterations="1000"))

I have just 16 documents, 8 positive and 8 negative. The field that contains the feedback is body_txt (text_general type).

Thanks for looking.

On Wed, Feb 8, 2017 at 7:52 AM, Joel Bernstein wrote:
> Can you post the final iteration of the model?
>
> Also the expression you used to train the model?
>
> How much training data do you have? How many positive examples and
> negative examples?
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
Re: alerting system with Solr's Streaming Expressions
Can you post the final iteration of the model?

Also the expression you used to train the model?

How much training data do you have? How many positive examples and negative examples?

Joel Bernstein
http://joelsolr.blogspot.com/

On Tue, Feb 7, 2017 at 2:14 PM, Susheel Kumar wrote:
> Hello,
>
> I tried to follow http://joelsolr.blogspot.com/ to see if we can
> classify positive and negative feedback using streaming expressions.
> Everything works, but in the end result the probability_d value from the
> classify expression is similar for positive and negative feedback. See below.
>
> What might I be missing here? Do I need to put more data in the training
> set, or something else?
>
> { "result-set": { "docs": [ { "body_txt": [ "love the company" ],
> "score_d": 2.1892474120319667, "id": "6", "probability_d":
> 0.977944433135261 }, { "body_txt": [ "bad experience " ], "score_d":
> 3.1689453250842914, "id": "5", "probability_d": 0.9888109278133054 }, {
> "body_txt": [ "This company rewards its employees, but you should only work
> here if you truly love sales. The stress of the job can get to you and they
> definitely push you." ], "score_d": 4.621702323888672, "id": "4",
> "probability_d": 0.99898557 }, { "body_txt": [ "no chance for
> advancement with that company every year I was there it got worse I don't
> know if all branches of adp but Florence organization was turn over rate
> would be higher if it was for temp workers" ], "score_d":
> 5.288898825826228, "id": "3", "probability_d": 0.9956 }, {
> "body_txt": [ "It was a pleasure to work at the Milpitas campus. The team
> that works there are professional and dedicated individuals. The level of
> loyalty and dedication is impressive" ], "score_d": 2.5303947056922937,
> "id": "2", "probability_d": 0.990430778418 },
alerting system with Solr's Streaming Expressions
Hello,

I tried to follow http://joelsolr.blogspot.com/ to see if we can classify positive and negative feedback using streaming expressions. Everything works, but in the end result the probability_d value from the classify expression is similar for positive and negative feedback. See below.

What might I be missing here? Do I need to put more data in the training set, or something else?

{ "result-set": { "docs": [
    { "body_txt": [ "love the company" ],
      "score_d": 2.1892474120319667, "id": "6",
      "probability_d": 0.977944433135261 },
    { "body_txt": [ "bad experience " ],
      "score_d": 3.1689453250842914, "id": "5",
      "probability_d": 0.9888109278133054 },
    { "body_txt": [ "This company rewards its employees, but you should only work here if you truly love sales. The stress of the job can get to you and they definitely push you." ],
      "score_d": 4.621702323888672, "id": "4",
      "probability_d": 0.99898557 },
    { "body_txt": [ "no chance for advancement with that company every year I was there it got worse I don't know if all branches of adp but Florence organization was turn over rate would be higher if it was for temp workers" ],
      "score_d": 5.288898825826228, "id": "3",
      "probability_d": 0.9956 },
    { "body_txt": [ "It was a pleasure to work at the Milpitas campus. The team that works there are professional and dedicated individuals. The level of loyalty and dedication is impressive" ],
      "score_d": 2.5303947056922937, "id": "2",
      "probability_d": 0.990430778418 },
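The symptom in the result set above can be made concrete with a quick check. This is a minimal Python sketch that copies the id and probability_d values exactly as posted (some may have lost digits in the archive) and measures how far apart they are. The 0.5 cut-off is an assumed decision threshold, not something stated in the thread.

```python
# id and probability_d copied from the posted result set, as shown in the archive.
docs = [
    {"id": "6", "probability_d": 0.977944433135261},   # "love the company"
    {"id": "5", "probability_d": 0.9888109278133054},  # "bad experience"
    {"id": "4", "probability_d": 0.99898557},
    {"id": "3", "probability_d": 0.9956},
    {"id": "2", "probability_d": 0.990430778418},
]

probs = [d["probability_d"] for d in docs]
spread = max(probs) - min(probs)

# Assumed threshold: anything above 0.5 counts as the positive class.
all_positive = all(p > 0.5 for p in probs)

print(f"min={min(probs):.4f} max={max(probs):.4f} spread={spread:.4f}")
print("all above 0.5:", all_positive)
```

A spread this small, with clearly negative text such as "bad experience" scoring as high as clearly positive text, suggests the model is not separating the two classes at all.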