Thank you, Tim. I appreciated the tips. At this point, I'm just trying to understand how to use it. The 30 tweets that I've selected so far are, in fact, threatening. The things people say! My favorite so far is 'disingenuous twat waffle'. No kidding.

The issue I'm having is not with the model itself; it's with creating the model from a query other than *:*.

Example:

update(models2, batchSize="50",
             train(TRAINING,
                      features(TRAINING,
                                     q="*:*",
                                     featureSet="threat1",
                                     field="ClusterText",
                                     outcome="out_i",
                                     positiveLabel=1,
                                     numTerms=100),
                      q="*:*",
                      name="threat1",
                      field="ClusterText",
                      outcome="out_i",
                      maxIterations="100"))

This works great: it builds a model, the model works, and I can see reasonable results. However, say I've tagged a training set inside a larger collection called COL1 using a field called JoeID, like this:

update(models2, batchSize="50",
             train(COL1,
                      features(COL1,
                                     q="JoeID:Training",
                                     featureSet="threat2",
                                     field="ClusterText",
                                     outcome="out_i",
                                     positiveLabel=1,
                                     numTerms=1000),
                      q="JoeID:Training",
                      name="threat2",
                      field="ClusterText",
                      outcome="out_i",
                      maxIterations="100"))

This does not work as expected. I can query the COL1 collection for JoeID:Training and get exactly the result set I want to train on, but the model creation doesn't seem to work. At this point, if I want to build a model, I have to create a separate collection, load the training set into it, and then train on *:* (sketched below). That works, but I'm not sure it's how it is supposed to work.
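
For reference, here is roughly the workaround: copy the tagged docs into a dedicated TRAINING collection, then train on *:*. This is only a sketch; it assumes TRAINING already exists, that rows is large enough to cover the whole training set, and a Solr recent enough to have the commit() decorator (otherwise, commit separately):

commit(TRAINING,
       update(TRAINING, batchSize="250",
              search(COL1,
                     q="JoeID:Training",
                     fl="id,ClusterText,out_i",
                     sort="id asc",
                     rows="10000")))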

-Joe


On 3/21/2017 10:17 PM, Tim Casey wrote:
Joe,

To do this correctly, soundly, you will need to sample the data and mark each sample as threatening or neutral.  You can probably expand on this quite a bit, but that would be a good start.  You can then draw another set of samples and see how you did: use one set to train and the other to validate.
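
In Solr terms, you could pull the set to label with the random() stream source, something like this (just a sketch; collection and field names are taken from your earlier message):

random(UNCLASS, q="*:*", rows="100", fl="id,ClusterText")

Label that sample, then draw a second one the same way and hold it out as your validation set.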

What you are doing now is probably just noise from a model's point of view, and it probably won't make much difference how you index/query/model through that noise.

I don't mean this critically, just plainly.  Effectively, the less mathematically correct the process, the more anecdotal the result.

tim


On Mon, Mar 20, 2017 at 4:42 PM, Joel Bernstein <joels...@gmail.com> wrote:

I've only tested with the training data in its own collection, but it was designed for multiple training sets in the same collection.

I suspect your training set is too small to get a reliable model from. The training sets we tested with were considerably larger.

All the idfs_ds values being the same seems odd though. The idfs_ds in
particular were designed to be accurate when there are multiple training
sets in the same collection.
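
A quick way to sanity-check the per-class counts in a training set is the facet() stream (just a sketch, using the query and outcome field from your expression):

facet(UNCLASS,
      q="ProfileID:PROFCLUST1",
      buckets="out_i",
      bucketSorts="count(*) desc",
      bucketSizeLimit=10,
      count(*))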

Joel Bernstein
http://joelsolr.blogspot.com/

On Mon, Mar 20, 2017 at 5:41 PM, Joe Obernberger <joseph.obernber...@gmail.com> wrote:

If I put the training data into its own collection and use q="*:*", then
it works correctly.  Is that a requirement?
Thank you.

-Joe



On 3/20/2017 3:47 PM, Joe Obernberger wrote:

I'm trying to build a model using tweets.  I've manually tagged 30 tweets as threatening, and 50 random tweets as non-threatening.  When I build the model with:

update(models2, batchSize="50",
              train(UNCLASS,
                       features(UNCLASS,
                                      q="ProfileID:PROFCLUST1",
                                      featureSet="threatFeatures3",
                                      field="ClusterText",
                                      outcome="out_i",
                                      positiveLabel=1,
                                      numTerms=250),
                       q="ProfileID:PROFCLUST1",
                       name="threatModel3",
                       field="ClusterText",
                       outcome="out_i",
                       maxIterations="100"))

It appears to work, but all the idfs_ds values are identical.  The terms_ss values look reasonable, but nearly all the weights_ds are 1.0. The out_i field is -1 for non-threatening tweets and +1 for threatening tweets.  I'm trying to follow along with Joel Bernstein's excellent post here:
http://joelsolr.blogspot.com/2017/01/deploying-ai-alerting-system-with-solrs.html
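
In case it's useful, this is how I'm looking at the stored model records (a sketch; it just dumps a few docs from the models2 collection, sorted on the default id field):

search(models2,
       q="*:*",
       fl="id,terms_ss,idfs_ds,weights_ds",
       sort="id asc",
       rows="5")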

Tips?

Thank you!

-Joe


