>>What is the "best practices" formula for determining above average
correlations of adjacent terms
I gave this some thought in
https://issues.apache.org/jira/browse/LUCENE-474
I found the Jaccard coefficient favoured rare words too strongly and
so went for a blend as shown below:
public float getScore()
{
    float overallIntersectionPercent = coIncidenceDocCount
            / (float) (termADocFreq + termBDocFreq);
    float termBIntersectionPercent = coIncidenceDocCount
            / (float) (termBDocFreq);

    // using just the termB intersection favours common words as
    // coincidents eg "new" food
    // return termBIntersectionPercent;

    // using just the overall intersection favours rare words as
    // coincidents eg "szechuan" food
    // return overallIntersectionPercent;

    // so here we take an average of the two:
    return (termBIntersectionPercent + overallIntersectionPercent)
            / 2;
}
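
To try the blend on concrete numbers, the method above can be restated as a self-contained sketch. The class name, the static method, and the counts passed to it are assumptions made for illustration only; the arithmetic is unchanged from the snippet above.

```java
class BlendedCoOccurrenceScore {

    /**
     * Average of two ratios:
     *  - co-occurring docs over the summed doc frequencies of both terms
     *  - co-occurring docs over termB's doc frequency
     */
    public static float blendedScore(int coIncidenceDocCount, int termADocFreq, int termBDocFreq) {
        float overallIntersectionPercent =
                coIncidenceDocCount / (float) (termADocFreq + termBDocFreq);
        float termBIntersectionPercent = coIncidenceDocCount / (float) termBDocFreq;
        return (termBIntersectionPercent + overallIntersectionPercent) / 2;
    }

    public static void main(String[] args) {
        // Hypothetical counts, not taken from any real index.
        System.out.println(blendedScore(50, 200, 250)); // termA fairly common
        System.out.println(blendedScore(5, 10, 250));   // termA rare
    }
}
```

Running it with a rare and a common termA side by side is a quick way to see how much each half of the average contributes.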
------------------------------------------------------------------------
*From:* Mark Bennett <mbenn...@ideaeng.com>
*To:* dev@lucene.apache.org
*Sent:* Fri, 10 September, 2010 18:44:31
*Subject:* Re: Relevancy, Phrase Boosting, Shingles and Long Tail Curves
Thanks Mark H,
Maybe I'll look at MLT (More Like This) again. I'll also check out zipf.
It's claimed that Question and Answer wording is different enough from
generic text content that different techniques might be indicated.
From what I remember:
1: Though nouns normally convey 60% of relevancy in general text, Q&A
content is skewed a bit more towards verbs.
2: Questions may contain more noise words (though perhaps in useful
groupings)
3: Vocabulary mismatch of Interrogative vs. declarative / narrative (Q
vs A)
4: Vocabulary mismatch of novices vs experts (Q vs A)
It was item 2 that I was hoping to capitalize on with NGrams / Shingles.
Still waiting for the relevancy math nerds to chime in about the
log-log and IDF stuff ... ;-)
I was thinking a bit more about the math involved here....
What is the "best practices" formula for determining above-average
correlations of adjacent terms, beyond what random chance would give?
So you notice that "white" and "house" appear next to each other more
than what chance distribution would explain, so you decide it's an
important NGram.
The "noise floor" isn't too bad for the typical shopping cart items
calculation.
You analyze the items present or not present in 1,000 shopping cart
receipts.
If grocery items were completely independent then the "random" level
is just the probabilities of the 2 items multiplied together:
    1,000 shopping carts
    200 have cereal
    250 have milk
    chance of
        cereal = 200/1,000 = 20%
        milk = 250/1,000 = 25%
    IF independent then
        P(cereal AND milk) = P(cereal) * P(milk)
        20% * 25% = 5%
    So 50 carts are likely to have both cereal and milk
    And if MORE than 50 carts have cereal and milk, then it's
    worth noting.
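
The cart arithmetic above can be written as a small Java sketch. The class and method names are made up for illustration, and the observed count of 120 carts is a hypothetical figure rather than data from the thread. The ratio of observed to expected is often called "lift".

```java
class IndependenceBaseline {

    /** Expected number of carts holding both items if the two items were independent. */
    public static double expectedJointCount(long totalCarts, long countA, long countB) {
        // P(A) * P(B) * total, rearranged to a single division
        return (double) countA * countB / totalCarts;
    }

    /** Observed joint count divided by the expected one; above 1.0 means more than chance. */
    public static double lift(long totalCarts, long countA, long countB, long observedBoth) {
        return observedBoth / expectedJointCount(totalCarts, countA, countB);
    }

    public static void main(String[] args) {
        // 1,000 carts, 200 with cereal, 250 with milk, as in the example above.
        System.out.println(expectedJointCount(1000, 200, 250));
        // 120 carts with both is a hypothetical observation.
        System.out.println(lift(1000, 200, 250, 120));
    }
}
```

With the numbers from the example the expected count comes out at the same 50 carts, so any observed count above that gives a lift greater than 1.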
The classic example is diapers and beer, which is a bit apocryphal and
NOT expected, but I like the breakfast cereal and milk example better
because it IS expected.
Now back to word-A appearing directly before word-B, and finding the
base level number of times you'd expect just from random chance.
Although Lucene/Luke gives you total word instances and document
counts, what you'd really want is the number of possible N-Grams,
which is affected by document boundaries, so it gets a little weird.
Some other differences between the word-A word-B calculation vs milk
and cereal:
1: I want ordered pairs, "white" before "house"
2: A document is NOT like a shopping cart in that I DO care how many
times "white" appears before "house", whereas in the shopping carts I
only cared about present or not present, so document count is less
helpful here.
I'm sure some companies and PhDs have super secret formulas for this,
but I'd be content to just compare it to baseline random chance.
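
One possible way to get that baseline for ordered pairs is sketched below. It is a sketch under stated assumptions, not an established formula: it treats each word's chance as its share of all token instances, counts adjacent positions per document so that pairs never span a document boundary, and counts every occurrence rather than presence per document. The class name, method name, and sample documents are hypothetical.

```java
import java.util.List;

class OrderedPairBaseline {

    /** Returns {observed, expected} counts of wordA immediately followed by wordB. */
    public static double[] observedVsExpected(List<List<String>> docs, String wordA, String wordB) {
        long tokens = 0, countA = 0, countB = 0, positions = 0, observed = 0;
        for (List<String> doc : docs) {
            tokens += doc.size();
            // A document of n tokens has n - 1 adjacent pairs; none cross into the next document.
            positions += Math.max(0, doc.size() - 1);
            for (int i = 0; i < doc.size(); i++) {
                String t = doc.get(i);
                if (t.equals(wordA)) countA++;
                if (t.equals(wordB)) countB++;
                if (i + 1 < doc.size() && t.equals(wordA) && doc.get(i + 1).equals(wordB)) {
                    observed++;
                }
            }
        }
        // Assumption: under independence, each adjacent position is (A, B) with chance P(A) * P(B).
        double expected = tokens == 0 ? 0.0
                : positions * ((double) countA / tokens) * ((double) countB / tokens);
        return new double[] {observed, expected};
    }

    public static void main(String[] args) {
        List<List<String>> docs = List.of(
                List.of("the", "white", "house", "is", "white"),
                List.of("house", "white", "house"));
        double[] r = observedVsExpected(docs, "white", "house");
        System.out.println("observed=" + r[0] + " expected=" + r[1]);
    }
}
```

Because the order matters, swapping the two word arguments gives a different observed count while the expected count stays the same.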
Mark B
--
Mark Bennett / New Idea Engineering, Inc. / mbenn...@ideaeng.com
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
On Fri, Sep 10, 2010 at 3:17 AM, mark harwood <markharw...@yahoo.co.uk> wrote:
Hi Mark
I've played with Shingles recently in some auto-categorisation
work where my starting assumption was that multi-word terms will
hold more information value than individual words, and that phrase
queries on separate terms will not give these term combos their
true reward (in terms of IDF) - or, if they did compute the true
IDF, would require lots of disk IO to do so. Shingles present a
conveniently pre-aggregated score for these combos.
Looking at MoreLikeThis queries based on a shingling analyzer,
the results I saw generally seemed good, but I did not formally
benchmark this against non-shingled indexes. Not
everything was rosy in that I did see some tendency to over-reward
certain rare shingles (e.g. a shared mention of "New Years Eve
Party" pulled otherwise mostly unrelated news articles together).
This led me to look at using the links in resulting documents to
help identify clusters of on-topic and potentially off-topic
results to tune these discrepancies out but that's another topic.
BTW, the Luke tool has a "Zipf" plugin that you may find useful in
examining index term distributions in Lucene indexes.
Cheers
Mark
------------------------------------------------------------------------
*From:* Mark Bennett <mbenn...@ideaeng.com>
*To:* java-...@lucene.apache.org
*Sent:* Fri, 10 September, 2010 1:42:11
*Subject:* Relevancy, Phrase Boosting, Shingles and Long Tail Curves
I want to boost the relevancy of some Question and Answer content.
I'm using stop words, Dismax, and I'm already a fan of Phrase
Boosting and have cranked that up a bit. But I'm considering using
long Shingles to make use of some of the normally stopped out
"junk words" in the content to help relevancy further.
Reminder: "Shingles" are artificial tokens created by gluing
together adjacent words.
Input text: This is a sentence
Normal tokens: this, is, a, sentence (without removing stop
words)
2+3 word shingles: this-is, is-a, a-sentence, this-is-a,
is-a-sentence
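
The reminder above can be reproduced with a few lines of plain Java. This is only a sketch of the idea and does not use Lucene's own shingle filter; the class name, the method signature, and the hyphen separator (taken from the example above) are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

class ShingleSketch {

    /** Builds shingles of minSize..maxSize adjacent tokens, joined with '-'. */
    public static List<String> shingles(List<String> tokens, int minSize, int maxSize) {
        List<String> out = new ArrayList<>();
        for (int size = minSize; size <= maxSize; size++) {
            // Slide a window of 'size' tokens across the token list.
            for (int start = 0; start + size <= tokens.size(); start++) {
                out.add(String.join("-", tokens.subList(start, start + size)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Lower-case and split on whitespace; stop words are deliberately kept.
        List<String> tokens = List.of("This is a sentence".toLowerCase().split("\\s+"));
        System.out.println(shingles(tokens, 2, 3));
        // prints [this-is, is-a, a-sentence, this-is-a, is-a-sentence]
    }
}
```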
A few questions on relevance and shingles:
1: How similar are the relevancy calculations between
Shingles and exact phrases?
I've seen material saying that shingles can give better
performance than normal phrase searching, and I'm assuming this is
exact phrase (vs. allowing for phrase slop)
But do the relevancy calculations for normal exact phrase and
Shingles wind up being *identical*, for the same documents and
searches? That would seem an unlikely coincidence, but possibly
it could have been engineered to intentionally behave that way.
2: What's the latest on Shingles and Dismax?
The low-level, front-end tokenization in Dismax would seem to
be a problem, but does the new parser stuff help with this?
3: I'm thinking of a minimum 3 word shingle, does anybody have
comments on shingle length?
Eyeballing the 2 word shingles, they don't seem much better than
stop words. Obviously my shingle field bypasses stop words.
But the 3 word shingles start to look more useful, expressing more
intent, such as "how do i", "do i need" and "it work with", etc.
Have there been any Lucene/Solr studies specifically on shingle length?
and finally...
4: Is it useful to examine your token occurrences against a
Power-Law log-log curve?
So, with either single words, or shingles, you do a histogram, and
then plot the histogram in an X-Y graph, with both axes being
logarithmic. Then see if the resulting graph follows (or diverges
from) a straight line. This "Long Tail" / Pareto / power-law
mathematics was very popular a few years ago for looking at
histograms of DVD rentals and human activities, and prior to the
web, the power law and 80/20 rules had been observed in many other
situations, both man-made and natural.
Also of interest, when a distribution is expected to follow a
power-law line, but the actual data deviates from that theoretical
line, then this might indicate some other factors at work, or so
the theory goes.
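
A minimal sketch of the straight-line check described above: sort the token counts into rank order, take logs of both rank and count, and fit a line by ordinary least squares. A slope near -1 with R-squared near 1 would look Zipf-like; a low R-squared suggests the curve is not very straight. The class name, method name, and sample counts are hypothetical, and the sketch assumes at least two counts, all positive and not all equal.

```java
import java.util.Arrays;

class LogLogFit {

    /**
     * Fits log(count) = intercept + slope * log(rank) by least squares.
     * Returns {slope, intercept, rSquared}.
     * Assumes counts.length >= 2, every count > 0, and counts not all equal.
     */
    public static double[] fit(long[] counts) {
        long[] sorted = counts.clone();
        Arrays.sort(sorted); // ascending, so rank 1 is the last element
        int n = sorted.length;
        double sumX = 0, sumY = 0, sumXX = 0, sumXY = 0, sumYY = 0;
        for (int rank = 1; rank <= n; rank++) {
            double x = Math.log(rank);
            double y = Math.log(sorted[n - rank]);
            sumX += x;
            sumY += y;
            sumXX += x * x;
            sumXY += x * y;
            sumYY += y * y;
        }
        double covXY = n * sumXY - sumX * sumY;
        double varX = n * sumXX - sumX * sumX;
        double varY = n * sumYY - sumY * sumY;
        double slope = covXY / varX;
        double intercept = (sumY - slope * sumX) / n;
        double rSquared = (covXY * covXY) / (varX * varY);
        return new double[] {slope, intercept, rSquared};
    }

    public static void main(String[] args) {
        // Hypothetical term counts in arbitrary order.
        long[] counts = {300, 1200, 200, 600, 240, 400};
        System.out.println(Arrays.toString(fit(counts)));
    }
}
```

The same method can be run once on single-word counts and once on shingle counts to compare the two slopes.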
So if users' searches follow any type of histogram with a hidden
powerlaw line, then it makes sense to me that the source content
might also follow a similar distribution. Is the normal IDF
ranking inspired by that type of curve?
And *if* word occurrences, in either searches or source documents,
were expected to follow a power law distribution, then possible
shingles would follow such a curve as well.
Thinking that document text, like many other things in nature,
might follow such a curve, I used the Lucene index to generate
such a curve. And I did the same thing for 3 word tokens. The 2
curves do have different slopes, but neither is very straight.
So I was wondering if anybody else has looked at IDF curves
(actually non-inverted document frequency curves) or raw word
instance counts and power law graphs? I haven't found a smoking
gun in my online searches, but I'm thinking some of you would know
this.
--
Mark Bennett / New Idea Engineering, Inc. / mbenn...@ideaeng.com
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513