@Ken
Thanks for the hints...
I am coming from a payload-based system, so I am aware of them; however, in
the Lucene 3.6 branch boosting and payloads didn't work together (if you
set includeSpanScore to false on the PayloadTermQuery, boosts were ignored).
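
For reference, what I mean is roughly the following; a minimal sketch
against the 3.6 API (the field name and the hashed term are just
placeholders):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.payloads.AveragePayloadFunction;
import org.apache.lucene.search.payloads.PayloadTermQuery;

public class PayloadQuerySketch {
  public static PayloadTermQuery hashedFeatureQuery(String field, int hashedIndex, float boost) {
    // includeSpanScore = false: the score comes from the payload function
    // only, which is the configuration where boosts were not picked up.
    PayloadTermQuery query = new PayloadTermQuery(
        new Term(field, Integer.toString(hashedIndex)),
        new AveragePayloadFunction(),
        false);
    query.setBoost(boost); // effectively ignored in this configuration
    return query;
  }
}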

Besides that, there is no performance issue here so far, so it's probably a
fine way to go; I was just curious. As for IntField / TrieIntField, all of
its range-query / ordering benefits are pure overhead here, since the
integers just represent random indices into a vector. I might look into
indexing the integer bytes rather than the string representation...
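
If I go that route, the prefix-coded encoding Lucene already uses for
numeric fields might be enough, i.e. packing the int into a few chars
instead of its decimal string (just a sketch, assuming the 3.x NumericUtils
API, without any trie precision levels):

import org.apache.lucene.util.NumericUtils;

public class TermEncodingSketch {
  // Encode a hashed vector index as a compact prefix-coded term instead of
  // its decimal string form; prefixCodedToInt() recovers the index.
  public static String encode(int hashedIndex) {
    return NumericUtils.intToPrefixCoded(hashedIndex);
  }

  public static int decode(String term) {
    return NumericUtils.prefixCodedToInt(term);
  }
}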

@Ted
You are probably right about choosing 1 as the term frequency. I forgot
that the most interesting information probably comes from the idf, and
using cooccurrence counts as term frequencies might make the combination
with text searches infeasible, since the values lie in a totally different
range. Also, I forgot that idf is per field, so I might go for separating
the hashed values into their originating fields (search_term, item_id,
category_id). This would still allow recombining them later when a user
profile has to be constructed.
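
Roughly what I have in mind, as a sketch against the 3.x Document API (the
field names and the hashed-index arrays are made up):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class UserDocSketch {
  // One field per originating source, so idf is computed separately for
  // search terms, item ids and category ids.
  public static Document toDocument(int[] searchTermHashes, int[] itemIdHashes, int[] categoryIdHashes) {
    Document doc = new Document();
    addAll(doc, "search_term", searchTermHashes);
    addAll(doc, "item_id", itemIdHashes);
    addAll(doc, "category_id", categoryIdHashes);
    return doc;
  }

  private static void addAll(Document doc, String field, int[] hashes) {
    for (int hash : hashes) {
      doc.add(new Field(field, Integer.toString(hash),
          Field.Store.NO, Field.Index.NOT_ANALYZED));
    }
  }
}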

> I like to threshold with LLR.  That gives me a binary matrix.  Then I
> directly index that.
> The search engine provides very nice weights at this point.  I don't feel
> the need to adjust those weights because they have roughly the same form as
> learned weights are likely to have and because learning those weights would
> almost certainly result in over-fitting unless I go to quite a lot of
> trouble.
> Also, I have heard that at least one head-to-head test found that the
> native Solr term weighting actually out-performed several more intricate
> and explicit weighting schemes.  That can't be taken as evidence that
> Solr's weightings would perform better than whatever you have in mind, but
> it does provide interesting meta-evidence that a reasonably smart dev team
> is not guaranteed to beat Solr's weighting by a large margin.  When you sit
> down to architect your system, you need to make decisions about where to
> spend your time, and evidence like that is helpful for guessing how much
> effort it would take to achieve different levels of performance.



I am also thresholding the counts with LLR. Every time I do this I take a
threshold of 10, since I loosely remember it being around the 99%
confidence level of the chi-square distribution. I have no clue, however,
whether anybody wants something like 99% for recommendations or whether 50%
might be a better value. What's your experience on that?
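
To make the numbers concrete, here is a quick check (assuming the 2x2
cooccurrence table, i.e. one degree of freedom; commons-math3, nothing from
my actual system):

import org.apache.commons.math3.distribution.ChiSquaredDistribution;

public class LlrThresholdCheck {
  public static void main(String[] args) {
    // The LLR statistic for a 2x2 cooccurrence table is asymptotically
    // chi-square distributed with one degree of freedom.
    ChiSquaredDistribution chiSq = new ChiSquaredDistribution(1);
    System.out.println(chiSq.inverseCumulativeProbability(0.99));  // ~6.63  -> 99% cutoff
    System.out.println(chiSq.inverseCumulativeProbability(0.999)); // ~10.83 -> 99.9% cutoff
    System.out.println(1.0 - chiSq.cumulativeProbability(10.0));   // ~0.0016 -> 10 is ~99.8%
  }
}

So a threshold of 10 is actually a bit stricter than 99%, more like 99.8%.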

And do you apply a limit on the total number of docs per term, since big
boolean queries could otherwise drag down performance?
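
To make the question concrete, this is the kind of cap I mean when building
the recommendation query from a profile (the parameter names and limits are
made up):

import java.io.IOException;
import java.util.List;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class ProfileQuerySketch {
  // Skip overly frequent terms and stop once the query has enough clauses,
  // so a single query cannot touch an unbounded number of documents.
  public static BooleanQuery build(IndexReader reader, List<Term> profileTerms,
                                   int maxDocsPerTerm, int maxClauses) throws IOException {
    BooleanQuery query = new BooleanQuery();
    int clauses = 0;
    for (Term term : profileTerms) {
      if (clauses >= maxClauses) {
        break;
      }
      if (reader.docFreq(term) <= maxDocsPerTerm) {
        query.add(new TermQuery(term), BooleanClause.Occur.SHOULD);
        clauses++;
      }
    }
    return query;
  }
}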

Thanks for all the input!



On Mon, Feb 11, 2013 at 7:20 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> On Sun, Feb 10, 2013 at 3:39 PM, Johannes Schulte <
> johannes.schu...@gmail.com> wrote:
>
> > ...
> > I am currently implementing a system of the same kind: LLR-sparsified
> > "term"-cooccurrence vectors in Lucene (since not a day goes by where I
> > see Ted praising this).
> >
>
> (turns red)
>
>
> > There are not only views and purchases, but also search terms, facets,
> > and a lot more textual information to be included in the cooccurrence
> > matrix (as "input").
> > That's why I went with the feature hashing framework in Mahout. This
> > gives small (hd/mem) user profiles and allows for reusing the vectors
> > for click prediction and/or clustering.
>
>
> This is a reasonable choice.  For recommendations, you might want to use
> direct encoding since it can be simpler to build a search index for
> recommending.
>
>
> > The main difference is that there are only two fields in Lucene with a
> > lot of terms (numbers), representing the features. Two fields because I
> > think predicting views (besides purchases) might in some cases be better
> > than predicting nothing.
> >
>
> OK.
>
>
> > I don't think it should make a big difference in scoring, because in the
> > vector space model used by most engines it's just, well, a vector space,
> > and I don't know if the field norms make sense after stripping values
> > from the term vectors with the LLR threshold.
> >
>
> Having separate fields is going to give separate total term counts.  That
> seems better to me, but I have to confess I have never rigorously tested
> that.
>
>
> > @Ted
> > > It is handy to simply use the binary values of the sparsified versions
> > > of these and let the search engine handle the weighting of different
> > > components at query time.
> >
> > Do you really want to omit the cooccurrence counts, which would become
> > the term frequencies? How would the engine then weight different inputs
> > against each other?
> >
>
> I like to threshold with LLR.  That gives me a binary matrix.  Then I
> directly index that.
>
> The search engine provides very nice weights at this point.  I don't feel
> the need to adjust those weights because they have roughly the same form as
> learned weights are likely to have and because learning those weights would
> almost certainly result in over-fitting unless I go to quite a lot of
> trouble.
>
> Also, I have heard that at least one head-to-head test found that the
> native Solr term weighting actually out-performed several more intricate
> and explicit weighting schemes.  That can't be taken as evidence that
> Solr's weightings would perform better than whatever you have in mind, but
> it does provide interesting meta-evidence that a reasonably smart dev team
> is not guaranteed to beat Solr's weighting by a large margin.  When you sit
> down to architect your system, you need to make decisions about where to
> spend your time, and evidence like that is helpful for guessing how much
> effort it would take to achieve different levels of performance.
>
> > And, if anyone knows a
> > 1. smarter way to index the cooccurrence counts in Lucene than a
> > tokenstream that emits a word k times for a cooccurrence count of k
> >
>
> You can use payloads or you can boost individual terms.
>
>
> > 2. way to avoid treating the (hashed) vector column indices as terms but
> > reusing them? It's a bit weird hashing to an int and then having the
> > Lucene term dictionary treat them as strings, mapping to another int
> >
>
> Why do we care about this?  These tokens get put onto documents that have
> additional data to help them make sense, but why do we care if the tokens
> look like numbers?
>
