@Ken Thanks for the hints... I am coming from a payload-based system, so I am aware of them; however, in the Lucene 3.6 branch, boosting and payloads didn't work together (if you set PayloadTermQuery.setIncludeSpanScore to false, boosts were ignored).
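For context, the 3.x behavior can be sketched like this; a simplified Python model of what PayloadTermQuery's scorer did, with paraphrased names rather than the actual Lucene API:

```python
def payload_term_score(span_score, payload_factor, include_span_score=True):
    """Simplified sketch of Lucene 3.x PayloadTermQuery scoring.

    span_score is the regular similarity score, into which query and
    field boosts were folded; payload_factor is the payload-derived
    score. With include_span_score=False only the payload factor was
    returned, which is why boosts were effectively ignored in that mode.
    """
    if include_span_score:
        return span_score * payload_factor
    return payload_factor
```

So doubling a boost doubles span_score and hence the combined score, but has no effect at all once include_span_score is False.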
Besides that, there is no performance issue here so far, so it's probably a fine way to go; I was just curious. As for IntField / TrieIntField, all of their range-query / ordering benefits are overhead, since the integers just represent random indices into a vector. I might look into indexing the integer bytes rather than the string representation...

@Ted You are probably right with choosing 1 as term frequency; I forgot that the most interesting information probably comes from the idf, and using cooccurrence counts as term frequency might make the combination with text searches infeasible, since the values lie in a totally different range. Also, I forgot that idf is per field, so I might go for separating the hashed values into their originating fields (search term, item_id, category_id). This would still allow recombining them later when a user profile has to be constructed.
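The field-separated hashing idea can be sketched as follows. This is a minimal Python illustration, not Mahout's actual encoder (Mahout's FeatureVectorEncoder uses MurmurHash and multiple probes, and the names and the dimensionality here are assumptions):

```python
import hashlib

DIM = 1 << 18  # hashed vector dimensionality (an assumed value)

def hashed_index(field, value, dim=DIM):
    # Prefixing with the originating field (search_term, item_id,
    # category_id) keeps the namespaces apart, mirroring the
    # per-field idf separation described above.
    digest = hashlib.md5(f"{field}:{value}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % dim

# The resulting integer is an arbitrary index into a vector: its
# ordering carries no meaning, which is why TrieIntField's range-query
# and sorting support is pure overhead here. As a Lucene term it would
# simply be the string form of this index:
term = str(hashed_index("search_term", "sneakers"))
```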
I am also thresholding the counts with LLR. Every time I do this I take a threshold of 10, since I loosely remember that being roughly the 99% confidence margin of the chi-square distribution. I have no clue, however, whether anybody wants something like 99% confidence for recommendations or whether 50% might be a better value. What's your experience on that? And do you apply a limit on the total number of docs per term, since big boolean queries could tear down performance?

Thanks for all the input!

On Mon, Feb 11, 2013 at 7:20 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> On Sun, Feb 10, 2013 at 3:39 PM, Johannes Schulte <
> johannes.schu...@gmail.com> wrote:
>
> > ...
> > I am currently implementing a system of the same kind, LLR-sparsified
> > "term" cooccurrence vectors in Lucene (since not a day goes by where I
> > see Ted praising this).
>
> (turns red)
>
> > There are not only views and purchases, but also search terms, facets
> > and a lot more textual information to be included in the cooccurrence
> > matrix (as "input").
> > That's why I went with the feature hashing framework in Mahout. This
> > gives small (hd/mem) user profiles and allows for reusing the vectors
> > for click prediction and/or clustering.
>
> This is a reasonable choice. For recommendations, you might want to use
> direct encoding since it can be simpler to build a search index for
> recommending.
>
> > The main difference is that there are only two fields in Lucene with a
> > lot of terms (numbers), representing the features. Two fields because I
> > think predicting views (besides purchases) might in some cases be
> > better than predicting nothing.
>
> OK.
>
> > I don't think it should make a big difference in scoring, because in
> > the vector space model used by most engines it's just, well, a vector
> > space, and I don't know if the field norm makes sense after stripping
> > values from the term vectors with the LLR threshold.
> Having separate fields is going to give separate total term counts. That
> seems better to me, but I have to confess I have never rigorously tested
> that.
>
> > @Ted
> > > It is handy to simply use the binary values of the sparsified
> > > versions of these and let the search engine handle the weighting of
> > > different components at query time.
> >
> > Do you really want to omit the cooccurrence counts, which would become
> > the term frequencies? How would the engine then weight different
> > inputs against each other?
>
> I like to threshold with LLR. That gives me a binary matrix. Then I
> directly index that.
>
> The search engine provides very nice weights at this point. I don't feel
> the need to adjust those weights, because they have roughly the same
> form as learned weights are likely to have, and because learning those
> weights would almost certainly result in over-fitting unless I go to
> quite a lot of trouble.
>
> Also, I have heard that at least one head-to-head test found that the
> native Solr term weighting actually out-performed several more intricate
> and explicit weighting schemes. That can't be taken as evidence that
> Solr's weightings would perform better than whatever you have in mind,
> but it does provide interesting meta-evidence that even a reasonably
> smart dev team is not guaranteed to beat Solr's weighting by a large
> margin. When you sit down to architect your system, you need to decide
> where to spend your time, and evidence like that helps you guess how
> much effort it would take to achieve different levels of performance.
>
> > And, if anyone knows a
> > 1. smarter way to index the cooccurrence counts in Lucene than a token
> > stream that emits a word k times for a cooccurrence count of k
>
> You can use payloads or you can boost individual terms.
>
> > 2. way to avoid treating the (hashed) vector column indices as terms
> > but reusing them?
> > It's a bit weird hashing to an int and then having the Lucene term
> > dictionary treat them as strings, mapping to another int.
>
> Why do we care about this? These tokens get put onto documents that have
> additional data to help them make sense, but why do we care if the
> tokens look like numbers?
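Since the thread keeps coming back to LLR thresholding, here is a small self-contained sketch of Dunning's G² test on a 2x2 cooccurrence table, along the lines of the entropy formulation used in Mahout's LogLikelihood class (the threshold of 10 is the value discussed above; for comparison, the 99% quantile of chi-square with one degree of freedom is about 6.63):

```python
import math

def x_log_x(x):
    # x * ln(x), with 0 * ln(0) defined as 0
    return x * math.log(x) if x > 0 else 0.0

def entropy(*counts):
    # Unnormalized entropy (N * H), following Dunning's formulation
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    """G^2 log-likelihood ratio for a 2x2 cooccurrence table.

    k11: both events together, k12: A without B,
    k21: B without A, k22: neither.
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return 2.0 * (row + col - mat)

def keep(k11, k12, k21, k22, threshold=10.0):
    # Keep a cooccurrence cell only if its LLR clears the threshold;
    # the surviving cells form the binary matrix that gets indexed.
    return llr(k11, k12, k21, k22) >= threshold
```

For perfectly independent counts the ratio is zero, and it grows with the strength (and support) of the association, so the choice of threshold trades recall against noise.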