Not entirely without regard to weight.  Just without regard to designing
weights specific to this application.  The weights that Solr uses natively
are intuitively what we want (rare indicators have higher weights in a
log-ish kind of way).
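
To make "log-ish" concrete, here is a rough sketch of the classic Lucene
tf-idf shape (numbers and names are made up; this is an illustration, not a
claim about any particular Solr version):

// Rough sketch of the classic Lucene tf-idf shape: rare indicator terms get
// larger weights, and the growth is logarithmic.  Illustrative only.
public class IndicatorWeightSketch {

  // idf grows as the term gets rarer (smaller docFreq)
  static double idf(long docFreq, long numDocs) {
    return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
  }

  // tf is sub-linear in the raw term count
  static double tf(double freq) {
    return Math.sqrt(freq);
  }

  public static void main(String[] args) {
    long numDocs = 1_000_000L;
    System.out.println("common indicator, df=100000: idf = " + idf(100_000, numDocs));
    System.out.println("rare indicator,   df=50:     idf = " + idf(50, numDocs));
  }
}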

Frankly, I doubt that mathematical reasoning will buy us a much better
weighting here.  The Solr defaults probably deviate from optimal by about
as much as reality deviates from the assumptions that the mathematically
motivated weightings are based on.  Fixing this is spending a lot for
small potatoes.  Fixing the data flow and getting access to more data is
far higher value.



On Mon, Jul 22, 2013 at 12:18 PM, Michael Sokolov <
msoko...@safaribooksonline.com> wrote:

> So you are proposing just grabbing the top N scoring related items and
> indexing them without regard to weight?  Effectively quantizing the
> weights to 1 for those, and 0 for everything else?  I guess LLR tends to
> do that anyway.
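>
> Something like this is what I mean, just as a sketch (names are invented,
> nothing Solr-specific here):
>
> import java.util.*;
>
> // Keep only the IDs of the N highest-scoring related items and drop the
> // scores; the indicator field then just records presence/absence.
> public class TopNIndicators {
>   public static List<String> topN(Map<String, Double> scoreByItemId, int n) {
>     List<Map.Entry<String, Double>> entries =
>         new ArrayList<Map.Entry<String, Double>>(scoreByItemId.entrySet());
>     Collections.sort(entries, new Comparator<Map.Entry<String, Double>>() {
>       public int compare(Map.Entry<String, Double> a, Map.Entry<String, Double> b) {
>         return b.getValue().compareTo(a.getValue());  // descending by score
>       }
>     });
>     List<String> ids = new ArrayList<String>();
>     for (Map.Entry<String, Double> e : entries.subList(0, Math.min(n, entries.size()))) {
>       ids.add(e.getKey());  // keep the ID, throw the weight away
>     }
>     return ids;  // indexed as plain terms, e.g. "item42 item7 item913"
>   }
> }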
>
> -Mike
>
>
> On 07/22/2013 02:57 PM, Ted Dunning wrote:
>
>> My experience is that TFIDF works just fine, especially as a first cut.
>>
>> Adding different kinds of data, building out backend A/B testing, tuning
>> the UI, and weighting the query all come before the next round of
>> weighting changes.  Typically, the priority stack never empties enough
>> for that task to rise to the top.
>>
>>
>> On Mon, Jul 22, 2013 at 11:07 AM, Michael Sokolov <
>> msoko...@safaribooksonline.com> wrote:
>>
>>  On 07/22/2013 12:20 PM, Pat Ferrel wrote:
>>>
>>>> My understanding of the Solr proposal puts B's row similarity matrix in a
>>>> vector per item. That means each row is turned into "terms" = external
>>>> IDs--not sure how the weights of each term are encoded.
>>>>
>>> This is the key question for me. The best idea I've had is to use termFreq
>>> as a proxy for weight.  It's only an integer, so there are scaling issues
>>> to consider, but you can apply a per-field weight to manage that.  Also,
>>> Lucene (and Solr) doesn't provide an obvious way to load term frequencies
>>> directly: probably the simplest thing to do is just to repeat the
>>> cross-term N times and let the text analysis take care of counting them.
>>> Inefficient, but probably the quickest way to get going.  Alternatively,
>>> there are some lower level Lucene indexing APIs (DocFieldConsumer et al)
>>> which I haven't really plumbed entirely, but would allow for more direct
>>> loading of fields.
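>>>
>>> Concretely, the repeat-the-term trick might look something like this
>>> (just a sketch; the "related" field name and the scaling factor are
>>> made up):
>>>
>>> import java.util.Map;
>>> import org.apache.solr.common.SolrInputDocument;
>>>
>>> // Sketch: encode each related item's weight as a term frequency by
>>> // repeating its ID round(scale * weight) times in a whitespace-tokenized
>>> // field, so the analyzer's counting produces termFreq ~ weight.
>>> public class IndicatorFieldBuilder {
>>>   public static SolrInputDocument build(String itemId,
>>>                                         Map<String, Double> weightByRelatedId,
>>>                                         double scale) {
>>>     StringBuilder terms = new StringBuilder();
>>>     for (Map.Entry<String, Double> e : weightByRelatedId.entrySet()) {
>>>       int repeats = Math.max(1, (int) Math.round(scale * e.getValue()));
>>>       for (int i = 0; i < repeats; i++) {
>>>         terms.append(e.getKey()).append(' ');
>>>       }
>>>     }
>>>     SolrInputDocument doc = new SolrInputDocument();
>>>     doc.addField("id", itemId);
>>>     doc.addField("related", terms.toString().trim());  // hypothetical field
>>>     return doc;
>>>   }
>>> }
>>>
>>> The obvious cost is index bloat once the scale factor gets large.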
>>>
>>> Then one probably wants to override the scoring in some way (unless TFIDF
>>> is the way to go somehow??)
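>>>
>>> One possible hook, purely as a sketch: if termFreq is standing in for
>>> the indicator weight, a custom Similarity with a linear tf (instead of
>>> the default sqrt) would make the score track that weight directly.
>>> Whether that is actually better than stock TFIDF is exactly the open
>>> question.
>>>
>>> import org.apache.lucene.search.similarities.DefaultSimilarity;
>>>
>>> // Sketch only.  Norms could also be disabled on the field
>>> // (omitNorms="true") so long indicator lists aren't penalized.
>>> public class LinearTfSimilarity extends DefaultSimilarity {
>>>   @Override
>>>   public float tf(float freq) {
>>>     return freq;  // DefaultSimilarity uses (float) Math.sqrt(freq)
>>>   }
>>> }
>>>
>>> Wiring it in would presumably go through the <similarity> hook in
>>> schema.xml.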
>>>
>>>
>>>
>
