Hi Markus,

I'm with you up until the last part of your comments.

"cope with non-edited..." edited by whom? and for what purpose? To give a
better relative tf score...

To comment on the first part (and please correct me if I am wrong): do we
not give each page, and therefore each document, an initial score of 1.0,
which is then used by whichever scoring algorithm we plug in? If that is
the case, how are we separating the score of a page from the tf of a term
within a document, or from the tf-idf of that term over the entire document
collection, when determining relevance? How can we accurately disambiguate
between these quantities?
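
To make sure I understand the moving parts, here is how I read Lucene's
DefaultSimilarity (a sketch of its documented scoring formula; the class
name and method bodies below are only illustrative, and I am assuming the
initial page score is applied as an index-time document boost):

    // score(q,d) = coord(q,d) * queryNorm(q)
    //              * sum over query terms t of:
    //                  tf(t in d) * idf(t)^2 * t.getBoost() * norm(t,d)
    //
    // norm(t,d) folds in doc boost * field boost * lengthNorm, so the
    // page score would live in the norm, separate from tf and idf.
    import org.apache.lucene.index.FieldInvertState;
    import org.apache.lucene.search.DefaultSimilarity;

    public class InspectableSimilarity extends DefaultSimilarity {
        @Override
        public float tf(float freq) {
            // term frequency within one document (default: sqrt(freq))
            return super.tf(freq);
        }

        @Override
        public float idf(int docFreq, int numDocs) {
            // rarity of the term across the whole collection
            return super.idf(docFreq, numDocs);
        }

        @Override
        public float computeNorm(String field, FieldInvertState state) {
            // index-time boosts (including any document/page score)
            // are folded in here at indexing time
            return super.computeNorm(field, state);
        }
    }

So the three quantities are computed in separate hooks, even though they
are multiplied together in the final score.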

As I said, I'm losing you towards the end, but it would be a good
discussion to explore what sits behind the surface architecture.
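
Also, just so we are talking about the same thing: by "emitting 1.0f for
each match" I take it you mean flattening tf in a custom Similarity, along
these lines (a minimal sketch against Lucene's DefaultSimilarity; the class
name is just illustrative):

    import org.apache.lucene.search.DefaultSimilarity;

    public class FlatTfSimilarity extends DefaultSimilarity {
        @Override
        public float tf(float freq) {
            // a match contributes 1.0 no matter how often the term
            // occurs, so repeating a term gains a document nothing
            return freq > 0 ? 1.0f : 0.0f;
        }
    }

With tf flattened like that, relevancy has to come from idf, field boosts
and norms rather than raw term counts, which I assume is why you lean on
matches in multiple fields with varying boosts.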


On Mon, Jul 25, 2011 at 10:23 PM, Markus Jelsma
<markus.jel...@openindex.io> wrote:

> Hi,
>
> I've done several projects where term frequency yields bad result sets and
> worse relevancy. These projects all had one thing in common: user-generated
> content with a competitive edge, meaning classifieds web sites such as eBay
> etc. The open internet is similar: it contains edited content, classifieds,
> and spam or other garbage.
>
> What do you do with tf in your wide internet index? Do you impose a
> threshold, or are you emitting 1.0f for each match?
> For now I emit 1.0f for each match and rely on matches in multiple fields
> with varying boosts, among various other methods, to improve relevancy.
>
> Can tf*idf cope with non-edited (and untrusted) documents at all? I've seen
> great relevancy with good content but really bad relevancy in several
> cases.
>
> Thanks!
>



-- 
*Lewis*
