2012/1/27 Riccardo Tasso <[email protected]>:
>
> That's exactly what I mean. The fact is that in our interpretation of
> Wikipedia, not all the sentences are annotated. That is because not all the
> sentences containing an entity require linking. So I'm thinking of using
> only a better subset of my sentences (since there are so many). Hence the
> idea of sampling only featured pages: stubs or poor pages may have a greater
> probability of being poorly annotated.

In my case I only keep sentences that have at least one annotation
(e.g. a link that maps to one of the types I am interested in).
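
As a rough illustration of that filter, here is a minimal sketch. It
assumes MediaWiki-style [[Target|anchor]] links; TARGET_TYPES and
INTERESTING_TYPES are hypothetical names standing in for whatever
resource maps a link target to an entity type.

```python
import re

# Matches [[Target]] and [[Target|anchor text]] wiki links.
WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")

# Hypothetical lookup from link target to entity type (assumption: in
# practice this would be built from an external typed resource).
TARGET_TYPES = {"Milan": "place", "Paris": "place", "Paris Hilton": "person"}
INTERESTING_TYPES = {"place"}


def has_interesting_link(sentence):
    """Return True if at least one link target has an interesting type."""
    for match in WIKILINK.finditer(sentence):
        target = match.group(1).strip()
        if TARGET_TYPES.get(target) in INTERESTING_TYPES:
            return True
    return False


sentences = [
    "[[Milan]] is an industrial city.",
    "Paris is a very touristic city.",
    "[[Paris Hilton|Paris]] was drunk last Friday.",
]
# Keep only the sentences carrying at least one interesting annotation.
kept = [s for s in sentences if has_interesting_link(s)]
```

With this strict rule the "Paris Hilton" sentence is dropped as well,
since its target is not a place; keeping it as a negative example would
need an extra check on the anchor text.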

> The idea may also be extended with the other proposal, which I'll try to
> explain with an example. Imagine a page about a vegetable. If a city appears
> in a sentence inside this page, it could be possible that it will appear not
> linked (i.e. not annotated) since the topics of the article aren't closely
> related. Conversely, I suspect that in a page about Geography, places
> are tagged more frequently. This is obviously a hypothesis, which should be
> verified more thoroughly.
>
> Another idea is to use only sentences containing links regarding the
> entities which may be interesting. For example:
> * "[[Milan|Milan]] is an industrial city" becomes: "<place>Milan</place> is
> an industrial city"
> * "[[Paris|Paris Hilton]] was drunk last Friday." becomes: "Paris was drunk
> last Friday" (this sentence is kept because the link text is in the list of
> candidates to be tagged as places, but in this case the anchor suggests it
> isn't one, hence it is a good negative example)
> * "Paris is a very touristic city." is discarded because it doesn't contain
> any interesting link

I am not sure that "link richness" is related to the "entity
relatedness" of the topic of the article. That hypothesis would
require some data-driven validation.

Another fact to consider: on the page
http://en.wikipedia.org/wiki/Paris_Hilton , most occurrences of
"Paris Hilton" as the name of a person are not linked (because it
would be confusing for users to link to the same page). So it would be
possible to pre-process the markup by adding those self-referential links on
pages that refer to entities with interesting types.
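
A minimal sketch of such a pre-processing step, assuming the page title
is used verbatim as the entity name (add_self_links is a hypothetical
helper, and real pages would also need redirects and name variants):

```python
import re

# Capturing group so that re.split() keeps the existing links in the output.
WIKILINK = re.compile(r"(\[\[[^\]]+\]\])")


def add_self_links(title, text):
    """Wrap unlinked mentions of the page title in a link to the page itself."""
    name = re.compile(r"(?<!\w)" + re.escape(title) + r"(?!\w)")
    link = "[[" + title + "]]"
    parts = WIKILINK.split(text)
    # Even-indexed parts are plain text, odd-indexed parts are existing links
    # and are left untouched.
    for i in range(0, len(parts), 2):
        parts[i] = name.sub(lambda m: link, parts[i])
    return "".join(parts)


page = "Paris Hilton was born in [[New York City]]. Paris Hilton is an heiress."
linked = add_self_links("Paris Hilton", page)
```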

Yet another bias: if you take a page like:
http://en.wikipedia.org/wiki/The_Simple_Life that mentions Paris
Hilton many times, only the first few occurrences are links. The
remaining occurrences of the first name "Paris" are never linked: that is
a huge false negative bias. Again, a dedicated preprocessing heuristic
to automatically propagate recurring name annotations within a given page
might help.
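
One possible shape for that heuristic, as a sketch only: collect the
anchors already linked on the page, optionally treat the first token of
each anchor as an alias (an assumption that fits person names but will
misfire if, say, the city of Paris is mentioned on the same page), and
link the remaining plain-text occurrences. The function names are made
up for the example.

```python
import re

WIKILINK = re.compile(r"(\[\[[^\]]+\]\])")
LINK_PARTS = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")


def collect_aliases(text, use_first_token=True):
    """Map each linked surface form (and optionally its first token) to its target."""
    aliases = {}
    for m in LINK_PARTS.finditer(text):
        target = m.group(1).strip()
        anchor = (m.group(2) or target).strip()
        tokens = anchor.split()
        if not tokens:
            continue
        aliases.setdefault(anchor, target)
        if use_first_token:
            aliases.setdefault(tokens[0], target)
    return aliases


def propagate_links(text, use_first_token=True):
    """Link unlinked occurrences of names that are linked elsewhere on the page."""
    aliases = collect_aliases(text, use_first_token)
    if not aliases:
        return text
    # Longest alias first so that "Paris Hilton" wins over "Paris".
    names = sorted(aliases, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(n) for n in names) + r")(?!\w)"
    )

    def link(m):
        surface = m.group(0)
        target = aliases[surface]
        if surface == target:
            return "[[" + target + "]]"
        return "[[" + target + "|" + surface + "]]"

    parts = WIKILINK.split(text)
    # Only rewrite the plain-text parts; existing links stay as they are.
    for i in range(0, len(parts), 2):
        parts[i] = pattern.sub(link, parts[i])
    return "".join(parts)


page = "[[Paris Hilton]] stars in the show. Paris later moved to a farm."
propagated = propagate_links(page)
```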

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
