2012/1/27 Riccardo Tasso <[email protected]>:
>
> That's exactly what I mean. The fact is that in our interpretation of
> Wikipedia, not all the sentences are annotated. That is because not all the
> sentences containing an entity require linking. So I'm thinking of using
> only a better subset of my sentences (since there are so many). From this the
> idea of sampling only featured pages: stubs or poor pages may have a greater
> probability of being poorly annotated.

In my case I only keep sentences that have at least one annotation (e.g. a
link that maps to one of the types I am interested in).

> The idea may also be extended with the other proposal, which I'll try to
> explain with an example. Imagine a page about a vegetable. If a city appears
> in a sentence inside this page, it may well appear not linked (i.e. not
> annotated), since the topics of the two articles aren't closely
> related. Conversely, I suspect that in a page about Geography, places
> are tagged more frequently. This is obviously a hypothesis, which should be
> verified more carefully.
>
> Another idea is to use only sentences containing links regarding the
> entities which may be interesting. For example:
> * "[[Milan|Milan]] is an industrial city" becomes: "<place>Milan</place> is
> an industrial city"
> * "[[Paris|Paris Hilton]] was drunk last Friday." becomes: "Paris was drunk
> last Friday" (this sentence is kept because the link text is in the list of
> candidates to be tagged as places, but in this case the anchor suggests
> it isn't one, hence it is a good negative example)
> "Paris is a very touristic city." is discarded because it doesn't contain
> any interesting link

I am not sure that "link richness" is related to the "entity relatedness" of
the topic of the article. That hypothesis would require some data-driven
validation.

Another fact to consider: on the page
http://en.wikipedia.org/wiki/Paris_Hilton , most occurrences of "Paris
Hilton" as the name of a person are not linked (because it would be confusing
for users to link to the same page). So it would be possible to pre-process
the markup by adding those recursive links on pages that refer to entities
with interesting types.

Yet another bias: if you take a page like
http://en.wikipedia.org/wiki/The_Simple_Life that mentions Paris Hilton many
times, only the first few occurrences are links.
The remaining occurrences of the first name "Paris" are never linked: that is
a huge false negative bias. Again, a dedicated preprocessing heuristic to
automatically propagate recurring name annotations inside a given page might
help.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
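
PS: to make the two preprocessing ideas above more concrete, here is a rough
sketch, not a tested pipeline. The function names (propagate_links, annotate),
the alias table and the target-to-type mapping are all made up for
illustration. It assumes the standard MediaWiki link order, [[target|label]],
and it links every unlinked repeat of a surface form already linked on the
page, plus the page's own title; first-name aliases such as "Paris" have to be
supplied by hand.

```python
import re

# Matches [[Target]] and [[Target|label]] (standard MediaWiki order).
WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")


def propagate_links(markup, page_title, extra_aliases=None):
    """Link unlinked repeats of surface forms already linked on the page."""
    # Surface form -> link target; hand-written aliases are an assumption.
    surface_to_target = dict(extra_aliases or {})
    surface_to_target.setdefault(page_title, page_title)  # self-link
    for m in WIKILINK.finditer(markup):
        target, label = m.group(1), m.group(2) or m.group(1)
        surface_to_target.setdefault(label, target)

    # Longest forms first so "Paris Hilton" wins over "Paris".
    forms = sorted(surface_to_target, key=len, reverse=True)
    mention = re.compile(r"\b(?:%s)\b" % "|".join(map(re.escape, forms)))

    def link(m):
        return "[[%s|%s]]" % (surface_to_target[m.group(0)], m.group(0))

    out, pos = [], 0
    for m in WIKILINK.finditer(markup):
        out.append(mention.sub(link, markup[pos:m.start()]))  # plain text gap
        out.append(m.group(0))  # existing link is kept untouched
        pos = m.end()
    out.append(mention.sub(link, markup[pos:]))
    return "".join(out)


def annotate(text, target_types, wanted):
    """Return the tagged text, or None if no link has a wanted type."""
    found = []

    def repl(m):
        target, label = m.group(1), m.group(2) or m.group(1)
        etype = target_types.get(target)
        if etype in wanted:
            found.append(etype)
            return "<%s>%s</%s>" % (etype, label, etype)
        return label  # links of other types become plain text

    tagged = WIKILINK.sub(repl, text)
    return tagged if found else None


page = "[[Paris Hilton]] stars in the show. Later Paris moved on."
linked = propagate_links(page, "The Simple Life", {"Paris": "Paris Hilton"})
print(annotate(linked, {"Paris Hilton": "person"}, {"person"}))
```

Running propagate_links before annotate means the later bare "Paris" gets
tagged as well, instead of ending up as a false negative. Note that annotate
implements my filter (drop anything without at least one annotation of a
wanted type), so it would not keep negative examples of the kind Riccardo
describes.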
