Hi all,
It will perhaps be useful to organize the discussion around methods, rather
than implementations. Talking about implementations may be especially
confusing because:
1) DBpedia Spotlight has DBpedia in the name. However, there are no
theoretical restrictions on the choice of KB, and no actual technical
restrictions either, although in practice there might still be a few
hardcoded references in our codebase (which can be easily removed).
2) DBpedia Spotlight is an open source Scala/Java tool for you to install
and use in house. However, it offers a web service deployment for
demonstration that for obvious reasons does not expose all of the possible
combinations of functionality that the underlying code is able to offer.

Similarly to Stanbol, DBpedia Spotlight also assumes very little about
vocabularies. If all you have are labels, you can use our CandidateSearcher
with a measure of "default sense" to pick a URI. We've experimented with
p(URI) as the overall prominence of an entity in the KB. We've also looked
at p(URI|label) as a measure for finding the default sense (a URI) for a
given label (the phrase found in text).
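To make the "default sense" idea concrete, here is a rough Python sketch. The counts, labels and function names are made up for illustration; this is not Spotlight's actual API:

```python
# Hypothetical occurrence counts: how often each label (surface form)
# was observed referring to each URI in some training corpus.
label_uri_counts = {
    "berlin": {"dbpedia:Berlin": 9500, "dbpedia:Berlin,_New_Hampshire": 40},
}

def default_sense(label, counts):
    """Pick the URI maximizing p(URI|label), estimated from counts."""
    candidates = counts.get(label.lower())
    if not candidates:
        return None
    total = sum(candidates.values())
    # p(URI|label) = count(label -> URI) / count(label)
    return max(candidates, key=lambda uri: candidates[uri] / total)

print(default_sense("Berlin", label_uri_counts))  # dbpedia:Berlin
```

Note the difference from plain p(URI): conditioning on the label lets an entity that is rare overall still win for a label that almost always refers to it.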

Now, if you have labels *and context*, you can do a lot more. We also offer
a ContextSearcher where, given a label and a piece of text, one can obtain a
ranking of the most likely URIs given that context. Comparisons are made
based on cosine similarity between vectors with tf*icf weights (a modified
tf*idf).
In practical terms, we search a Lucene index using a custom similarity
class. The task is to compare a vector made out of the input text with many
vectors representing each entity in your target KB. We call these entity
representations "context".
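As a rough illustration of that ranking step, here is a self-contained Python sketch of cosine similarity over tf*icf-weighted vectors. In Spotlight this actually runs inside Lucene with a custom similarity class; the toy contexts, tokenization and function names below are assumptions made for the example:

```python
import math
from collections import Counter

# Toy entity "contexts": bags of words aggregated per candidate entity.
# (In Spotlight these live in a Lucene index, not an in-memory dict.)
entity_contexts = {
    "dbpedia:Apple_Inc.": "iphone mac company cupertino software".split(),
    "dbpedia:Apple": "fruit tree orchard pie juice".split(),
}

def icf(term, contexts):
    """Inverse candidate frequency: like idf, but the 'documents' are
    the candidate entities' context vectors."""
    n = len(contexts)
    df = sum(1 for words in contexts.values() if term in words)
    return math.log(n / df) if df else 0.0

def tficf_vector(words, contexts):
    """Weight each term by its frequency times its icf."""
    tf = Counter(words)
    return {t: tf[t] * icf(t, contexts) for t in tf}

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank(text, contexts):
    """Rank candidate entities by similarity to the input text."""
    q = tficf_vector(text.lower().split(), contexts)
    scored = {e: cosine(q, tficf_vector(w, contexts))
              for e, w in contexts.items()}
    return sorted(scored.items(), key=lambda kv: -kv[1])

print(rank("the new iphone software", entity_contexts)[0][0])
```

Terms shared by all candidates get icf 0, so only the discriminative words drive the ranking, which is exactly the point of the icf modification.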

There are many ways to obtain context for entities at "training" time:
1) Lesk-style: perhaps the oldest technique; it models context based on
"definitions" of each entity (dictionary-style). If the incoming text
contains many terms from the definition of an entity, then you assume that
the entity is close in meaning to the text and is therefore the right one
to pick.
2) Shallow KB neighborhood: collects, for each entity, the labels of other
entities in the neighborhood based on the KB structure (this is what Rupert
mentioned). This is rather similar to Lesk-style, but has the cool feature
(in an RDF world) of not really requiring dictionary entries, but just
using the relationships in the KB to obtain more "words".
3) Occurrence/Mention-based: collects examples where the entity is known to
have occurred / been mentioned. These examples are paragraphs mentioning
the entity (and usually also other entities). So when the input text looks
like one of these paragraphs (rather, the aggregation of all these
paragraphs) for an entity, we assume that the entity is the right one to
pick.
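A minimal sketch of option 3, aggregating per-entity contexts from an annotated corpus (the corpus, tokenization and URIs below are made up for the example; in Spotlight's default setup the paragraphs come from Wikipedia):

```python
from collections import defaultdict

# Hypothetical annotated corpus: (paragraph_text, [entities mentioned]).
corpus = [
    ("Berlin is the capital of Germany.", ["dbpedia:Berlin"]),
    ("Berlin, New Hampshire sits on the Androscoggin River.",
     ["dbpedia:Berlin,_New_Hampshire"]),
    ("The Reichstag in Berlin was rebuilt in the 1990s.", ["dbpedia:Berlin"]),
]

def build_contexts(corpus):
    """Aggregate, per entity, the words of every paragraph that mentions it."""
    contexts = defaultdict(list)
    for paragraph, entities in corpus:
        # crude tokenization, just for the sketch
        words = paragraph.lower().replace(",", "").replace(".", "").split()
        for entity in entities:
            contexts[entity].extend(words)
    return contexts

ctx = build_contexts(corpus)
# "capital" and "reichstag" both land in dbpedia:Berlin's aggregated context
```

The aggregated word bag per entity is then what gets turned into the tf*icf vector at disambiguation time.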


For all three cases above, the model of the context is a vector of words
and can, therefore, use either Stanbol or DBpedia Spotlight's
implementations. Note that 3 will include both 1 and 2 (guaranteed for
Wikipedia, expected in general for most reasonable training data), and
that's why DBpedia Spotlight uses that by default. However, in practical
terms, all that DBpedia Spotlight asks for is "some text", which the user
is free to generate however they want.

Besides the 3 methods above, there are other graph-based algorithms, joint
inference for collective disambiguation algorithms, and so on. But I have
omitted them for brevity, as they are not directly related to the questions
raised by this thread.

It would be interesting to compare 1, 2 and 3 so that users of Stanbol can
get an idea of the minimal accuracy to expect in each case, and of how
accuracy increases as more context is provided.

Hope this helps.

Cheers,
Pablo

PS: I used "label" here where we usually use "surface form" in DBpedia
Spotlight. We consider "label" to be more like the "name" of an entity, or
the value for "rdfs:label", while "surface form" is any phrase used to
refer to an entity in text, even if it's not an rdfs:label. To keep it
simple, I also used "entity" where we usually talk about "resource" in
DBpedia Spotlight.


On Thu, Aug 9, 2012 at 10:06 PM, Rupert Westenthaler <
[email protected]> wrote:

> Hi,
>
> Stanbol currently assumes very little about Vocabularies. Basically you
> need only a URI and a label to get an Entity suggested.
>
> If you want to do some kind of disambiguation you will clearly need
> more information about Entities.
>
> Here the question is what kind of information the "spotlight approach"
> needs. AFAIK this approach is based on "surface forms" - labels used
> to refer to an Entity and "mentions" - sentences that mention an
> Entity. Kritarth, please correct me if I get this wrong. But if this is
> correct, users would need to provide "mentions" to be able to do
> DBpedia Spotlight-style disambiguation.
>
> I think other rather typical information would be the "semantic
> context" - other entities referenced by an Entity. Based on that one
> can also do disambiguation (e.g. Solr MLT over the labels of the
> semantic context with the labels of the current sentence; or MLT over
> the URIs of the semantic Context with URIs of other extracted Entities
> in the current sentence/text section of the whole document).
>
> best
> Rupert
>
> On Thu, Aug 9, 2012 at 7:27 PM, kritarth anand <[email protected]>
> wrote:
> > I was not sure if the Spotlight approach would work for all kinds of
> > vocabularies that Stanbol might have.
> >
> > I was concerned that the structure of the vocabulary it assumes is
> > satisfied by DBpedia but might not be satisfied by a custom vocabulary
> > we might have in another deployment.
> >
> > On Thu, Aug 9, 2012 at 10:51 PM, Anuj Kumar <[email protected]> wrote:
> >
> >> Hi Kritarth,
> >>
> >> Thanks for the explanation. Spotlight approach sounds good to me but if
> you
> >> have time, it would be good to compare it with the other two for the
> >> purpose of this study.
> >>
> >> On the third point, I am still not clear. Do you want to convey that
> >> Spotlight's disambiguation algorithm can work only with DBpedia?
> >>
> >> Regards,
> >> Anuj
> >>
> >> On Thu, Aug 9, 2012 at 8:18 PM, kritarth anand <
> [email protected]
> >> >wrote:
> >>
> >> > Dear Anuj,
> >> >
> >> > Sorry for the delayed reply.
> >> >
> >> > 1. In the current implementation of Stanbol, what we see essentially is:
> >> >       a. We find all the entities in the given paragraph
> >> >       b. For each entity, we query DBpedia with a string of the other
> >> > entities as additional info
> >> >       c. We then adjust the confidence values
> >> >
> >> > 3. I'll answer this one first. I am not very sure what Stanbol expects
> >> > from a vocabulary. All the other papers I had read were not making any
> >> > assumptions about the vocabulary; mainly they were using Wikipedia. I
> >> > was confused about whether that meant more flexibility. After
> >> > discussing with Pablo and Rupert, I think it is the way to go.
> >> >
> >> > 2. I am inclined towards using the Spotlight approach, as it seems to
> >> > be better than the other two, and I would like comments from you guys
> >> > on whether it is a good way to proceed.
> >> >
> >> > Kritarth
> >> >
> >> >
> >> > On Sun, Jul 29, 2012 at 11:29 AM, Anuj Kumar <[email protected]>
> wrote:
> >> >
> >> > > Hi Kritarth,
> >> > >
> >> > > Thanks for sharing the details. I have few questions-
> >> > >
> >> > > 1. Can you elaborate the current implementation? Is it using the
> >> existing
> >> > > MLT feature?
> >> > > 2. Which one of the three algorithms are you planning to use?
> >> > > 3. On the spotlight part, can you explain more on why you say- "I am
> >> not
> >> > > sure if we can play around that much with any vocabulary and not
> just
> >> > > DBpedia."?
> >> > >
> >> > > Also, there is a minor typo in the report under Approach section-
> "Yhe
> >> > > behavior
> >> > > can be explained as follows:"
> >> > >
> >> > > Thanks,
> >> > > Anuj
> >> > >
> >> > > On Wed, Jul 25, 2012 at 3:20 PM, kritarth anand <
> >> > [email protected]
> >> > > >wrote:
> >> > >
> >> > > > Hi all,
> >> > > >
> >> > > > I would like to start more interaction with the Stanbol Community
> by
> >> > > > sharing the first iteration of the Entity Disambiguation Engine. I
> >> > would
> >> > > > really like you all to take a look at it and give me your valuable
> >> > > opinion.
> >> > > >
> >> > > > https://github.com/kritarthanand/Disambiguation-Stanbol
> >> > > >
> >> > > > The repo consists of the engine's code. It is very easy to install;
> >> > > > the instructions are in the Readme file.
> >> > > >
> >> > > > Besides the engine, it also contains my Mid Term Report, which
> >> > > > describes the engine a little and also talks about possible future
> >> > > > algorithms that can be used for Entity Disambiguation.
> >> > > > Disambiguation is a complex problem, and we should have a solution
> >> > > > that is both efficient and performs well. Therefore I would really
> >> > > > like the Stanbol community to take part in the discussion with
> >> > > > enthusiasm.
> >> > > >
> >> > > > Please share your views,
> >> > > >
> >> > > >
> >> > > > Kritarth
> >> > > >
> >> > >
> >> >
> >>
>
>
>
> --
> | Rupert Westenthaler             [email protected]
> | Bodenlehenstraße 11                             ++43-699-11108907
> | A-5500 Bischofshofen
>



-- 
---
Pablo N. Mendes
http://pablomendes.com
Events: http://wole2012.eurecom.fr
