Thanks Kritarth, Rupert and Pablo. This brings a lot of clarity.

Regards,
Anuj
On Fri, Aug 10, 2012 at 2:35 PM, Pablo N. Mendes <[email protected]> wrote:
> Hi all,
> It will perhaps be useful to organize the discussion around methods, rather
> than implementations. Talking about implementations may be especially
> confusing because:
> 1) DBpedia Spotlight has DBpedia in the name. However, there are no
> theoretical restrictions on the choice of KB, and not even actual technical
> restrictions either, although in practice there might still be a few pieces
> of hardcoded references in our codebase (which can be easily removed).
> 2) DBpedia Spotlight is an open source Scala/Java tool for you to install
> and use in house. However, it offers a web service deployment for
> demonstration that for obvious reasons does not expose all of the possible
> combinations of functionality that the underlying code is able to offer.
>
> Similarly to Stanbol, DBpedia Spotlight also assumes very little of
> vocabularies. If all you have are labels, you can use our CandidateSearcher
> and use a measure of "default sense" to pick a URI. We've experimented with
> p(URI) as the overall prominence of an entity in the KB. We've also looked
> at p(URI|label) as a measure for finding the default sense (a URI) for a
> given label (the phrase found in text).
>
> Now, if you have labels *and context*, you can do a lot more. We also offer
> a ContextSearcher where, given a label and a piece of text, one can obtain a
> rank of the most likely URIs given that context. Comparisons are made based
> on cosine similarity between vectors with tf*icf weights (modified tf*idf).
> In practical terms, we search a Lucene index using a custom similarity
> class. The task is to compare a vector made out of the input text with many
> vectors representing each entity in your target KB. We call these entity
> representations "context".
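The two modes Pablo describes (CandidateSearcher with a "default sense" prior, and ContextSearcher ranking by cosine similarity) can be sketched roughly as follows. The counts, context weights, and helper names here are invented for illustration; plain term counts stand in for Spotlight's tf*icf weights and Lucene index:

```python
from collections import Counter
import math

# Toy statistics standing in for a KB; all numbers are made up.
# mentions["berlin"] counts how often the surface form "berlin"
# was observed linking to each candidate URI.
mentions = {
    "berlin": Counter({"dbpedia:Berlin": 90, "dbpedia:Berlin_(band)": 10}),
}

def default_sense(label):
    """p(URI|label): pick the URI most often meant by this surface form."""
    counts = mentions[label]
    uri, n = counts.most_common(1)[0]
    return uri, n / sum(counts.values())

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_by_context(context_vectors, text_tokens):
    """ContextSearcher-style ranking: compare the input text against
    the context vector of each candidate entity."""
    query = Counter(text_tokens)
    return sorted(
        ((uri, cosine(query, vec)) for uri, vec in context_vectors.items()),
        key=lambda pair: pair[1], reverse=True,
    )

# Invented context vectors for the two senses of "berlin".
context_vectors = {
    "dbpedia:Berlin": Counter({"city": 3, "germany": 4, "capital": 2}),
    "dbpedia:Berlin_(band)": Counter({"band": 4, "song": 3, "album": 2}),
}

uri, p = default_sense("berlin")  # the most prominent sense wins with no context
ranked = rank_by_context(context_vectors, ["the", "band", "played", "a", "song"])
```

With no context, "berlin" falls back to the prominent city sense; given the tokens of a sentence about music, the band sense ranks first.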
>
> There are many ways to obtain context for entities at "training" time:
> 1) Lesk-style: perhaps the oldest technique, models context based on
> "definitions" of each entity (dictionary style). If the incoming text
> contains many terms in the definition of entities, then you assume that the
> entity is close in meaning to the text, therefore is the right one to pick.
> 2) Shallow KB neighborhood: collects, for each entity, the labels of other
> entities in the neighborhood based on the KB structure (this is what Rupert
> mentioned). This is rather similar to Lesk-style, but has the cool feature
> (in an RDF world) of not really requiring dictionary entries, but just
> using the relationships in the KB to obtain more "words".
> 3) Occurrence/Mention-based: collects examples where the entity is known to
> have occurred / been mentioned. These examples are paragraphs mentioning
> the entity (and usually also other entities). So when the input text looks
> like one of these paragraphs (rather, the aggregation of all these
> paragraphs) for an entity, we assume that the entity is the right one to
> pick.
>
> For all three cases above, the model of the context is a vector of words
> and can, therefore, use either Stanbol or DBpedia Spotlight's
> implementations. Note that 3 will include both 1 and 2 (guaranteed for
> Wikipedia, expected in general for most reasonable training data), and
> that's why DBpedia Spotlight uses that by default. However, in practical
> terms, all that DBpedia Spotlight asks for is "some text" that the user can
> be free to generate however he/she wants.
>
> Besides the 3 methods above, there are other graph-based algorithms, joint
> inference for collective disambiguation algorithms, and so on. But I have
> omitted them for brevity, as they are not directly related to the questions
> raised by this thread.
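All three ways of obtaining context end in the same representation, a bag of words per entity, which is why they are interchangeable downstream. A minimal sketch, using a made-up mini-KB and plain whitespace-ish tokenization as stand-ins:

```python
from collections import Counter
import re

def tokenize(text):
    """Crude tokenizer used only for this sketch."""
    return re.findall(r"[a-z]+", text.lower())

# 1) Lesk-style: context = the entity's dictionary-style definition.
def lesk_context(definition):
    return Counter(tokenize(definition))

# 2) Shallow KB neighborhood: context = labels of entities linked
#    from this entity in the KB structure.
def neighborhood_context(kb, uri):
    words = []
    for neighbor in kb["links"].get(uri, []):
        words += tokenize(kb["labels"][neighbor])
    return Counter(words)

# 3) Occurrence/Mention-based: context = the aggregation of all
#    paragraphs known to mention the entity.
def occurrence_context(paragraphs):
    words = []
    for paragraph in paragraphs:
        words += tokenize(paragraph)
    return Counter(words)

# Hypothetical mini-KB, invented for illustration only.
kb = {
    "labels": {"dbpedia:Germany": "Germany", "dbpedia:Spree": "Spree River"},
    "links": {"dbpedia:Berlin": ["dbpedia:Germany", "dbpedia:Spree"]},
}

c1 = lesk_context("Capital and largest city of Germany")
c2 = neighborhood_context(kb, "dbpedia:Berlin")
c3 = occurrence_context(["Berlin is the capital of Germany.",
                         "The Spree flows through Berlin."])
```

Since each builder returns the same word-vector shape, any of the three can feed the cosine-ranking step; this mirrors Pablo's point that method 3 tends to subsume 1 and 2.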
>
> It would be interesting to compare 1, 2 and 3 so that users of Stanbol can
> have an idea of the minimal accuracy expected in different cases, and how it
> can increase as you provide more context.
>
> Hope this helps.
>
> Cheers,
> Pablo
>
> PS: I used "label" here where we usually use "surface form" in DBpedia
> Spotlight. We consider "label" to be more like the "name" of an entity, or
> the value for "rdfs:label", while "surface form" is any phrase used to
> refer to an entity in text, even if it's not an rdfs:label. To keep it
> simple, I also used "entity" where we usually talk about "resource" in
> DBpedia Spotlight.
>
> On Thu, Aug 9, 2012 at 10:06 PM, Rupert Westenthaler <[email protected]> wrote:
>
> > Hi,
> >
> > Stanbol currently assumes very little of vocabularies. Basically you
> > need only a URI and a label to get an entity suggested.
> >
> > If you want to do some kind of disambiguation you will clearly need
> > more information about entities.
> >
> > Here the question is what kind of information the "spotlight approach"
> > needs. AFAIK this approach is based on "surface forms" (labels used to
> > refer to an entity) and "mentions" (sentences that mention an entity).
> > Kritarth, please correct me if I get this wrong. But if this is
> > correct, users would need to provide "mentions" to be able to use
> > DBpedia Spotlight-like disambiguation.
> >
> > I think other rather typical information would be the "semantic
> > context": other entities referenced by an entity. Based on that one
> > can also do disambiguation (e.g. Solr MLT over the labels of the
> > semantic context with the labels of the current sentence; or MLT over
> > the URIs of the semantic context with URIs of other extracted entities
> > in the current sentence/text section or the whole document).
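In its simplest form, Rupert's "semantic context" idea amounts to preferring the candidate whose KB neighbors overlap most with the other entities already extracted from the same sentence. The mini-KB below is invented for illustration, and plain set overlap stands in for a Solr MLT query:

```python
# Hypothetical neighbor sets: the entities each candidate URI references
# in the KB (the "semantic context"). Invented for illustration.
neighbors = {
    "dbpedia:Paris": {"dbpedia:France", "dbpedia:Seine"},
    "dbpedia:Paris,_Texas": {"dbpedia:Texas", "dbpedia:United_States"},
}

def disambiguate(candidates, cooccurring_uris):
    """Rank candidate URIs by overlap between their semantic context
    and the URIs of other entities found in the same sentence."""
    return max(candidates,
               key=lambda uri: len(neighbors.get(uri, set()) & cooccurring_uris))

# If "France" was also extracted from the sentence, the French city wins.
best = disambiguate(["dbpedia:Paris", "dbpedia:Paris,_Texas"],
                    {"dbpedia:France"})
```

The same pattern works over labels instead of URIs, which matches the two MLT variants Rupert lists.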
> >
> > best
> > Rupert
> >
> > On Thu, Aug 9, 2012 at 7:27 PM, kritarth anand <[email protected]> wrote:
> > > I was not sure if the spotlight approach would work for all kinds of
> > > vocabularies that Stanbol might have.
> > >
> > > I was concerned that the structure of vocabulary it assumes is satisfied
> > > by dbpedia but might not be satisfied by any custom vocabulary we might
> > > have in any other deployment.
> > >
> > > On Thu, Aug 9, 2012 at 10:51 PM, Anuj Kumar <[email protected]> wrote:
> > >
> > >> Hi Kritarth,
> > >>
> > >> Thanks for the explanation. The Spotlight approach sounds good to me,
> > >> but if you have time, it would be good to compare it with the other two
> > >> for the purpose of this study.
> > >>
> > >> On the third point, I am still not clear. Do you want to convey that
> > >> Spotlight's disambiguation algorithm can work only with DBpedia?
> > >>
> > >> Regards,
> > >> Anuj
> > >>
> > >> On Thu, Aug 9, 2012 at 8:18 PM, kritarth anand <[email protected]> wrote:
> > >>
> > >> > Dear Anuj,
> > >> >
> > >> > Sorry for the delayed reply.
> > >> >
> > >> > 1. In the current implementation of Stanbol what we see essentially is:
> > >> >    a. We find all the entities in the given paragraph.
> > >> >    b. For each entity, we query dbpedia with a string of the other
> > >> >       entities as additional info.
> > >> >    c. Then we change the confidence values.
> > >> >
> > >> > 3. I'll answer this one first. I am not very sure of what Stanbol
> > >> > expects from a vocabulary. All the other papers I had read were not
> > >> > making any assumptions on vocabulary; mainly they were using
> > >> > Wikipedia. I was confused whether that meant more flexibility. After
> > >> > discussion with Pablo and Rupert, I think it is the way to go.
> > >> >
> > >> > 2.
I am inclined towards using the Spotlight approach, as it seems to be
> > >> > better than the other two, and I would like comments from you on
> > >> > whether it is a good way to proceed.
> > >> >
> > >> > Kritarth
> > >> >
> > >> > On Sun, Jul 29, 2012 at 11:29 AM, Anuj Kumar <[email protected]> wrote:
> > >> >
> > >> > > Hi Kritarth,
> > >> > >
> > >> > > Thanks for sharing the details. I have a few questions-
> > >> > >
> > >> > > 1. Can you elaborate on the current implementation? Is it using the
> > >> > > existing MLT feature?
> > >> > > 2. Which one of the three algorithms are you planning to use?
> > >> > > 3. On the spotlight part, can you explain more on why you say- "I am
> > >> > > not sure if we can play around that much with any vocabulary and not
> > >> > > just DBpedia."?
> > >> > >
> > >> > > Also, there is a minor typo in the report under the Approach
> > >> > > section- "Yhe behavior can be explained as follows:"
> > >> > >
> > >> > > Thanks,
> > >> > > Anuj
> > >> > >
> > >> > > On Wed, Jul 25, 2012 at 3:20 PM, kritarth anand <[email protected]> wrote:
> > >> > >
> > >> > > > Hi all,
> > >> > > >
> > >> > > > I would like to start more interaction with the Stanbol community
> > >> > > > by sharing the first iteration of the Entity Disambiguation
> > >> > > > Engine. I would really like you all to take a look at it and give
> > >> > > > me your valuable opinion.
> > >> > > >
> > >> > > > https://github.com/kritarthanand/Disambiguation-Stanbol
> > >> > > >
> > >> > > > The repo consists of the engine's code. It is very easy to
> > >> > > > install; the instructions are present in the Readme file.
> > >> > > >
> > >> > > > Besides the engine, it also contains my Mid Term Report, which
> > >> > > > describes the engine a little and also talks about future possible
> > >> > > > algorithms that can be used for Entity Disambiguation.
> > >> > > > Disambiguation is a complex problem, and we should have a solution
> > >> > > > that is efficient and performs well too. Therefore I would really
> > >> > > > like the Stanbol community to take part in the discussion with
> > >> > > > enthusiasm.
> > >> > > >
> > >> > > > Please share your views,
> > >> > > >
> > >> > > > Kritarth
> >
> > --
> > | Rupert Westenthaler             [email protected]
> > | Bodenlehenstraße 11             ++43-699-11108907
> > | A-5500 Bischofshofen
>
> --
> ---
> Pablo N. Mendes
> http://pablomendes.com
> Events: http://wole2012.eurecom.fr
