Thanks Kritarth, Rupert and Pablo. This brings a lot of clarity.

Regards,
Anuj
On Fri, Aug 10, 2012 at 2:35 PM, Pablo N. Mendes <[email protected]> wrote:
> Hi all,
> It will perhaps be useful to organize the discussion around methods, rather
> than implementations. Talking about implementations may be especially
> confusing because:
> 1) DBpedia Spotlight has DBpedia in the name. However, there are no
> theoretical restrictions on the choice of KB, and not even actual technical
> restrictions either, although in practice there might still be a few pieces
> of hardcoded references in our codebase (which can be easily removed).
> 2) DBpedia Spotlight is an open source Scala/Java tool for you to install
> and use in house. However, it offers a web service deployment for
> demonstration that for obvious reasons does not expose all of the possible
> combinations of functionality that the underlying code is able to offer.
>
> Similarly to Stanbol, DBpedia Spotlight also assumes very little of
> vocabularies. If all you have are labels, you can use our CandidateSearcher
> and use a measure of "default sense" to pick a URI. We've experimented with
> p(URI) as the overall prominence of an entity in the KB. We've also looked
> at p(URI|label) as a measure for finding the default sense (a URI) for a
> given label (the phrase found in text).
>
> Now, if you have labels *and context*, you can do a lot more. We also offer
> a ContextSearcher where, given a label and a piece of text, one can obtain a
> rank of the most likely URIs given that context. Comparisons are made based
> on cosine similarity between vectors with tf*icf weights (modified tf*idf).
> In practical terms, we search a Lucene index using a custom similarity
> class. The task is to compare a vector made out of the input text with many
> vectors representing each entity in your target KB. We call these entity
> representations "context".
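The two modes Pablo describes (CandidateSearcher with a "default sense" prior, and ContextSearcher ranking by cosine similarity) can be sketched roughly as follows. The counts, context weights, and helper names here are invented for illustration; plain term counts stand in for Spotlight's tf*icf weights and Lucene index:

```python
from collections import Counter
import math

# Toy statistics standing in for a KB; all numbers are made up.
# mentions["berlin"] counts how often the surface form "berlin"
# was observed linking to each candidate URI.
mentions = {
    "berlin": Counter({"dbpedia:Berlin": 90, "dbpedia:Berlin_(band)": 10}),
}

def default_sense(label):
    """p(URI|label): pick the URI most often meant by this surface form."""
    counts = mentions[label]
    uri, n = counts.most_common(1)[0]
    return uri, n / sum(counts.values())

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_by_context(context_vectors, text_tokens):
    """ContextSearcher-style ranking: compare the input text against
    the context vector of each candidate entity."""
    query = Counter(text_tokens)
    return sorted(
        ((uri, cosine(query, vec)) for uri, vec in context_vectors.items()),
        key=lambda pair: pair[1], reverse=True,
    )

# Invented context vectors for the two senses of "berlin".
context_vectors = {
    "dbpedia:Berlin": Counter({"city": 3, "germany": 4, "capital": 2}),
    "dbpedia:Berlin_(band)": Counter({"band": 4, "song": 3, "album": 2}),
}

uri, p = default_sense("berlin")  # the most prominent sense wins with no context
ranked = rank_by_context(context_vectors, ["the", "band", "played", "a", "song"])
```

With no context, "berlin" falls back to the prominent city sense; given the tokens of a sentence about music, the band sense ranks first.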
>
> There are many ways to obtain context for entities at "training" time:
> 1) Lesk-style: perhaps the oldest technique, models context based on
> "definitions" of each entity (dictionary style). If the incoming text
> contains many terms in the definition of entities, then you assume that the
> entity is close in meaning to the text, therefore is the right one to pick.
> 2) Shallow KB neighborhood: collects, for each entity, the labels of other
> entities in the neighborhood based on the KB structure (this is what Rupert
> mentioned). This is rather similar to Lesk-style, but has the cool feature
> (in an RDF world) of not really requiring dictionary entries, but just
> using the relationships in the KB to obtain more "words".
> 3) Occurrence/Mention-based: collects examples where the entity is known to
> have occurred / been mentioned. These examples are paragraphs mentioning
> the entity (and usually also other entities). So when the input text looks
> like one of these paragraphs (rather, the aggregation of all these
> paragraphs) for an entity, we assume that the entity is the right one to
> pick.
>
> For all three cases above, the model of the context is a vector of words
> and can, therefore, use either Stanbol or DBpedia Spotlight's
> implementations. Note that 3 will include both 1 and 2 (guaranteed for
> Wikipedia, expected in general for most reasonable training data), and
> that's why DBpedia Spotlight uses that by default. However, in practical
> terms, all that DBpedia Spotlight asks for is "some text" that the user can
> be free to generate however he/she wants.
>
> Besides the 3 methods above, there are other graph-based algorithms, joint
> inference for collective disambiguation algorithms, and so on. But I have
> omitted them for brevity, as they are not directly related to the questions
> raised by this thread.
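All three ways of obtaining context end in the same representation, a bag of words per entity, which is why they are interchangeable downstream. A minimal sketch, using a made-up mini-KB and plain whitespace-ish tokenization as stand-ins:

```python
from collections import Counter
import re

def tokenize(text):
    """Crude tokenizer used only for this sketch."""
    return re.findall(r"[a-z]+", text.lower())

# 1) Lesk-style: context = the entity's dictionary-style definition.
def lesk_context(definition):
    return Counter(tokenize(definition))

# 2) Shallow KB neighborhood: context = labels of entities linked
#    from this entity in the KB structure.
def neighborhood_context(kb, uri):
    words = []
    for neighbor in kb["links"].get(uri, []):
        words += tokenize(kb["labels"][neighbor])
    return Counter(words)

# 3) Occurrence/Mention-based: context = the aggregation of all
#    paragraphs known to mention the entity.
def occurrence_context(paragraphs):
    words = []
    for paragraph in paragraphs:
        words += tokenize(paragraph)
    return Counter(words)

# Hypothetical mini-KB, invented for illustration only.
kb = {
    "labels": {"dbpedia:Germany": "Germany", "dbpedia:Spree": "Spree River"},
    "links": {"dbpedia:Berlin": ["dbpedia:Germany", "dbpedia:Spree"]},
}

c1 = lesk_context("Capital and largest city of Germany")
c2 = neighborhood_context(kb, "dbpedia:Berlin")
c3 = occurrence_context(["Berlin is the capital of Germany.",
                         "The Spree flows through Berlin."])
```

Since each builder returns the same word-vector shape, any of the three can feed the cosine-ranking step; this mirrors Pablo's point that method 3 tends to subsume 1 and 2.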
>
> It would be interesting to compare 1, 2 and 3 so that users of Stanbol can
> have an idea of the minimal accuracy expected in different cases, and how it
> can increase as you provide more context.
>
> Hope this helps.
>
> Cheers,
> Pablo
>
> PS: I used "label" here where we usually use "surface form" in DBpedia
> Spotlight. We consider "label" to be more like the "name" of an entity, or
> the value for "rdfs:label", while "surface form" is any phrase used to
> refer to an entity in text, even if it's not an rdfs:label. To keep it
> simple, I also used "entity" where we usually talk about "resource" in
> DBpedia Spotlight.
>
> On Thu, Aug 9, 2012 at 10:06 PM, Rupert Westenthaler <[email protected]> wrote:
>
> > Hi,
> >
> > Stanbol currently assumes very little of vocabularies. Basically you
> > need only a URI and a label to get an entity suggested.
> >
> > If you want to do some kind of disambiguation you will clearly need
> > more information about entities.
> >
> > Here the question is what kind of information the "spotlight approach"
> > needs. AFAIK this approach is based on "surface forms" (labels used to
> > refer to an entity) and "mentions" (sentences that mention an entity).
> > Kritarth, please correct me if I get this wrong. But if this is
> > correct, users would need to provide "mentions" to be able to use
> > DBpedia Spotlight-like disambiguation.
> >
> > I think other rather typical information would be the "semantic
> > context": other entities referenced by an entity. Based on that one
> > can also do disambiguation (e.g. Solr MLT over the labels of the
> > semantic context with the labels of the current sentence; or MLT over
> > the URIs of the semantic context with URIs of other extracted entities
> > in the current sentence/text section or the whole document).
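In its simplest form, Rupert's "semantic context" idea amounts to preferring the candidate whose KB neighbors overlap most with the other entities already extracted from the same sentence. The mini-KB below is invented for illustration, and plain set overlap stands in for a Solr MLT query:

```python
# Hypothetical neighbor sets: the entities each candidate URI references
# in the KB (the "semantic context"). Invented for illustration.
neighbors = {
    "dbpedia:Paris": {"dbpedia:France", "dbpedia:Seine"},
    "dbpedia:Paris,_Texas": {"dbpedia:Texas", "dbpedia:United_States"},
}

def disambiguate(candidates, cooccurring_uris):
    """Rank candidate URIs by overlap between their semantic context
    and the URIs of other entities found in the same sentence."""
    return max(candidates,
               key=lambda uri: len(neighbors.get(uri, set()) & cooccurring_uris))

# If "France" was also extracted from the sentence, the French city wins.
best = disambiguate(["dbpedia:Paris", "dbpedia:Paris,_Texas"],
                    {"dbpedia:France"})
```

The same pattern works over labels instead of URIs, which matches the two MLT variants Rupert lists.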
> >
> > best
> > Rupert
> >
> > On Thu, Aug 9, 2012 at 7:27 PM, kritarth anand <[email protected]> wrote:
> > > I was not sure if the spotlight approach would work for all kinds of
> > > vocabularies that Stanbol might have.
> > >
> > > I was concerned that the structure of vocabulary it assumes is satisfied
> > > by dbpedia but might not be satisfied by any custom vocabulary we might
> > > have in any other deployment.
> > >
> > > On Thu, Aug 9, 2012 at 10:51 PM, Anuj Kumar <[email protected]> wrote:
> > >
> > >> Hi Kritarth,
> > >>
> > >> Thanks for the explanation. The Spotlight approach sounds good to me,
> > >> but if you have time, it would be good to compare it with the other two
> > >> for the purpose of this study.
> > >>
> > >> On the third point, I am still not clear. Do you want to convey that
> > >> Spotlight's disambiguation algorithm can work only with DBpedia?
> > >>
> > >> Regards,
> > >> Anuj
> > >>
> > >> On Thu, Aug 9, 2012 at 8:18 PM, kritarth anand <[email protected]> wrote:
> > >>
> > >> > Dear Anuj,
> > >> >
> > >> > Sorry for the delayed reply.
> > >> >
> > >> > 1. In the current implementation of Stanbol what we see essentially is:
> > >> >    a. We find all the entities in the given paragraph.
> > >> >    b. For each entity, we query dbpedia with a string of the other
> > >> >       entities as additional info.
> > >> >    c. Then we change the confidence values.
> > >> >
> > >> > 3. I'll answer this one first. I am not very sure of what Stanbol
> > >> > expects from a vocabulary. All the other papers I had read were not
> > >> > making any assumptions on vocabulary; mainly they were using
> > >> > Wikipedia. I was confused whether that meant more flexibility. After
> > >> > discussion with Pablo and Rupert, I think it is the way to go.
> > >> >
> > >> > 2.
I am inclined towards using the Spotlight approach, as it seems to be
> > >> > better than the other two, and I would like comments from you on
> > >> > whether it is a good way to proceed.
> > >> >
> > >> > Kritarth
> > >> >
> > >> > On Sun, Jul 29, 2012 at 11:29 AM, Anuj Kumar <[email protected]> wrote:
> > >> >
> > >> > > Hi Kritarth,
> > >> > >
> > >> > > Thanks for sharing the details. I have a few questions-
> > >> > >
> > >> > > 1. Can you elaborate on the current implementation? Is it using the
> > >> > > existing MLT feature?
> > >> > > 2. Which one of the three algorithms are you planning to use?
> > >> > > 3. On the spotlight part, can you explain more on why you say- "I am
> > >> > > not sure if we can play around that much with any vocabulary and not
> > >> > > just DBpedia."?
> > >> > >
> > >> > > Also, there is a minor typo in the report under the Approach
> > >> > > section- "Yhe behavior can be explained as follows:"
> > >> > >
> > >> > > Thanks,
> > >> > > Anuj
> > >> > >
> > >> > > On Wed, Jul 25, 2012 at 3:20 PM, kritarth anand <[email protected]> wrote:
> > >> > >
> > >> > > > Hi all,
> > >> > > >
> > >> > > > I would like to start more interaction with the Stanbol community
> > >> > > > by sharing the first iteration of the Entity Disambiguation
> > >> > > > Engine. I would really like you all to take a look at it and give
> > >> > > > me your valuable opinion.
> > >> > > >
> > >> > > > https://github.com/kritarthanand/Disambiguation-Stanbol
> > >> > > >
> > >> > > > The repo consists of the engine's code. It is very easy to
> > >> > > > install; the instructions are present in the Readme file.
> > >> > > >
> > >> > > > Besides the engine, it also contains my Mid Term Report, which
> > >> > > > describes the engine a little and also talks about future possible
> > >> > > > algorithms that can be used for Entity Disambiguation.
> > >> > > > Disambiguation is a complex problem, and we should have a solution
> > >> > > > that is efficient and performs well too. Therefore I would really
> > >> > > > like the Stanbol community to take part in the discussion with
> > >> > > > enthusiasm.
> > >> > > >
> > >> > > > Please share your views,
> > >> > > >
> > >> > > > Kritarth
> >
> > --
> > | Rupert Westenthaler             [email protected]
> > | Bodenlehenstraße 11             ++43-699-11108907
> > | A-5500 Bischofshofen
>
> --
> ---
> Pablo N. Mendes
> http://pablomendes.com
> Events: http://wole2012.eurecom.fr
