Hi Abhishek, I suggest you submit your proposal straight to GSoC and we
comment there. If you have done so already, could you send us the link?
All the best,
Thiago

On Mon, Mar 23, 2015 at 8:54 AM, Abhishek Gupta <a.gu...@gmail.com> wrote:

> Hi all,
>
> Here are some comments in response to your answers:
>
>
>> Hi Abhishek, thanks for the work. Here are some answers:
>>
>> On Tue, Mar 17, 2015 at 9:10 AM, Abhishek Gupta <a.gu...@gmail.com>
>> wrote:
>>
>>> Hi Thiago,
>>>
>>> Sorry for the delay!
>>> I have set up the Spotlight server and it is running fine, though with
>>> minimal settings. After this setup I played with the Spotlight server,
>>> during which I came across some discrepancies, as follows:
>>>
>>> Example taken:
>>> http://spotlight.dbpedia.org/rest/annotate?text=First documented in the
>>> 13th century, Berlin was the capital of the Kingdom of Prussia (1701–1918),
>>> the German Empire (1871–1918), the Weimar Republic (1919–33) and the Third
>>> Reich (1933–45). Berlin in the 1920s was the third largest municipality in
>>> the world. In 1990 German reunification took place in whole Germany in
>>> which the city regained its status as the capital of Germany.
>>>
>>> 1) If we run this, "13th Century" is annotated with
>>> http://dbpedia.org/page/19th_century. This might be happening because
>>> the context is very much about the 19th century, and moreover "13th
>>> Century" and "19th Century" differ by only one character. But I am not
>>> sure whether this is good or bad.
>>>
>>
>> This might be due either to "13th Century" being wrongly linked to the
>> 19th century, or to the word "century" being linked to many different
>> centuries, which then causes a disambiguation error due to the context. I
>> think your example is a counter-example to the way we generate the data
>> structures used for disambiguation.
>>
>>
>>> In my opinion, if we have an entity in our store (
>>> http://dbpedia.org/page/13th_century) that perfectly matches a surface
>>> form in the raw text ("13th Century"), we should annotate the SF with
>>> that entity.
>>> The same might be the case with "Germany", which is associated with
>>> "History of Germany <http://dbpedia.org/page/History_of_Germany>" rather
>>> than "Germany <http://dbpedia.org/page/Germany>".
>>>
>>
>> In this case other factors might have crept in; it could be that Germany
>> has a larger number of inlinks, or some other metric allows it to
>> overtake the most natural candidate.
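>>
>> As a minimal sketch of what I mean (illustrative counts and names, not
>> Spotlight's actual data or code): the prior for a surface form is
>> roughly the fraction of Wikipedia links with that anchor text pointing
>> at each candidate entity, so a heavily linked entity can dominate:
>>
>>     # hypothetical link counts for the anchor text "Germany"
>>     link_counts = {"Germany": 50000, "History_of_Germany": 400}
>>     total = sum(link_counts.values())
>>     prior = {e: c / total for e, c in link_counts.items()}
>>     best = max(prior, key=prior.get)
>>     print(best, prior[best])  # -> Germany 0.992...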
>>
>>
>>>
>>> 2) We are spotting "place" and associating it with "Portland Place
>>> <http://dbpedia.org/resource/Portland_Place>", maybe due to SF stemming.
>>> And even "Location (geography)
>>> <http://dbpedia.org/page/Location_(geography)>" is not the correct
>>> entity for this. This is because we are not able to detect the sense of
>>> the word "place" itself. For that we may have to use word senses, e.g.
>>> from WordNet.
>>>
>>
>> The SF spotting pipeline works a bit like this: you get a candidate SF,
>> like 'Portland Place', and check whether there's a candidate entity for
>> it, but you also consider n-gram subparts, so it could have retrieved the
>> candidates associated with "place" instead.
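>>
>> A rough sketch of the subpart idea (a toy in-memory SF store, not the
>> actual implementation):
>>
>>     sf_store = {"portland place", "place", "berlin"}  # toy SF store
>>
>>     def subpart_candidates(tokens):
>>         """Enumerate contiguous n-grams; keep those known as SFs."""
>>         hits = []
>>         for i in range(len(tokens)):
>>             for j in range(i + 1, len(tokens) + 1):
>>                 ngram = " ".join(tokens[i:j]).lower()
>>                 if ngram in sf_store:
>>                     hits.append(ngram)
>>         return hits
>>
>>     print(subpart_candidates("reunification took place".split()))
>>     # -> ['place']: the single-word subpart matches even mid-phrase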
>>
>
> I understand what you said, but here I wanted to point out that "place"
> is not even a noun here, and we are trying to associate it with a Named
> Entity, which is a noun.
>
>
>>
>>
>>>
>>> 3) We are detecting ". Berlin" as a surface form, but I could not
>>> figure out where this SF comes from, and I suspect it doesn't come from
>>> Wikipedia.
>>>
>>
>> Although ". Berlin" is highlighted, the entity is matched on "Berlin",
>> the extra space and punctuation comes from the way we tokenize sentences.
>> We have chosen to use a language independent tokenizer using a break
>> iterator for speed and language independence, but it hasn't been tested
>> very well. This is the area which explains this mistake and help in it is
>> much appreciated.
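>>
>> If you want to poke at that area, here is a small sketch of
>> break-iterator word segmentation (assuming the PyICU bindings; Spotlight
>> itself does this on the JVM):
>>
>>     from icu import BreakIterator, Locale
>>
>>     text = "the world. Berlin in the 1920s"
>>     bi = BreakIterator.createWordInstance(Locale("en"))
>>     bi.setText(text)
>>
>>     start = bi.first()
>>     for end in bi:
>>         print(repr(text[start:end]))
>>         start = end
>>     # punctuation and spaces come out as separate segments; a mistake
>>     # when stitching segments back into SF offsets can produce spans
>>     # like ". Berlin"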
>>
>
> Thanks for the clarification.
>
>
>>
>>
>>>
>>> 4) We spotted "capital of Germany", but I didn't get any candidates
>>> when running "candidates" instead of "annotate".
>>>
>>
>> This might be due to the default confidence score. If you pass the extra
>> confidence param and set it to 0, you will probably see everything, e.g.
>> /candidates/?confidence=0&text=....
>> In fact, I suggest you look at all the candidates in the text you used,
>> to confirm (or not) what I've been saying here.
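>>
>> Something like this should dump everything (a sketch against the public
>> endpoint; I'm assuming your build honours the JSON Accept header,
>> otherwise you get XML back):
>>
>>     import requests
>>
>>     params = {"text": "Berlin was the capital of Prussia",
>>               "confidence": 0}
>>     r = requests.get("http://spotlight.dbpedia.org/rest/candidates",
>>                      params=params,
>>                      headers={"Accept": "application/json"})
>>     r.raise_for_status()
>>     print(r.json())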
>>
>
> I tried that, but I still didn't get any entity candidates for "capital
> of Germany".
>
>
>>
>>>
>>> 5) We are able to spot "1920s" as a surface form but not "1920".
>>>
>>
>> This is due to the generation/stemming of SFs we have been discussing,
>> but I'm not sure that is a bad example: "1920", if used as a year, might
>> not mean the same as "1920s".
>>
>
> This was my mistake.
>
>
>>
>>
>
>>> A few more questions:
>>> 1) Are we trying to annotate every word, noun, or entity (e.g. proper
>>> noun) in the raw text? Because in the above link I found "documented" (a
>>> word, not a noun or entity) annotated with
>>> http://dbpedia.org/resource/Document.
>>>
>>>
>> There are two main spotters: the default one uses a finite-state
>> automaton generated from the surface form store to match incoming words
>> as valid sequences of states (so in this sense everything goes through
>> the pipeline); the other is an OpenNLP spotter that gets SFs from an NE
>> extractor. Both might generate single-noun n-grams. In this case, it
>> could be that there is a link in Wikipedia "documented" -> Document,
>> which would introduce "documented" as a valid state in the FSA.
>>
>>
>>> 2) Are we using surface forms to deal only with syntactic references
>>> (e.g. the surface form "municipality" referring to "Municipality
>>> <http://dbpedia.org/page/Municipality>", "Metropolitan_municipality
>>> <http://dbpedia.org/page/Metropolitan_municipality>" or
>>> "Municipalities_of_Mexico
>>> <http://dbpedia.org/page/Municipalities_of_Mexico>"), or with both
>>> syntactic and semantic references (e.g. aliases like "Third Reich"
>>> referring to "Nazi Germany <http://dbpedia.org/page/Nazi_Germany>")?
>>>
>>
>> Not sure I get it, but everything is generated in a bottom-up kind of
>> way, so the fact that "municipality" can be associated with the entities
>> (not the words) Municipality, Metropolitan_municipality or
>> Municipalities_of_Mexico solely derives from the presence of Wikipedia
>> links using that word to refer to those entities. Same thing for the
>> other example: if "Third Reich" and "Nazi Germany" are used to refer to a
>> certain period within German history, then that is so by virtue of the
>> links. Whether you can use this as a way to explain synonyms, I'm not
>> sure; it would make an interesting research project.
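>>
>> In other words, the store could in principle be rebuilt from nothing but
>> (anchor text, link target) pairs, roughly like this toy sketch:
>>
>>     from collections import Counter, defaultdict
>>
>>     # hypothetical pairs from wikitext links, e.g.
>>     # [[Nazi Germany|Third Reich]] -> ("Third Reich", "Nazi_Germany")
>>     wikilinks = [
>>         ("Third Reich", "Nazi_Germany"),
>>         ("Nazi Germany", "Nazi_Germany"),
>>         ("municipality", "Municipality"),
>>         ("municipality", "Municipalities_of_Mexico"),
>>     ]
>>
>>     sf_to_entities = defaultdict(Counter)
>>     for anchor, target in wikilinks:
>>         sf_to_entities[anchor.lower()][target] += 1
>>
>>     print(dict(sf_to_entities["municipality"]))
>>     # -> {'Municipality': 1, 'Municipalities_of_Mexico': 1}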
>>
>>
>>> I am working on generating extra possible surface forms from a
>>> canonical surface form, or from the entity itself, to deal with unseen
>>> SF association problems.
>>> I have also started working on my proposal and will submit it soon.
>>>
>>
>> Well done and thanks for the effort.
>> Thanks, Thiago
>>
>>>
>>> Thanks,
>>> Abhishek
>>>
>>
> Besides this, I also tried to annotate data with
> http://dbpedia-spotlight.github.io/demo/ as suggested by Joachim, but
> there are still some issues with disambiguation:
> 1) "First" is annotated with "World War I". - Disambiguation Problem
> 2) "world" is not annotated. - Spotter Problem
> 3) "status" is not annotated, although it is annotated by the Lucene-based
> version. - Spotter Problem
> 4) "Germany" is annotated with "Weimar Republic" instead of "Germany". -
> Disambiguation Problem
>
> I think the above problems could be resolved if we extract only nouns or
> noun phrases as surface forms and use a good disambiguator (e.g.
> graph-based); a sketch of the noun-filtering idea follows the link below.
> I have also prepared an initial draft of my application, which explains
> my approach to resolving these types of issues and suggests two
> disambiguation approaches. Please review it and give your suggestions so
> that I can improve it.
> Initial Draft:
> https://docs.google.com/document/d/1U4BvJpGUvL2odVA6VxnYggfEX_hmLSYP4yqhXB7dLQU/edit
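>
> A first cut of that noun-filtering idea could look like this (a sketch
> using NLTK's off-the-shelf tagger; the real pipeline would filter SF
> spans, not single tokens):
>
>     import nltk
>     # assumes punkt + averaged_perceptron_tagger are downloaded
>     # via nltk.download(...)
>
>     text = "In 1990 German reunification took place in whole Germany"
>     tagged = nltk.pos_tag(nltk.word_tokenize(text))
>
>     # keep only nouns/proper nouns as spotting candidates
>     print([tok for tok, tag in tagged if tag.startswith("NN")])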
>
> Thanks,
> Abhishek
>