Hi Abhishek, thanks for the work, here are some answers:

> Hi Thiago,
> Sorry for the delay!
> I have set up the Spotlight server and it is running perfectly fine, but
> with minimal settings. After this setup I played with the Spotlight server,
> during which I came across some discrepancies, as follows:
> Example taken:
> http://spotlight.dbpedia.org/rest/annotate?text=First documented in the
> 13th century, Berlin was the capital of the Kingdom of Prussia (1701–1918),
> the German Empire (1871–1918), the Weimar Republic (1919–33) and the Third
> Reich (1933–45). Berlin in the 1920s was the third largest municipality in
> the world. In 1990 German reunification took place in whole Germany in
> which the city regained its status as the capital of Germany.
> 1) If we run this we annotate "13th Century" to
> "http://dbpedia.org/page/19th_century". This might be happening because
> the context is very much from 19th century and moreover in "13th Century"
> and "19th Century" there is minimal syntactic difference (one letter).
> But I am not sure whether this is good or bad.

This might be due to either "13th Century" being wrongly linked to 19th
century, or maybe the word "century" being linked to many different
centuries which then causes a disambiguation error due to the context. I
think your example is a counter-example to the way we generate the data
structures used for disambiguation.

> In my opinion if we have an entity in our store (
> http://dbpedia.org/page/13th_century) which is perfectly matching with
> surface form in raw text ("13th Century") we should have annotated SF to
> the entity.
> And same might be the case with "Germany" which is associated to "History
> of Germany <http://dbpedia.org/page/History_of_Germany>" not "Germany
> <http://dbpedia.org/page/Germany>".

In this case other factors might have crept in; it could be that Germany
has a larger number of inlinks or some other metric that allows it to
overtake the most natural candidate.

> 2) We are spotting "place" and associating it with "Portland Place
> <http://dbpedia.org/resource/Portland_Place>", maybe due to stemming SF.
> And even "Location (geography)
> <http://dbpedia.org/page/Location_(geography)>" is not the correct entity
> type for this. This is because we are not able to detect the sense of the
> word "place" itself. So for that we may have to use word senses like from
> Wordnet etc.

The SF spotting pipeline works a bit like this: you get a candidate SF,
like 'Portland Place', and see if there's a candidate for that, but you also
consider n-gram subparts, so it could have retrieved the candidates
associated with "place" instead.
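
A rough sketch of the n-gram lookup described above, in Python. The store
contents and the function names are made up for illustration and are not
the actual Spotlight code; the point is only that a text containing the
single token "place" can still reach the candidates recorded for "place".

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical surface-form store: lowercased surface form -> candidate entities.
SF_STORE: Dict[str, List[str]] = {
    "portland place": ["Portland_Place"],
    "place": ["Portland_Place", "Location_(geography)"],
}


def ngrams(tokens: List[str], max_n: int) -> Iterator[str]:
    """Yield every n-gram of the token list as text, longest n first."""
    for n in range(min(max_n, len(tokens)), 0, -1):
        for start in range(len(tokens) - n + 1):
            yield " ".join(tokens[start:start + n])


def spot(tokens: List[str], store: Dict[str, List[str]],
         max_n: int = 3) -> List[Tuple[str, List[str]]]:
    """Return (surface form, candidates) for each n-gram found in the store."""
    hits = []
    for text in ngrams(tokens, max_n):
        candidates = store.get(text.lower())
        if candidates:
            hits.append((text, candidates))
    return hits


tokens = "German reunification took place in whole Germany".split()
hits = spot(tokens, SF_STORE)
for surface_form, candidates in hits:
    print(surface_form, "->", candidates)
```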

> 3) We are detecting ". Berlin" as a surface form. But I couldn't find out
> where this SF comes from. And I suspect this SF doesn't come from
> Wikipedia.

Although ". Berlin" is highlighted, the entity is matched on "Berlin"; the
extra space and punctuation come from the way we tokenize sentences. We
chose a tokenizer based on a break iterator for speed and language
independence, but it hasn't been tested very well. This is the area that
explains this mistake, and help with it is much appreciated.

> 4) We spotted "capital of Germany" but I didn't get any candidates if we
> run for "candidates" instead of "annotate".

This might be due to a default confidence score. If you pass the extra
confidence param and set it to 0, you will probably see everything, e.g.
In fact, I suggest you look at all the candidates in the text you used to
confirm (or not) what I've been saying here.
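
As a sketch of such a request, the Python snippet below only builds the URL
and does not send it. The host is the one from your example, the
"candidates" endpoint is the one you mention, and I am assuming the
parameter is spelled "confidence"; the helper name build_url is made up.

```python
from urllib.parse import urlencode

# Host taken from the example earlier in this thread.
BASE = "http://spotlight.dbpedia.org/rest"


def build_url(endpoint: str, text: str, confidence: float) -> str:
    """Build a request URL; 'confidence' is the assumed parameter name."""
    query = urlencode({"text": text, "confidence": confidence})
    return f"{BASE}/{endpoint}?{query}"


# confidence=0 should disable the score threshold, so everything shows up.
url = build_url("candidates", "Berlin was the capital of Germany.", 0)
print(url)
```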

> 5) We are able to spot "1920s" as a surface form but not "1920".

This is due to the generation/stemming of SFs we have been discussing, but
I'm not sure that is a bad example. 1920, if used as a year, might not mean
the same as 1920s.

> Few more questions:
> 1) Are we trying to annotate every word, noun or entity(e.g. proper noun)
> in raw text? Because in the above link I found "documented" (a word not a
> noun or entity) annotated to "http://dbpedia.org/resource/Document".

There are two main spotters: the default one, which uses a finite state
automaton generated from the surface form store to match incoming words as
a valid sequence of states (so in this sense everything goes through the
pipeline), and another that uses an OpenNLP spotter that gets SFs from an NE
extractor. Both might generate single-noun n-grams. In this case, it could
be that there is a link in Wikipedia "documented" -> Document, which might
introduce "documented" as a valid state in the FSA.
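
To make the FSA idea concrete, here is a toy Python sketch. It uses a
token-level trie as a stand-in for the automaton and keeps the longest
match; this is my simplification and not the actual Spotlight
implementation, and the surface forms in the example are made up.

```python
from typing import Dict, List

END = object()  # sentinel key marking an accepting state


def build_trie(surface_forms: List[str]) -> Dict:
    """Build a token-level trie; each path to END is a known surface form."""
    root: Dict = {}
    for sf in surface_forms:
        node = root
        for token in sf.lower().split():
            node = node.setdefault(token, {})
        node[END] = True
    return root


def spot(tokens: List[str], trie: Dict) -> List[str]:
    """Scan left to right, emitting the longest accepted token sequence."""
    spots, i = [], 0
    while i < len(tokens):
        node, j, last_end = trie, i, None
        while j < len(tokens) and tokens[j].lower() in node:
            node = node[tokens[j].lower()]
            j += 1
            if END in node:
                last_end = j  # remember the longest accepting position
        if last_end is not None:
            spots.append(" ".join(tokens[i:last_end]))
            i = last_end
        else:
            i += 1
    return spots


# "documented" is spotted only because it is present in the (toy) store.
trie = build_trie(["documented", "capital", "capital of Germany"])
tokens = "First documented as the capital of Germany".split()
spots = spot(tokens, trie)
print(spots)
```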

> 2) Are we using surface forms to deal with only syntactic references (e.g.
> surface form "municipality" referring to "Municipality
> <http://dbpedia.org/page/Municipality>" or "Metropolitan_municipality
> <http://dbpedia.org/page/Metropolitan_municipality>" or "
> Municipalities_of_Mexico
> <http://dbpedia.org/page/Municipalities_of_Mexico>") or both, syntactic
> and semantic references (e.g. aliases like "Third Reich" referring to "Nazi
> Germany <http://dbpedia.org/page/Nazi_Germany>")?

Not sure I get it, but everything is generated in a bottom-up kind of way,
so the fact that "municipality" can be associated with the entities (not
the words) Municipality, Metropolitan_municipality, or
Municipalities_of_Mexico derives solely from the presence of Wikipedia
links using that word to refer to those entities. Same thing for the other
example. If "Third Reich" and "Nazi Germany" are used to refer to a certain
period within German history, then that is so in virtue of the links.
Whether you can use this as a way to explain synonyms, I'm not sure. It
would make an interesting research project.
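
A small Python sketch of what "bottom up" means here: candidates for a
surface form are whatever entities the link anchors point to, weighted by
how often. The link pairs and counts below are invented toy data, and the
function names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

# Toy (anchor text, target entity) pairs, as if extracted from wiki links.
links = [
    ("municipality", "Municipality"),
    ("municipality", "Municipality"),
    ("municipality", "Metropolitan_municipality"),
    ("municipality", "Municipalities_of_Mexico"),
    ("Third Reich", "Nazi_Germany"),
    ("Nazi Germany", "Nazi_Germany"),
]


def build_candidate_map(pairs: List[Tuple[str, str]]) -> Dict[str, Counter]:
    """Count how often each anchor text links to each entity."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for anchor, entity in pairs:
        counts[anchor][entity] += 1
    return counts


def candidates(counts: Dict[str, Counter], sf: str) -> List[Tuple[str, float]]:
    """Return (entity, relative frequency) for a surface form, most linked first."""
    entity_counts = counts.get(sf, Counter())
    total = sum(entity_counts.values())
    return [(e, c / total) for e, c in entity_counts.most_common()]


counts = build_candidate_map(links)
print(candidates(counts, "municipality"))
print(candidates(counts, "Third Reich"))
```

Note that "Third Reich" and "Nazi Germany" end up pointing at the same
entity only because the (toy) links say so, not because of any notion of
synonymy in the code.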

> I am working on generating extra possible surface forms from
> a canonical surface form or the entity itself to deal with unseen SF
> association problems.
> I have also started working on my proposal and will submit it soon.

Well done and thanks for the effort.
Thanks, Thiago

Thanks,
Abhishek
On Thu, Mar 12, 2015 at 8:20 PM, Thiago Galery wrote:
>> Hi Abhishek, thanks for the contribution. Your suggestions are pretty
>> much aligned with what we were thinking in any event, and the initial plan
>> seems good.
>> On the assumption that there's some code that generates extra possible
>> surface forms from a canonical surface form, like your 'Michael Jordan' ->
>> 'M. Jordan', 'Jordan' and so on example, it would be worth looking in the
>> literature on Machine Translation on how to establish some score for the
>> surface form. That is, if you spot 'M Jordan' in the text, what is the
>> probability of it being a translation of the canonical name 'Michael
>> Jordan'? If there's a simple way to implement this, we could try to get
>> the raw data with counts, generate some extra SFs in a principled manner and
>> use that to calculate probabilities. Still, for the moment, I'd focus on
>> setting the Spotlight server up and playing with the warm-up tasks.
>> Thanks for the good work,
>> Thiago
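
For reference, a toy Python sketch of the 'Michael Jordan' -> 'M. Jordan',
'Jordan' idea quoted above. The variant rules and the counts are invented,
and the score is a plain relative-frequency estimate of
P(variant | canonical name), used here as a simple stand-in for whatever
the Machine Translation literature would suggest.

```python
from typing import Dict, List


def variants(canonical: str) -> List[str]:
    """Generate a few abbreviated surface forms from a two-or-more-part name."""
    parts = canonical.split()
    if len(parts) < 2:
        return []
    first, last = parts[0], parts[-1]
    return [f"{first[0]}. {last}", f"{first[0]} {last}", last]


def score(variant_counts: Dict[str, int]) -> Dict[str, float]:
    """Relative frequency of each variant among all observed variants."""
    total = sum(variant_counts.values())
    if total == 0:
        return {v: 0.0 for v in variant_counts}
    return {v: c / total for v, c in variant_counts.items()}


generated = variants("Michael Jordan")
# Hypothetical counts of how often each variant referred to the entity.
toy_counts = {"M. Jordan": 3, "M Jordan": 1, "Jordan": 6}
probabilities = score({v: toy_counts.get(v, 0) for v in generated})
print(generated)
print(probabilities)
```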
