Hey,

Thanks for pointing me to that! I've managed to download and build the code
and started diving in yesterday, starting from the DBTwoStepDisambiguator.
Here's how I understand it so far:


The DBTwoStepDisambiguator first searches for all possible candidates in
the input text using the DBCandidateSearcher. If there are too many
candidates, it does an initial filtering of the candidates by their prior
probabilities, where a candidate's prior is its support divided by the
total number of annotated occurrences of the surface form. This is the
first step.
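Just to check my understanding of that prior, here's a tiny sketch (the function name and numbers are mine, not Spotlight's):

```python
# Illustrative sketch of the candidate prior used for the initial filtering:
# P(entity | surface form) ~ count(surface form annotated as this entity)
#                            / count(surface form annotated at all)
def candidate_prior(candidate_support: int, total_annotated: int) -> float:
    """Prior probability of a candidate entity for a given surface form."""
    if total_annotated == 0:
        return 0.0
    return candidate_support / total_annotated

# e.g. a surface form annotated 700 times as this entity out of 1000 annotations
print(candidate_prior(700, 1000))  # 0.7
```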

Next, the best candidate is selected in the disambiguation step. A
ContextSimilarity instance is queried with a sequence holding the context
token types and the set of candidates that was determined in the first
step. From what I can see, the GenerativeContextSimilarity class is
generally used, which in turn holds a ContextStore with the actual vectors.
This returns a map that gives a probability/score for each resource
candidate, roughly corresponding to P(context | entity). From what I can
see, this is also the main place where a vector model is relevant.
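My mental model of that generative score, sketched out (the dict-based token counts and add-one smoothing here are my stand-ins, not what the ContextStore actually does):

```python
import math

# Hypothetical sketch of a generative context score, roughly log P(context | entity):
# each entity has a token-count distribution; the score is the sum of
# log-probabilities of the context tokens under that distribution, with
# add-one smoothing standing in for whatever smoothing the real store uses.
def context_log_prob(context_tokens, token_counts, vocab_size):
    total = sum(token_counts.values())
    score = 0.0
    for tok in context_tokens:
        p = (token_counts.get(tok, 0) + 1) / (total + vocab_size)
        score += math.log(p)
    return score

# made-up token counts for one candidate entity
berlin = {"germany": 50, "capital": 30, "city": 20}
print(context_log_prob(["capital", "germany"], berlin, vocab_size=10000))
```

A context whose tokens match the entity's distribution gets a higher (less negative) score than a mismatched one, which is what the disambiguation step needs to rank candidates.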

Then there are some probability calculations to determine the similarity
scores for each candidate and the NIL entity: these are given as a mixture
(UnweightedMixture, it seems) of several probabilities for each candidate:
the entity prior P(e), the conditional probability of the context given the
entity P(c | e), and a conditional probability relating the candidate
support to the candidate resource support, P(s | e) (I'm not sure what this
is exactly). The scores are then normalized using a softmax. Finally, each
surface form's candidate entity with the highest score is chosen and
returned.
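To make sure I'm reading the scoring right, here's how I picture it end to end (the averaging of log-probabilities and all the numbers are my guesses at what UnweightedMixture does, not a verified spec):

```python
import math

def softmax(scores):
    """Normalize a list of scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def mix(log_prior, log_context, log_support):
    # My reading of an unweighted mixture: a plain average of the component
    # log-probabilities (the real combination may differ).
    return (log_prior + log_context + log_support) / 3

# made-up component scores for two candidates of one surface form
candidates = {
    "Berlin_(city)": mix(math.log(0.8), -11.0, math.log(0.9)),
    "Berlin_(band)": mix(math.log(0.2), -15.0, math.log(0.1)),
}
probs = softmax(list(candidates.values()))
best = max(zip(candidates, probs), key=lambda kv: kv[1])[0]
print(best)  # Berlin_(city)
```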


Is this a fair assessment of what's going on? There are some details that I'm
not 100% sure about, but all in all I have an idea of what would need to be
done to integrate new vector models as a replacement for the simple context
vector model.

Should I ask more questions here, or go ahead and open a proposal on
Melange with what my initial plan would be?

Cheers,
Philipp

On Fri, Mar 20, 2015 at 10:41 AM, Joachim Daiber <daiber.joac...@gmail.com>
wrote:

> Hi Philipp,
>
> we are very interested in topic 5.15, which I think can be tackled quite
> nicely in the GSoC format. I would suggest you submit a proposal to
> Melange, so we can give feedback there. In general, for this project, it
> would be good to familiarize yourself with the way the disambiguation is
> performed in Spotlight (currently with 'dumb' context vectors; this is
> probably a good starting point in the code: [1]).
>
> Best,
> Jo
>
> [1]
> https://github.com/dbpedia-spotlight/dbpedia-spotlight/blob/master/core/src/main/scala/org/dbpedia/spotlight/db/DBTwoStepDisambiguator.scala
>
> On Tue, Mar 17, 2015 at 4:31 PM, Philipp Dowling <
> philipp.dowl...@gmail.com> wrote:
>
>> Hey everyone,
>>
>> My name is Philipp, I'm from Germany and I'm happy to meet you guys! I
>> mostly do work in computational linguistics and NLP, so DBpedia was one of
>> the most interesting projects in GSoC this year for me. My main strengths
>> and/or interests are continuous space vector models, neural networks and
>> information retrieval.
>>
>> A little bit about me: I'm just about finished with my undergrad studies
>> in Munich, and will start my masters next. Most recently, I was in Hong
>> Kong for my B.Sc. thesis, conducting research on semantic MT evaluation. I
>> also work at a local startup, building data mining and knowledge discovery
>> systems.
>>
>> To be more specific about my interests for GSoC: I'm most interested in
>> tasks 5.15, 5.1, 5.9 and 5.12 (roughly in that order).
>> 5.15 especially overlaps a lot with research I've been doing for my
>> thesis, where I investigated the performance of continuous space models
>> such as Word2Vec as a replacement for discrete context vectors, with very
>> positive results. I got very familiar with different vector models from
>> this, and would love to now continue working on something like this in a
>> knowledge mining context.
>> I also got exposed to frame semantics a little in the same context, and
>> I'm currently working on knowledge mining, so 5.1 would also be a very
>> interesting project.
>> I'll come back with more specific questions when I've gotten a chance to
>> look at everything else in detail, but overall I'm very excited to start
>> getting to work!
>>
>> I'll get into some of the warm up tasks as soon as I get a chance. I
>> haven't worked with DBpedia much before, so it'll be interesting to dive
>> into the code base.
>>
>> Cheers,
>> Philipp Dowling
>>
>>
>> ------------------------------------------------------------------------------
>> Dive into the World of Parallel Programming The Go Parallel Website,
>> sponsored
>> by Intel and developed in partnership with Slashdot Media, is your hub
>> for all
>> things parallel software development, from weekly thought leadership
>> blogs to
>> news, videos, case studies, tutorials and more. Take a look and join the
>> conversation now. http://goparallel.sourceforge.net/
>> _______________________________________________
>> Dbpedia-gsoc mailing list
>> Dbpedia-gsoc@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/dbpedia-gsoc
>>
>>
>