Hi All,

I have started working on my GSoC project: a FOAF co-reference based entity
disambiguation engine
<http://www.google-melange.com/gsoc/proposal/review/google/gsoc2013/dileepaj/1>.
Last week I mainly read papers on entity disambiguation and the Stanbol
documentation, especially the page on the enhancement structure [1], to get
an overall idea of what I have to do in my project. I also looked at the
previous work on the Solr MLT based entity disambiguation engine by Kritarth,
last year's GSoC student, and the related mail thread on stanbol-dev [2].

I would like to formulate a project plan based on the to-do items below,
incorporating all your suggestions and advice. I would very much appreciate
your ideas, suggestions and pointers to relevant docs to improve my
knowledge along the way.

My understanding of the Stanbol enhancement process is:

1. Parsing content
2. Content type and language detection
3. Named entity recognition (extract persons, organizations, places)
against a knowledge base or entity index where we have a known set of
entities (EntityHub)
4. List all suggested entities with a confidence (an identified noun or
phrase could refer to multiple entities)
    4.1. Group/cluster entities based on detected 'Named entities'
    4.2. Disambiguate entities
5. Show results
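
To check my understanding of steps 4 to 4.2, here is a minimal Python sketch.
It is not Stanbol code: the Suggestion class and the function names are
hypothetical, and "take the highest confidence candidate per mention" is only
a placeholder for a real disambiguation step.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Suggestion:
    """Hypothetical record for one suggested entity (not a Stanbol type)."""
    mention: str       # surface form detected in the text
    entity_uri: str    # candidate entity from the entity index
    confidence: float  # score assigned to this candidate


def group_by_mention(suggestions):
    """Step 4.1: cluster candidate entities by the mention they belong to."""
    groups = defaultdict(list)
    for suggestion in suggestions:
        groups[suggestion.mention].append(suggestion)
    return groups


def disambiguate(suggestions):
    """Step 4.2 (placeholder): keep the best-scored candidate per mention."""
    return {
        mention: max(candidates, key=lambda s: s.confidence)
        for mention, candidates in group_by_mention(suggestions).items()
    }
```

A real engine would adjust the confidence values using context before
choosing, rather than simply taking the maximum.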

*To-Do's in my project*

I intend to use the existing SolrMLT based disambiguation engine as the base
for my project, since it was developed to work with any custom vocabulary. In
my project this vocabulary is FOAF. As I understand it, if I can configure
an *entity index* for the SolrMLT based engine, it can perform
disambiguation using that index. The entity index currently used is DBpedia
(please correct me if I'm wrong).
A previous mail thread on the MLT based engine mentions:
"SolrMLT disambiguation Engine is based on the SimilarityConstraint
supported by FieldQuery interface implemented by the Stanbol Entityhub."
Can I use or extend the FieldQuery interface for my FOAF based engine as
well? I look forward to your guidance on this.

In my project I will mainly need to do the following tasks, as per my
current understanding:

1. Creating an EntityIndex capable of indexing a FOAF dataset.
The underlying EntityHub Site could be DBpedia, Freebase, OpenLink,
foaf-search [3] or any other FOAF data source.
(I'm wondering whether it's a good idea to integrate foaf-search as an
entity index in Stanbol, since it covers a large FOAF dataset and offers a
REST API to access the data. WDYT?) I also think there are many websites
exposing their contact data as FOAF (e.g. http://iwlearn.net/,
opera-community). Therefore it would be great to make the FOAF EntityIndex
as generic as possible, decoupled from the underlying Site. I look forward
to your directions and help in creating a generic architecture for this
FOAF based entity index.

2. Developing the disambiguation algorithm.
My proposal is based on the FOAF co-reference based disambiguation algorithm
described in the paper [4].
Later I came across concepts such as FOAF scuttering and smushing
[5] for FOAF based disambiguation.
I need to design a suitable algorithm to disambiguate FOAF entities.
There are many research papers on machine learning co-reference techniques
for disambiguation. I look forward to your input on this.
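
As a starting point for discussion, smushing [5] can be sketched as merging
FOAF descriptions that share a value for an inverse functional property. The
Python below is a minimal sketch under my own assumptions: descriptions are
plain dicts rather than RDF graphs, the property list is a small subset I
picked, and the function name is hypothetical. It is not the algorithm from
the paper [4].

```python
from collections import defaultdict

# Assumed subset of FOAF inverse functional properties used as identity keys.
IFPS = ("foaf:mbox", "foaf:mbox_sha1sum", "foaf:homepage")


def smush(descriptions):
    """Group node ids whose descriptions share any IFP value.

    descriptions: dict mapping node id -> dict of property -> value.
    Returns a list of sets of node ids, one set per presumed person.
    """
    parent = {node: node for node in descriptions}

    def find(node):
        # Union-find lookup with path halving.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    first_seen = {}  # (property, value) -> first node carrying that pair
    for node, props in descriptions.items():
        for prop in IFPS:
            value = props.get(prop)
            if value is None:
                continue
            key = (prop, value)
            if key in first_seen:
                # Same IFP value: treat both nodes as the same person.
                parent[find(node)] = find(first_seen[key])
            else:
                first_seen[key] = node

    clusters = defaultdict(set)
    for node in descriptions:
        clusters[find(node)].add(node)
    return list(clusters.values())
```

Merging is transitive here: if A and B share a mailbox and B and C share a
homepage, all three end up in one cluster. Whether that is desirable for
noisy FOAF data is one of the things I would like your opinion on.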

In general, it would be great to receive as many ideas and pointers as
possible on my project to help formulate a project plan.

Regards,
Dileepa

[1]
http://stanbol.apache.org/docs/trunk/components/enhancer/enhancementstructure.htm
[2]
http://markmail.org/message/aubyruemy324o7ut#query:+page:1+mid:do5xwwfs3333w72c+state:results
[3] http://www.foaf-search.net/
[4] Jennifer Sleeman and Tim Finin. Computing FOAF Co-reference Relations
with Rules and Machine Learning. In Proceedings of the Third
International Workshop on Social Data on the Web
[5] http://wiki.foaf-project.org/w/Smushing
