Hi Dileepa,

Congratulations again on your GSoC proposal. It's quite clear and well explained. Please find some thoughts on your mail inline:

On 10/06/13 11:56, Dileepa Jayakody wrote:
Hi All,

I have started working on my GSoC project: FOAF co-reference based entity
disambiguation engine
<http://www.google-melange.com/gsoc/proposal/review/google/gsoc2013/dileepaj/1>.
I spent last week mainly reading papers on entity disambiguation and the
Stanbol documentation, especially on the enhancement structure [1], to gain
an overall idea of what I have to do in my project. I also looked at the
previous work on the Solr MLT based entity disambiguation engine by Kritarth,
last year's GSoC student, and the related mail thread on stanbol-dev [2].

I would like to formulate a project plan based on the to-dos in my project
and by incorporating all your suggestions and advice. I would very much like
your ideas, suggestions and pointers to relevant docs to enhance my
knowledge in the process.

My high-level understanding of the Stanbol enhancement process is:

1. Parsing content
2. Content type and language detection
3. Named entity recognition (extract persons, organizations, places)
against a knowledge base or entity index where we have a known set of
entities (EntityHub)
4. List all suggested entities with a confidence (an identified noun phrase
could refer to multiple entities)
     4.1. Group/cluster entities based on detected 'Named entities'
     4.2. Disambiguate entities
5. Show results

That could be the right workflow. In order to reflect the disambiguation results in the Enhancement Structure, please consider following Rupert's comments at [STANBOL-1037].

*To-Do's in my project*

I intend to use the existing SolrMLT based disambiguation engine as a base
for my project since it's developed to work with any custom vocabulary. In
my project this vocabulary is FOAF. As per my understanding if I can
configure an *entity index* with SolrMLT based engine, then it can perform
disambiguation using that index. Currently the entity index used is DBpedia
(please correct me if I'm wrong).
In a previous mail-thread on the MLTbased engine it's mentioned:
"SolrMLT disambiguation Engine is based on the SimilarityConstraint
supported by FieldQuery interface implemented by the Stanbol Entityhub."
Can I use/extend the FieldQuery interface for my FOAF based engine as well?
I look forward to your guidance on this.

I would need to take a deeper look at the Disambiguation-MLT engine, but I would say that just reusing or extending the FieldQuery interface wouldn't be enough. AFAIK Disambiguation-MLT uses the Solr MLT feature to "compare" (it is actually a text similarity measure) the context of the entities in the ContentItem with a configured field within the EntityHub. I think that for DBpedia it was using the entities' short abstracts. According to your proposal, you plan to use exact matching between FOAF properties (familyName, givenName...) and keywords in the ContentItem. So you might not want to use term-frequency similarity approaches for that, because they wouldn't work well when comparing text windows against keywords. Maybe a co-occurrence analysis approach would fit better. In that sense, your problem is closely related to Word Sense Disambiguation, and maybe some techniques from that field can be applied.
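
To make the co-occurrence idea a bit more concrete, here is a minimal Python sketch (it is not Stanbol code). It counts how many of a candidate's FOAF literal values appear within a token window around the mention. The candidate URIs, the property values, the `window` size and all function names are assumptions made up for illustration only.

```python
import re

def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def contains(tokens, phrase):
    """True if the token list `phrase` occurs contiguously in `tokens`."""
    n = len(phrase)
    return n > 0 and any(tokens[i:i + n] == phrase
                         for i in range(len(tokens) - n + 1))

def cooccurrence_score(mention, text, properties, window=10):
    """Count the property values found within `window` tokens of the mention."""
    tokens = tokenize(text)
    m = tokenize(mention)
    if not m:
        return 0
    positions = [i for i in range(len(tokens) - len(m) + 1)
                 if tokens[i:i + len(m)] == m]
    score = 0
    for value in properties.values():
        phrase = tokenize(value)
        for pos in positions:
            end = pos + len(m)
            # Context around the mention; None stands in for the mention itself
            context = tokens[max(0, pos - window):pos] + [None] + tokens[end:end + window]
            if contains(context, phrase):
                score += 1
                break  # count each property value at most once
    return score

def rank(mention, text, candidates):
    """Order candidate URIs by descending co-occurrence score."""
    return sorted(candidates,
                  key=lambda uri: cooccurrence_score(mention, text, candidates[uri]),
                  reverse=True)

# Hypothetical candidates for the mention "Smith" (illustrative data only)
candidates = {
    "urn:example:jane-smith": {"givenName": "Jane", "familyName": "Smith", "org": "Initech"},
    "urn:example:john-smith": {"givenName": "John", "familyName": "Smith", "org": "Acme"},
}
text = "John Smith joined Acme last year. Smith previously worked in London."
best = rank("Smith", text, candidates)[0]
```

A score like this could be combined with the confidence that the linking engine already assigns, rather than replacing it.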

You also propose to use the relationships within the FOAF social graph for disambiguation. In my opinion, such an approach can be generalized to any graph-based Knowledge Base like DBpedia or Freebase. There is also another GSoC proposal planning to explore a graph-based disambiguation engine, so it would be great if the two of you could collaborate on this.
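
As a rough illustration of what I mean by using the graph, the following Python sketch scores each candidate by how many of its neighbours (here via a foaf:knows-like relation, treated as symmetric) are also among the entities detected in the same document. The graph, the URIs and the function names are hypothetical, and a real engine would of course read the relations from the Knowledge Base.

```python
def graph_score(candidate, context_entities, knows):
    """Number of context entities directly linked to the candidate."""
    outgoing = knows.get(candidate, set())
    # Also follow links pointing at the candidate (relation treated as symmetric)
    incoming = {subj for subj, objs in knows.items() if candidate in objs}
    return len((outgoing | incoming) & set(context_entities))

def disambiguate(candidates, context_entities, knows):
    """Pick the candidate with the most links into the document context."""
    return max(candidates, key=lambda c: graph_score(c, context_entities, knows))

# Hypothetical "knows" graph: subject URI -> set of object URIs
knows = {
    "ex:john1": {"ex:alice"},
    "ex:bob": {"ex:john1"},
    "ex:john2": {"ex:carol"},
}
# Entities already resolved elsewhere in the same ContentItem (assumed)
context = {"ex:alice", "ex:bob"}
chosen = disambiguate(["ex:john1", "ex:john2"], context, knows)
```

Nothing here is specific to FOAF, which is why I think the same idea would carry over to DBpedia or Freebase relations.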

In my project I will mainly need to carry out the following tasks, as per my
current understanding:

1. Creating an EntityIndex capable of indexing a FOAF dataset.
The underlying EntityHub Site could be DBpedia, Freebase, OpenLink,
foaf-search [3] or any FOAF data source.
(I'm wondering whether it's a good idea to integrate foaf-search as an
entity index in Stanbol, since it covers a large FOAF dataset and offers a
REST API to access data. WDYT?) I also think there are many websites exposing
their contact data as FOAF (e.g. http://iwlearn.net/, opera-community).
Therefore it would be great to develop the FOAF EntityIndex as generically as
possible, decoupled from the underlying Site. I look forward to the necessary directions
and your help to create a generic architecture for this FOAF based entity
index.

One important issue to consider here is the availability of the disambiguation data. I mean, where is the FOAF data going to be stored? Are you planning to retrieve it 'on the fly', or should it be in a local Knowledge Base? Retrieving live data could be quite inefficient, and you would be relying on third-party services. So, if you are going to store the FOAF data locally, you need to decide how you are going to do it. Maybe for the DBpedia EntityHub site you can configure the Stanbol indexing tool to also harvest the FOAF information or, as you have said, you can create your own custom site with FOAF information from other resources.

2. Developing the disambiguation algorithm.
My proposal is based on the FOAF-coreference based disambiguation algorithm
mentioned in the paper [4].
Later I came to know about concepts such as FOAF scuttering and smushing
[5] for FOAF based disambiguation.
I need to design a suitable algorithm to disambiguate over FOAF entities.
There are many research papers on machine learning co-reference techniques
for disambiguation. I look forward to your input on this.

Let me take a look at the paper. It seems promising :-)
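
Regarding smushing, my understanding is that it merges FOAF descriptions that share a value for an inverse functional property (foaf:mbox, foaf:mbox_sha1sum, foaf:homepage). A minimal Python sketch of that idea follows, using a small union-find over plain dictionaries. The input format, the node ids and the `smush` function are assumptions for illustration, not the algorithm from the paper.

```python
# Inverse functional FOAF properties used as merge keys (short names, no namespace)
IFPS = ("mbox", "mbox_sha1sum", "homepage")

def smush(descriptions):
    """Group node ids whose descriptions share any inverse functional property value.

    descriptions: dict mapping node id -> dict of property name -> value.
    Returns a list of sets of node ids judged to describe the same agent.
    """
    parent = {node: node for node in descriptions}

    def find(node):
        # Follow parent links to the group representative, compressing the path
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    first_seen = {}  # (property, value) -> first node carrying that value
    for node, props in descriptions.items():
        for ifp in IFPS:
            if ifp in props:
                key = (ifp, props[ifp])
                if key in first_seen:
                    parent[find(node)] = find(first_seen[key])
                else:
                    first_seen[key] = node

    groups = {}
    for node in descriptions:
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())

# Hypothetical FOAF descriptions collected from different documents
descriptions = {
    "a": {"name": "J. Smith", "mbox": "mailto:js@example.org"},
    "b": {"name": "John Smith", "mbox": "mailto:js@example.org",
          "homepage": "http://example.org/~js"},
    "c": {"name": "John", "homepage": "http://example.org/~js"},
    "d": {"name": "John Smith", "mbox": "mailto:other@example.org"},
}
merged = sorted(sorted(group) for group in smush(descriptions))
```

Note that "d" stays separate even though the name matches, since names are not inverse functional. That is exactly the case where the co-reference rules or learned models would need to help.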

In general it would be great to receive as many ideas and pointers as
possible on my project to help formulate a project plan.

Regards,
Dileepa

[1]
http://stanbol.apache.org/docs/trunk/components/enhancer/enhancementstructure.htm
[2]
http://markmail.org/message/aubyruemy324o7ut#query:+page:1+mid:do5xwwfs3333w72c+state:results
[3] http://www.foaf-search.net/
[4] Jennifer Sleeman and Tim Finin. Computing FOAF Co-reference Relations
with Rules and Machine Learning. In Proceedings of the Third
International Workshop on Social Data on the Web.
[5] http://wiki.foaf-project.org/w/Smushing



