Hi Dileepa,
Congratulations again on your GSoC proposal. It's quite clear and well
explained. Please find some thoughts about your mail inline:
On 10/06/13 11:56, Dileepa Jayakody wrote:
Hi All,
I have started working on my GSoC project: FOAF co-reference based entity
disambiguation engine
<http://www.google-melange.com/gsoc/proposal/review/google/gsoc2013/dileepaj/1>.
Last week I mainly spent reading papers on entity disambiguation and the
Stanbol documentation, especially on the enhancement structure [1], to gain an
overall idea of what I have to do in my project. I also looked at the
previous work on the Solr MLT based entity disambiguation engine by Kritarth,
last year's GSoC student, and the related mail thread on stanbol-dev [2].
I would like to formulate a project plan based on the to-dos in my project
and by incorporating all your suggestions/advice. I would very much like
your ideas, suggestions and pointers to relevant docs to enhance my
knowledge in the process.
My high-level understanding of the Stanbol enhancement process is:
1. Parsing content
2. Content type, language detection
3. Named entity recognition (extract persons, organizations, places)
against a knowledge base or entity index where we have a known set of
entities (EntityHub)
4. List all suggested entities with a confidence (an identified noun or
phrase could refer to multiple entities)
4.1. Group/cluster entities based on detected 'Named entities'
4.2. Disambiguate entities
5. Show results
That could be the right workflow. To reflect the disambiguation results
in the Enhancement Structure, please consider following
Rupert's comments in [STANBOL-1037].
*To-Do's in my project*
I intend to use the existing SolrMLT based disambiguation engine as a base
for my project, since it was developed to work with any custom vocabulary. In
my project this vocabulary is FOAF. As per my understanding, if I can
configure an *entity index* with the SolrMLT based engine, then it can perform
disambiguation using that index. Currently the entity index used is DBpedia
(please correct me if I'm wrong).
In a previous mail thread on the MLT-based engine it is mentioned:
"SolrMLT disambiguation Engine is based on the SimilarityConstraint
supported by FieldQuery interface implemented by the Stanbol Entityhub."
Can I use/extend the FieldQuery interface for my FOAF-based engine as well?
I look forward to your guidance on this.
I would need to take a deeper look at the Disambiguation-MLT engine, but I
would say that just reusing or extending the FieldQuery interface would
not be enough. AFAIK Disambiguation-MLT uses the SolrMLT feature to
"compare" (actually it's a text similarity measure) the context of the
entities in the ContentItem with a configured field within the
EntityHub. I think that for DBpedia it was using the entities' short
abstracts. According to your proposal, you plan to use exact matching
between FOAF properties (familyName, givenName...) and keywords in the
ContentItem. So you might not want to use term-frequency similarity
approaches for that, because they would not work well when comparing
text windows against keywords. Maybe a co-occurrence analysis approach
would fit better. In that sense, your problem is closely related to Word
Sense Disambiguation, and maybe some techniques from that field can be
applied.
You also propose to use the relationships within the FOAF social graph
for disambiguation. In my opinion, such an approach can be generalized to
any graph-based knowledge base like DBpedia or Freebase. There is also
another GSoC proposal planning to explore a graph-based disambiguation
engine, so it would be great if the two of you could collaborate
on this.
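To make the two ideas above a bit more concrete (exact matching of FOAF properties against keywords, plus a bonus for related people mentioned in the same content), here is a minimal sketch. It is not Stanbol code and does not use the EntityHub API: `FoafPerson`, `score_candidate`, `disambiguate` and the `graph_weight` parameter are all hypothetical names chosen for illustration, and the scoring formula is just one simple assumption among many possible ones.

```python
from dataclasses import dataclass, field


@dataclass
class FoafPerson:
    """Hypothetical, simplified view of one FOAF person candidate."""
    uri: str
    properties: dict                          # e.g. {"givenName": "...", "familyName": "..."}
    knows: set = field(default_factory=set)   # URIs reachable via foaf:knows


def tokenize(text):
    """Lower-case the text and split it on non-alphanumeric characters."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return set(cleaned.split())


def score_candidate(candidate, context_tokens, other_mentions, graph_weight=0.5):
    """Count exact property matches, plus a weighted bonus for every
    foaf:knows neighbour that is also mentioned in the content."""
    property_hits = 0
    for value in candidate.properties.values():
        value_tokens = tokenize(value)
        # A property matches only if all of its tokens occur in the context.
        if value_tokens and value_tokens <= context_tokens:
            property_hits += 1
    graph_hits = len(candidate.knows & other_mentions)
    return property_hits + graph_weight * graph_hits


def disambiguate(candidates, context, other_mentions):
    """Return the candidates ordered from best to worst score."""
    tokens = tokenize(context)
    return sorted(
        candidates,
        key=lambda c: score_candidate(c, tokens, other_mentions),
        reverse=True,
    )
```

Here `other_mentions` is assumed to be the set of entity URIs already linked elsewhere in the same ContentItem. The graph bonus only looks at direct foaf:knows links; a real graph-based engine would probably need something more robust.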
As per my current understanding, my project will mainly involve the
following tasks:
1. Creating an EntityIndex capable of indexing a FOAF dataset.
The underlying EntityHub Site could be DBpedia, Freebase, OpenLink,
foaf-search [3] or any other FOAF data source.
(I'm wondering whether it's a good idea to integrate foaf-search as an
entity index in Stanbol, since it covers a large FOAF dataset and offers a
REST API to access the data. WDYT?) I also think there are many websites
exposing their contact data as FOAF (e.g. http://iwlearn.net/,
opera-community). Therefore it would be great to develop the FOAF EntityIndex
to be as generic as possible, decoupled from the underlying Site. I look
forward to your directions and your help in creating a generic architecture
for this FOAF-based entity index.
One important issue to consider here is the availability of the
disambiguation data. I mean, where is the FOAF data going to be stored? Are
you planning to retrieve it 'on the fly', or should it be in a local
Knowledge Base? Retrieving live data could be quite inefficient, and you
would be relying on third-party services. So, if you are going to store
the FOAF data locally, then you need to decide how you are going to do
it. Maybe for the DBpedia EntityHub site you can configure the indexing
tool in Stanbol to also harvest the FOAF information or, as you have
said, you can create your custom site with FOAF information from other
resources.
2. Developing the disambiguation algorithm.
My proposal is based on the FOAF co-reference based disambiguation algorithm
described in the paper [4].
Later I came to know about concepts such as FOAF scuttering and smushing
[5] for FOAF-based disambiguation.
I need to design a suitable algorithm to disambiguate FOAF entities.
There are many research papers on machine learning co-reference techniques
for disambiguation. I look forward to your inputs on this.
Let me take a look at the paper. It seems promising :-)
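As a rough illustration of the smushing idea mentioned in [5], the sketch below groups FOAF descriptions that share a value for an inverse functional property (foaf:mbox, foaf:mbox_sha1sum and foaf:homepage are declared as such in the FOAF vocabulary). It is only a sketch under assumptions: the input format (plain dictionaries keyed by a node id), the `IFP_KEYS` tuple and the `smush` function are hypothetical, and this is not the algorithm from the paper [4].

```python
from collections import defaultdict

# FOAF properties treated as inverse functional in this sketch.
IFP_KEYS = ("mbox", "mbox_sha1sum", "homepage")


def smush(descriptions):
    """Group node ids whose descriptions share any IFP value.

    descriptions: dict mapping node id -> dict of property name -> value.
    Returns a list of sets, one set of node ids per merged person.
    """
    parent = {node: node for node in descriptions}

    def find(node):
        # Follow parent links to the representative, halving the path.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_seen = {}  # (property, value) -> first node carrying that value
    for node, props in descriptions.items():
        for key in IFP_KEYS:
            value = props.get(key)
            if value is None:
                continue
            owner = first_seen.setdefault((key, value), node)
            union(owner, node)

    groups = defaultdict(set)
    for node in descriptions:
        groups[find(node)].add(node)
    return list(groups.values())
```

Merging is transitive here: if one description shares a mailbox with a second and the second shares a homepage with a third, all three end up in the same group. Whether that is desirable depends on how clean the harvested FOAF data is.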
In general, it would be great to receive as many ideas and pointers as
possible on my project, so that I can formulate a project plan.
Regards,
Dileepa
[1] http://stanbol.apache.org/docs/trunk/components/enhancer/enhancementstructure.htm
[2] http://markmail.org/message/aubyruemy324o7ut#query:+page:1+mid:do5xwwfs3333w72c+state:results
[3] http://www.foaf-search.net/
[4] Jennifer Sleeman and Tim Finin. Computing FOAF Co-reference Relations
with Rules and Machine Learning. In Proceedings of the Third
International Workshop on Social Data on the Web.
[5] http://wiki.foaf-project.org/w/Smushing