Re: [GSOC] Entity Disambiguation Engine based on FOAF co-reference

Rafa Haro Mon, 10 Jun 2013 11:43:09 -0700

Hi Dileepa,

Congratulations again on your GSOC proposal. It's quite clear and well explained. Please find some thoughts about your mail inline:


On 10/06/13 11:56, Dileepa Jayakody wrote:

Hi All,

I have started working on my GSOC project: FOAF co-reference based entity
disambiguation engine
<http://www.google-melange.com/gsoc/proposal/review/google/gsoc2013/dileepaj/1>.
Last week I mainly spent reading papers on entity disambiguation and
Stanbol documentation, especially on the enhancement structure [1], to gain
an overall idea of what I have to do in my project. I also looked at the
previous work on the Solr MLT based entity disambiguation engine by Kritarth,
last year's GSOC student, and the related mail thread @stanbol-dev [2].

I would like to formulate a project plan based on the to-dos in my project
and by incorporating all your suggestions/advice. I would very much like
your ideas, suggestions and pointers to relevant docs to enhance my
knowledge in the process.

My overall understanding of the Stanbol enhancement process is:

1. Parsing content
2. Content type, language detection
3. Named entity recognition (extract persons, organizations, places)
against a knowledge base or entity index where we have a known set of
entities (EntityHub)
4. List all suggested entities with a confidence (an identified noun/phrase
could refer to multiple entities)
     4.1. Group/cluster entities based on detected 'Named entities'
     4.2. Disambiguate entities
5. Show results

That could be the right workflow. Please, in order to reflect disambiguation results in the Enhancement Structure, consider following Rupert's comments at [STANBOL-1037].


*To-Do's in my project*

I intend to use the existing SolrMLT based disambiguation engine as a base
for my project since it's developed to work with any custom vocabulary. In
my project this vocabulary is FOAF. As per my understanding if I can
configure an *entity index* with SolrMLT based engine, then it can perform
disambiguation using that index. Currently the used entity-index is dbpedia
(please correct me if wrong).
In a previous mail thread on the MLT based engine it's mentioned:
"SolrMLT disambiguation Engine is based on the SimilarityConstraint
supported by FieldQuery interface implemented by the Stanbol Entityhub."
Can I use/extend the FieldQuery interface for my foaf based engine as well?
Look forward to your guidance on this.

I would need to take a deeper look into the Disambiguation-MLT engine, but I would say that just reusing or extending the FieldQuery interface wouldn't be enough. AFAIK Disambiguation-MLT uses the SolrMLT feature to "compare" (actually it's a text similarity measure) the context of the entities in the ContentItem with a configured field within the EntityHub. I think that for DBpedia, it was using the entities' short abstracts. According to your proposal, you plan to use exact matching between FOAF properties (familyName, givenName...) and keywords in the ContentItem. So you might not want to use term-frequency similarity approaches for that, because they wouldn't work well when comparing text windows against keywords. Maybe a co-occurrence analysis approach would fit better. In that sense, your problem is closely related to Word Sense Disambiguation, and maybe some techniques from that field can be applied.
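
As a rough illustration of the co-occurrence idea, here is a minimal Python sketch. It is not Stanbol code: the candidate data, the simplified property keys and the function names (`tokenize`, `cooccurrence_score`, `disambiguate`) are all hypothetical, and the whole text is used as the context window. Each candidate is scored by how many of its property values also appear in the text around the mention.

```python
import re

# Hypothetical candidates for the ambiguous mention "John Smith".
# Keys are simplified FOAF-like labels and the values are made up.
CANDIDATES = {
    "urn:example:john-smith-1": {
        "givenName": "John",
        "familyName": "Smith",
        "org": "Apache",
        "city": "London",
    },
    "urn:example:john-smith-2": {
        "givenName": "John",
        "familyName": "Smith",
        "org": "Opera",
        "city": "Oslo",
    },
}


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def cooccurrence_score(properties, context_tokens):
    """Count how many property values occur (all tokens) in the context."""
    context = set(context_tokens)
    score = 0
    for value in properties.values():
        value_tokens = tokenize(value)
        if value_tokens and all(tok in context for tok in value_tokens):
            score += 1
    return score


def disambiguate(candidates, text):
    """Return candidate URIs ranked by co-occurrence score, best first."""
    tokens = tokenize(text)
    scored = [
        (cooccurrence_score(props, tokens), uri)
        for uri, props in candidates.items()
    ]
    # Sort by descending score; ties are broken by URI for determinism.
    return [uri for _, uri in sorted(scored, key=lambda p: (-p[0], p[1]))]


text = "John Smith presented the Opera browser roadmap in Oslo."
ranking = disambiguate(CANDIDATES, text)
```

Both candidates match on the name, so only the extra co-occurring values (organisation, city) separate them. A real engine would probably restrict the context to a window around the mention and weight properties differently.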

You also propose to use the relationships within the FOAF social graph for disambiguation. In my opinion, such an approach can be generalized to any graph-based Knowledge Base like DBpedia or Freebase. There is also another GSOC proposal planning to explore a graph based disambiguation engine, so it would be great if both of you could collaborate on this.
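
To make the graph idea a bit more concrete, here is a minimal sketch, assuming the social graph is available as a set of foaf:knows edges. The URIs, the edge set and the function names are hypothetical, and the exhaustive search is only meant for illustration. It picks one candidate per mention so that the chosen entities are as connected to each other as possible.

```python
from itertools import product

# Hypothetical foaf:knows edges between person URIs (treated as undirected).
KNOWS = {
    ("ex:alice-1", "ex:bob-1"),
    ("ex:bob-1", "ex:carol-2"),
    ("ex:alice-2", "ex:dave-1"),
}


def connected(a, b, edges=KNOWS):
    """True if a foaf:knows edge exists between a and b in either direction."""
    return (a, b) in edges or (b, a) in edges


def coherence(assignment, edges=KNOWS):
    """Number of foaf:knows links among the chosen candidates."""
    uris = list(assignment)
    return sum(
        1
        for i in range(len(uris))
        for j in range(i + 1, len(uris))
        if connected(uris[i], uris[j], edges)
    )


def best_assignment(candidates_per_mention, edges=KNOWS):
    """Try every combination of candidates and keep the most coherent one."""
    mentions = sorted(candidates_per_mention)
    best = max(
        product(*(candidates_per_mention[m] for m in mentions)),
        key=lambda combo: coherence(combo, edges),
    )
    return dict(zip(mentions, best))


mentions = {
    "Alice": ["ex:alice-1", "ex:alice-2"],
    "Bob": ["ex:bob-1"],
    "Carol": ["ex:carol-1", "ex:carol-2"],
}
result = best_assignment(mentions)
```

Nothing here is specific to FOAF: the same scoring would work over any set of entity-to-entity links, for example from DBpedia or Freebase. Trying every combination grows exponentially with the number of mentions, so a real engine would need a cheaper search strategy.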


In my project I will mainly need to do the following tasks, as per my
current understanding:

1. Creating an EntityIndex capable of indexing a FOAF dataset.
The underlying EntityHub Site could be dbpedia, freebase, openlink or
foaf-search [3] or any FOAF datasource.
(I'm wondering whether it's a good idea to integrate foaf-search as an
entity index in Stanbol, since it covers a large FOAF dataset and provides
a REST API to access the data. WDYT?) I also think there are many websites
exposing their contact data as FOAF (eg: http://iwlearn.net/,
opera-community). Therefore it will be great to develop the FOAF
EntityIndex to be as generic as possible, de-coupled from the underlying
Site. I look forward to the necessary directions and your help to create a
generic architecture for this FOAF based entity index.

One important issue to consider here is the availability of disambiguation data. I mean, where is the FOAF data going to be stored? Are you planning to retrieve it 'on the fly' or should it be in a local Knowledge Base? Retrieving live data could be quite inefficient and you would be relying on third-party services. So, if you are going to store the FOAF data locally, then you need to decide how you are going to do it. Maybe for the DBpedia EntityHub site, you can configure the indexing tool in Stanbol to also harvest the FOAF information or, as you have said, you can create your custom site with FOAF information from other resources.


2. Developing the disambiguation algorithm.
My proposal is based on the FOAF-coreference based disambiguation algorithm
mentioned in the paper [4].
Later I came to know about concepts such as FOAF scuttering and smushing
[5] for FOAF based disambiguation.
Need to design a suitable algorithm to disambiguate over FOAF entities.
There are many research papers on machine learning co-reference techniques
for disambiguation. Look forward to your inputs on this.

Let me take a look at the paper. It seems promising :-)
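
For reference, the smushing idea mentioned above can be sketched as follows. This is only an illustration under assumptions and not the algorithm from the paper [4]: the records are made up, the function name `smush` is hypothetical, and only three FOAF inverse functional properties (foaf:mbox, foaf:mbox_sha1sum, foaf:homepage) are considered. Records sharing a value for any of these properties are merged into one group using a small union-find.

```python
# FOAF properties that identify at most one individual per value.
IFPS = ("foaf:mbox", "foaf:mbox_sha1sum", "foaf:homepage")


def smush(records):
    """Group record ids that share a value for any inverse functional property."""
    parent = {rid: rid for rid in records}

    def find(x):
        # Follow parent links to the group representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    seen = {}  # (property, value) -> first record id carrying that value
    for rid, props in records.items():
        for prop in IFPS:
            value = props.get(prop)
            if value is None:
                continue
            key = (prop, value)
            if key in seen:
                union(rid, seen[key])
            else:
                seen[key] = rid

    groups = {}
    for rid in records:
        groups.setdefault(find(rid), set()).add(rid)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical FOAF descriptions harvested from different documents.
records = {
    "a": {"foaf:name": "J. Smith", "foaf:mbox": "mailto:j@example.org"},
    "b": {"foaf:mbox": "mailto:j@example.org",
          "foaf:homepage": "http://example.org/j"},
    "c": {"foaf:homepage": "http://example.org/j"},
    "d": {"foaf:name": "J. Smith"},
}
groups = smush(records)
```

Note that record "d" stays separate even though it shares a name with "a": foaf:name is not an inverse functional property, so a matching name alone is not treated as evidence of identity in this sketch.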


In general it will be great to receive your ideas, pointers as much as
possible on my project to formulate a project plan.

Regards,
Dileepa

[1]
http://stanbol.apache.org/docs/trunk/components/enhancer/enhancementstructure.htm
[2]
http://markmail.org/message/aubyruemy324o7ut#query:+page:1+mid:do5xwwfs3333w72c+state:results
[3] http://www.foaf-search.net/
[4] Jennifer Sleeman and Tim Finin. Computing FOAF Co-reference Relations
with Rules and Machine Learning. In Proceedings of the Third
International Workshop on Social Data on the Web.
[5] http://wiki.foaf-project.org/w/Smushing



