On Tue, Jun 11, 2013 at 12:12 AM, Rafa Haro <rh...@zaizi.com> wrote:

> Hi Dileepa,
>
> Congratulations again on your GSOC proposal. It's quite clear and well
> explained. Please find some thoughts on your mail inline:
>

Thanks a lot Rafa for your valuable input.

>
> On 10/06/13 11:56, Dileepa Jayakody wrote:
>
>> Hi All,
>>
>> I have started working on my GSOC project: the FOAF co-reference based
>> entity disambiguation engine
>> <http://www.google-melange.com/gsoc/proposal/review/google/gsoc2013/dileepaj/1>
>>
>> Last week I mainly read papers on entity disambiguation and the Stanbol
>> documentation, especially on the enhancement structure [1], to get an
>> overall idea of what I have to do in my project. I also looked at the
>> previous work on the Solr MLT based entity disambiguation engine by
>> Kritarth, last year's GSOC student, and the related thread on stanbol-dev
>> [2].
>>
>> I would like to formulate a project plan based on the to-do list for my
>> project, incorporating all your suggestions and advice. I would very much
>> appreciate your ideas, suggestions and pointers to relevant docs to improve
>> my understanding along the way.
>>
>> My overview of the Stanbol enhancement process is:
>>
>> 1. Parse the content
>> 2. Detect content type and language
>> 3. Named entity recognition (extract persons, organizations, places)
>> against a knowledge base or entity index where we have a known set of
>> entities (EntityHub)
>> 4. List all suggested entities with a confidence value (an identified noun
>> phrase could refer to multiple entities)
>>      4.1. Group/cluster entities based on detected named entities
>>      4.2. Disambiguate entities
>> 5. Show results
>>
> That could be the right workflow. In order to reflect disambiguation
> results in the Enhancement Structure, please consider following Rupert's
> comments at [STANBOL-1037].
>

I will go through the JIRA issue to get a better idea of the enhancement
structure and disambiguation.
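
From my reading so far, I imagine the disambiguation step will mostly
re-rank the existing fise:EntityAnnotation suggestions by rewriting their
fise:confidence values in the ContentItem metadata (subject to whatever
STANBOL-1037 decides). Just to check my understanding, a rough, untested
sketch of that with the Clerezza MGraph API (the fise:confidence URI is
written out explicitly) could look like this:

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.clerezza.rdf.core.LiteralFactory;
import org.apache.clerezza.rdf.core.MGraph;
import org.apache.clerezza.rdf.core.Triple;
import org.apache.clerezza.rdf.core.UriRef;
import org.apache.clerezza.rdf.core.impl.TripleImpl;

public class ConfidenceUpdateSketch {

    // fise:confidence property of the enhancement structure
    private static final UriRef CONFIDENCE =
            new UriRef("http://fise.iks-project.eu/ontology/confidence");
    private static final LiteralFactory lf = LiteralFactory.getInstance();

    // Replaces the confidence of one fise:EntityAnnotation with the score
    // computed by the disambiguation step.
    public static void setConfidence(MGraph metadata, UriRef entityAnnotation,
                                     double disambiguatedScore) {
        // collect the old confidence triples first, so we do not modify the
        // graph while iterating over it
        List<Triple> old = new ArrayList<Triple>();
        Iterator<Triple> it = metadata.filter(entityAnnotation, CONFIDENCE, null);
        while (it.hasNext()) {
            old.add(it.next());
        }
        metadata.removeAll(old);
        metadata.add(new TripleImpl(entityAnnotation, CONFIDENCE,
                lf.createTypedLiteral(Double.valueOf(disambiguatedScore))));
    }
}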

>
>> *To-Do's in my project*
>>
>>
>> I intend to use the existing SolrMLT based disambiguation engine as a base
>> for my project, since it is developed to work with any custom vocabulary;
>> in my project this vocabulary is FOAF. As per my understanding, if I can
>> configure an *entity index* with the SolrMLT based engine, it can perform
>> disambiguation using that index. Currently the entity index used is
>> DBpedia (please correct me if I am wrong).
>> In a previous mail thread on the MLT based engine it is mentioned:
>> "SolrMLT disambiguation Engine is based on the SimilarityConstraint
>> supported by FieldQuery interface implemented by the Stanbol Entityhub."
>> Can I use/extend the FieldQuery interface for my FOAF based engine as
>> well? I look forward to your guidance on this.
>>
> I would need to take a deeper look into the Disambiguation-MLT engine, but
> I would say that it wouldn't be enough to just reuse or extend the
> FieldQuery interface. AFAIK Disambiguation-MLT uses the Solr MLT feature to
> "compare" (actually it's a text similarity measure) the context of the
> entities in the ContentItem with a configured field within the EntityHub. I
> think that for DBpedia it was using the entities' short abstract. According
> to your proposal, you plan to use exact matching between FOAF properties
> (familyName, givenName...) and keywords in the ContentItem. So you might
> not want to use similarity/term-frequency approaches for that, because they
> wouldn't work well with text windows versus keywords. Maybe a co-occurrence
> analysis approach would fit better. In that sense, your problem is quite
> closely related to Word Sense Disambiguation, and maybe some techniques
> from this field can be applied.
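
The co-occurrence idea sounds good, thanks. To check my understanding, a
very naive first version could score each candidate foaf:Person by how many
of its FOAF property values (givenName, familyName, labels of foaf:knows
contacts, and so on) also appear among the keywords detected in the content.
A toy sketch in plain Java (all names and the normalisation are just
placeholders for illustration):

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CooccurrenceScorer {

    // Counts how many FOAF property values of a candidate entity also occur
    // among the keywords/named entities extracted from the content, and
    // normalises by the number of property values.
    public static double score(List<String> candidateFoafValues,
                               Set<String> contentKeywords) {
        if (candidateFoafValues.isEmpty()) {
            return 0.0;
        }
        int matches = 0;
        for (String value : candidateFoafValues) {
            if (contentKeywords.contains(value.toLowerCase())) {
                matches++;
            }
        }
        return (double) matches / candidateFoafValues.size();
    }

    public static void main(String[] args) {
        Set<String> keywords = new HashSet<String>(
                Arrays.asList("obama", "white house", "washington"));
        List<String> candidate = Arrays.asList("Obama", "White House", "Hawaii");
        System.out.println(score(candidate, keywords)); // prints 0.666...
    }
}

Of course real keyword matching would need proper tokenisation and
normalisation, but something along these lines could serve as a baseline to
compare the co-occurrence approach against.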
>
> You also propose to use the relationships within the FOAF social graph for
> disambiguation. In my opinion, such an approach can be generalized to any
> graph-structured Knowledge Base like DBpedia or Freebase. There is also
> another GSOC proposal planning to explore a graph based disambiguation
> engine, so maybe it would be great if the two of you could collaborate on
> this.

Yes, it would be great to work with Antonio on this. The Freebase dataset
also has a significant amount of FOAF data.
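
For the graph based part, my rough intuition is to prefer candidates whose
foaf:knows neighbourhood overlaps with the candidates of the other mentions
in the same content. A toy sketch of that idea in plain Java (the foaf:knows
adjacency map would in practice be built from the FOAF data in the entity
index; all URIs below are made up for illustration):

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class FoafGraphScorer {

    // Scores a candidate URI by counting how many candidate URIs of the
    // *other* mentions in the document it is linked to via foaf:knows.
    public static int score(String candidate,
                            Set<String> otherMentionCandidates,
                            Map<String, Set<String>> knows) {
        Set<String> neighbours = knows.containsKey(candidate)
                ? knows.get(candidate) : new HashSet<String>();
        int links = 0;
        for (String other : otherMentionCandidates) {
            if (neighbours.contains(other)) {
                links++;
            }
        }
        return links;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> knows = new HashMap<String, Set<String>>();
        knows.put("http://example.org/person/john_smith_1",
                new HashSet<String>(Arrays.asList(
                        "http://example.org/person/jane_doe")));
        Set<String> others = new HashSet<String>(Arrays.asList(
                "http://example.org/person/jane_doe"));
        // john_smith_1 scores higher because he foaf:knows jane_doe, who is
        // also mentioned in the same document
        System.out.println(score("http://example.org/person/john_smith_1",
                others, knows)); // 1
        System.out.println(score("http://example.org/person/john_smith_2",
                others, knows)); // 0
    }
}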

>
>
>> In my project I will mainly need to carry out the following tasks, as per
>> my current understanding:
>>
>> 1. Creating an EntityIndex capable of indexing a FOAF dataset.
>> The underlying EntityHub Site could be DBpedia, Freebase, OpenLink,
>> foaf-search [3] or any other FOAF datasource.
>> (I'm wondering whether it's a good idea to integrate foaf-search as an
>> entity index in Stanbol, since it covers a large FOAF dataset and offers a
>> REST API to access the data. WDYT?) I also think there are many websites
>> exposing their contact data as FOAF (e.g. http://iwlearn.net/,
>> opera-community). Therefore it would be great to develop the FOAF
>> EntityIndex to be as generic as possible, de-coupled from the underlying
>> Site. I look forward to your directions and help in creating a generic
>> architecture for this FOAF based entity index.
>>
> One important issue to consider here is disambiguation data availability.
> I mean, where is the FOAF data going to be stored? Are you planning to
> retrieve it 'on the fly', or should it be in a local Knowledge Base?
> Retrieving live data could be quite inefficient and you would be relying on
> third-party services. So, if you are going to store the FOAF data locally,
> then you need to decide how you are going to do it. Maybe for the DBpedia
> EntityHub site you can configure the indexing tool in Stanbol to also
> harvest the FOAF information or, as you have said, you can create your
> custom site with FOAF information from other resources.


I have been searching for a sizeable and valid set of FOAF data for my
project (I also mailed foaf-dev about this). Most of the datasets listed at
[6,7] seem to be obsolete. There are projects like iwlearn and
opera-community exposing their contacts as FOAF, but I'm afraid they will
not include data about well-known personalities such as presidents and
celebrities, as Wikipedia/DBpedia does. So yes, deciding which datasource to
use is a main task in the project.

My suggestion of integrating foaf-search [3] would basically require
on-the-fly retrieval of data, which, as you have pointed out, could impose a
performance hit. Still, foaf-search looks promising with its big index of
FOAF data.
I think configuring the DBpedia site for its FOAF data is the best starting
point for my project. I guess the 'dbpedia-ont:Person' type can be matched
with foaf:Person for this purpose. WDYT?
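
As a first experiment on top of that, I imagine building a FieldQuery
restricted to foaf:Person with the detected name as a text constraint, and
then running the disambiguation scoring over the results. A rough, untested
sketch assuming the Entityhub FieldQuery API that the MLT engine already
uses (the FieldQueryFactory would come from the configured site; exact class
and method names still to be confirmed):

import org.apache.stanbol.entityhub.servicesapi.query.FieldQuery;
import org.apache.stanbol.entityhub.servicesapi.query.FieldQueryFactory;
import org.apache.stanbol.entityhub.servicesapi.query.ReferenceConstraint;
import org.apache.stanbol.entityhub.servicesapi.query.TextConstraint;

public class FoafPersonQuerySketch {

    public static FieldQuery buildQuery(FieldQueryFactory factory,
                                        String detectedName) {
        FieldQuery query = factory.createFieldQuery();
        // only consider entities typed as foaf:Person
        query.setConstraint("http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
                new ReferenceConstraint("http://xmlns.com/foaf/0.1/Person"));
        // match the detected name against foaf:name
        query.setConstraint("http://xmlns.com/foaf/0.1/name",
                new TextConstraint(detectedName));
        // select the FOAF properties needed later for disambiguation
        query.addSelectedField("http://xmlns.com/foaf/0.1/name");
        query.addSelectedField("http://xmlns.com/foaf/0.1/givenName");
        query.addSelectedField("http://xmlns.com/foaf/0.1/familyName");
        query.setLimit(10);
        return query;
    }
}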

Thanks,
Dileepa

>
>
>> 2. Developing the disambiguation algorithm.
>> My proposal is based on the FOAF co-reference based disambiguation
>> algorithm described in the paper [4].
>> Later I came to know about concepts such as FOAF scuttering and smushing
>> [5] for FOAF based disambiguation.
>> I need to design a suitable algorithm to disambiguate over FOAF entities.
>> There are many research papers on machine learning co-reference techniques
>> for disambiguation. I look forward to your inputs on this.
>>
> Let me take a look at the paper. It seems promising :-)
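
Thanks! In the meantime, my understanding of smushing [5] is basically this
rule: two foaf:Person descriptions are merged when they share a value of an
inverse functional property such as foaf:mbox, foaf:mbox_sha1sum or
foaf:homepage. A tiny sketch of that rule in plain Java (the example values
are made up; a real smusher would also apply the rule transitively, e.g.
with union-find):

import java.util.HashSet;
import java.util.Set;

public class SmushingRule {

    // Two foaf:Person descriptions are taken to describe the same person if
    // they share at least one value of an inverse functional property
    // (foaf:mbox, foaf:mbox_sha1sum, foaf:homepage, ...).
    public static boolean sameIfpValue(Set<String> ifpValuesOfA,
                                       Set<String> ifpValuesOfB) {
        Set<String> intersection = new HashSet<String>(ifpValuesOfA);
        intersection.retainAll(ifpValuesOfB);
        return !intersection.isEmpty();
    }

    public static void main(String[] args) {
        Set<String> a = new HashSet<String>();
        a.add("mbox:mailto:jsmith@example.org");
        a.add("homepage:http://example.org/~jsmith");
        Set<String> b = new HashSet<String>();
        b.add("homepage:http://example.org/~jsmith");
        System.out.println(sameIfpValue(a, b)); // true -> merge the records
    }
}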
>
>
>> In general it will be great to receive as many ideas and pointers on my
>> project as possible, to help me formulate a project plan.
>>
>> Regards,
>> Dileepa
>>
>> [1] http://stanbol.apache.org/docs/trunk/components/enhancer/enhancementstructure.htm
>> [2] http://markmail.org/message/aubyruemy324o7ut#query:+page:1+mid:do5xwwfs3333w72c+state:results
>> [3] http://www.foaf-search.net/
>> [4] Jennifer Sleeman and Tim Finin. Computing FOAF Co-reference Relations
>> with Rules and Machine Learning. In Proceedings of the Third International
>> Workshop on Social Data on the Web.
>> [5] http://wiki.foaf-project.org/w/Smushing
>> [6] http://www.w3.org/wiki/FoafSites
>> [7] http://wiki.foaf-project.org/w/DataSources
