On Tue, Jun 11, 2013 at 12:12 AM, Rafa Haro <rh...@zaizi.com> wrote:

> Hi Dileepa,
>
> Congratulations again on your GSOC proposal. It's quite clear and well
> explained. Please find some thoughts about your mail inline:
>
Thanks a lot, Rafa, for your valuable input.

> On 10/06/13 11:56, Dileepa Jayakody wrote:
>
>> Hi All,
>>
>> I have started working on my GSOC project: FOAF co-reference based entity
>> disambiguation engine
>> <http://www.google-melange.com/gsoc/proposal/review/google/gsoc2013/dileepaj/1>.
>>
>> Last week I mainly spent reading papers on entity disambiguation and the
>> Stanbol documentation, especially on the enhancement structure [1], to gain
>> an overall idea of what I have to do in my project. I also looked at the
>> previous work on the Solr MLT based entity disambiguation engine by
>> Kritarth, last year's GSOC student, and the related mail thread
>> @stanbol-dev [2].
>>
>> I would like to formulate a project plan based on a 'to do' list for my
>> project, incorporating all your suggestions and advice. I would very much
>> like your ideas, suggestions and pointers to relevant docs to enhance my
>> knowledge in the process.
>>
>> My overview of the Stanbol enhancement process is:
>>
>> 1. Parsing content
>> 2. Content type and language detection
>> 3. Named entity recognition (extract persons, organizations, places)
>>    against a knowledge base or entity index where we have a known set of
>>    entities (EntityHub)
>> 4. List all suggested entities with a confidence (an identified noun or
>>    phrase could refer to multiple entities)
>>    4.1. Group/cluster entities based on detected 'Named entities'
>>    4.2. Disambiguate entities
>> 5. Show results
>>
> That could be the right workflow. In order to reflect disambiguation
> results in the Enhancement Structure, please consider following Rupert's
> comments at [STANBOL-1037].
>

I will go through the JIRA to get a better idea of the enhancement structure
and disambiguation.
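As a rough illustration of steps 4 to 4.2 in the workflow above, the following plain Python sketch groups candidate entities by mention and re-ranks them with a context score. It is not Stanbol code: the `Suggestion` type, the `group_by_mention` and `disambiguate` helpers, the made-up confidence values and the 50/50 linear blend are all assumptions chosen for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Suggestion:
    """A candidate entity for a mention found in the text (hypothetical type)."""
    mention: str       # surface form detected by NER, e.g. "Paris"
    entity_uri: str    # candidate entity in the entity index
    confidence: float  # initial confidence from the linking step, in [0, 1]


def group_by_mention(suggestions):
    """Step 4.1: cluster candidate entities by the named entity they refer to."""
    groups = defaultdict(list)
    for s in suggestions:
        groups[s.mention].append(s)
    return dict(groups)


def disambiguate(groups, context_score, weight=0.5):
    """Step 4.2: re-rank each group by blending in a context-based score.

    `context_score` is any callable mapping a Suggestion to a value in [0, 1].
    The linear blend and the default weight are illustrative assumptions.
    """
    ranked = {}
    for mention, candidates in groups.items():
        rescored = [
            (s.entity_uri, (1 - weight) * s.confidence + weight * context_score(s))
            for s in candidates
        ]
        ranked[mention] = sorted(rescored, key=lambda pair: pair[1], reverse=True)
    return ranked


# Made-up input: two candidates for the same mention.
suggestions = [
    Suggestion("Paris", "http://dbpedia.org/resource/Paris", 0.6),
    Suggestion("Paris", "http://dbpedia.org/resource/Paris_Hilton", 0.5),
]
# Pretend the surrounding text is about a person, not a city.
person_context = {"http://dbpedia.org/resource/Paris_Hilton": 0.9}
result = disambiguate(
    group_by_mention(suggestions),
    context_score=lambda s: person_context.get(s.entity_uri, 0.1),
)
```

With these made-up numbers the person candidate overtakes the city candidate, because the context score outweighs the small difference in initial confidence. How the context score is computed is exactly the open question of the project.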
>
>> *To-Do's in my project*
>>
>> I intend to use the existing SolrMLT based disambiguation engine as a base
>> for my project, since it was developed to work with any custom vocabulary.
>> In my project this vocabulary is FOAF. As I understand it, if I can
>> configure an *entity index* with the SolrMLT based engine, then it can
>> perform disambiguation using that index. The entity index currently used
>> is DBpedia (please correct me if I'm wrong).
>> In a previous mail thread on the MLT based engine it is mentioned:
>> "SolrMLT disambiguation Engine is based on the SimilarityConstraint
>> supported by FieldQuery interface implemented by the Stanbol Entityhub."
>> Can I use/extend the FieldQuery interface for my FOAF based engine as
>> well?
>> I look forward to your guidance on this.
>>
> I would need to take a deeper look into the Disambiguation-MLT engine, but
> I would say that just reusing or extending the FieldQuery interface
> wouldn't be enough. AFAIK Disambiguation-MLT uses the SolrMLT feature to
> "compare" (actually it's a text similarity measure) the context of the
> entities in the ContentItem with a configured field within the EntityHub.
> I think that for DBpedia it was using the entities' short abstract.
> According to your proposal, you plan to use exact matching between FOAF
> properties (familyName, givenName...) and keywords in the ContentItem. So
> you might not want to use term-frequency similarity approaches for that,
> because they wouldn't work well when comparing text windows against
> keywords. Maybe a co-occurrence analysis approach would fit better. In
> that sense, your problem is closely related to Word Sense Disambiguation,
> and maybe some techniques from that field can be applied.
>
> You also propose to use the relationships within the FOAF social graph for
> disambiguation. In my opinion, such an approach can be generalized to any
> graph-based knowledge base like DBpedia or Freebase.
> There is also another GSOC proposal planning to explore a graph based
> disambiguation engine, so it would be great if the two of you could
> collaborate on this.

Yes, it would be great to work with Antonio on this. The Freebase dataset also
has a significant amount of FOAF data.

>
>> In my project I will mainly need to do the following tasks, as per my
>> current understanding:
>>
>> 1. Creating an EntityIndex capable of indexing a FOAF dataset.
>> The underlying EntityHub Site could be DBpedia, Freebase, OpenLink,
>> foaf-search [3] or any FOAF data source.
>> (I'm wondering whether it's a good idea to integrate foaf-search as an
>> entity index in Stanbol, since it covers a large FOAF dataset and offers a
>> REST API to access the data. WDYT?) I also think there are many websites
>> exposing their contact data as FOAF (e.g. http://iwlearn.net/,
>> opera-community). Therefore it would be great to develop the FOAF
>> EntityIndex to be as generic as possible, decoupled from the underlying
>> Site. I look forward to the necessary directions and your help in creating
>> a generic architecture for this FOAF based entity index.
>>
> One important issue to consider here is the availability of disambiguation
> data. I mean, where is the FOAF data going to be stored? Are you planning
> to retrieve it 'on the fly', or should it be in a local knowledge base?
> Retrieving live data could be quite inefficient, and you would be relying
> on third-party services. So, if you are going to store the FOAF data
> locally, you need to decide how you are going to do it. Maybe for the
> DBpedia EntityHub site you can configure the indexing tool in Stanbol to
> also harvest the FOAF information or, as you have said, you can create
> your custom site with FOAF information from other resources.

I have been searching for a significant and valid set of FOAF data for my
project (I also mailed foaf-dev about this). Most of the datasets listed in
[6,7] seem to be obsolete.
There are projects like iwlearn and opera-community exposing their contacts as
FOAF, but I'm afraid they will not include data about well-known
personalities such as presidents and celebrities, as Wikipedia/DBpedia do. So
yes, the data source to use is a main decision we need to make for the
project. My suggestion of integrating foaf-search [3] would basically require
on-the-fly retrieval of data which, as you have pointed out, could impose a
performance hit. But foaf-search looks promising, with a big index of FOAF
data. I think configuring the DBpedia site for its FOAF data is the best
starting point for my project. I guess the 'dbpedia-ont:Person' type can be
matched with FOAF Person for this purpose. WDYT?

Thanks,
Dileepa

>
>> 2. Developing the disambiguation algorithm.
>> My proposal is based on the FOAF co-reference based disambiguation
>> algorithm described in the paper [4].
>> Later I came to know about concepts such as FOAF scuttering and smushing
>> [5] for FOAF based disambiguation.
>> I need to design a suitable algorithm to disambiguate over FOAF entities.
>> There are many research papers on machine learning co-reference techniques
>> for disambiguation. I look forward to your inputs on this.
>>
> Let me take a look at the paper. It seems promising :-)
>
>> In general it would be great to receive as many of your ideas and pointers
>> as possible on my project, to formulate a project plan.
>>
>> Regards,
>> Dileepa
>>
>> [1] http://stanbol.apache.org/docs/trunk/components/enhancer/enhancementstructure.htm
>> [2] http://markmail.org/message/aubyruemy324o7ut#query:+page:1+mid:do5xwwfs3333w72c+state:results
>> [3] http://www.foaf-search.net/
>> [4] Jennifer Sleeman and Tim Finin.
>>     Computing FOAF Co-reference Relations with Rules and Machine Learning.
>>     In Proceedings of the Third International Workshop on Social Data on
>>     the Web.
>> [5] http://wiki.foaf-project.org/w/Smushing
>>

[6] http://www.w3.org/wiki/FoafSites
[7] http://wiki.foaf-project.org/w/DataSources
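
Rafa's suggestion earlier in the thread, matching FOAF property values against keywords in the ContentItem by co-occurrence rather than by term-frequency similarity, could look something like the sketch below. It is only an illustration under stated assumptions: the helper names (`tokenize`, `cooccurrence_score`, `best_candidate`), the choice of properties, the example.org profiles and the scoring rule (fraction of property values found in the context) are all made up for the example. It is not the algorithm from [4], nor any Stanbol API.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def cooccurrence_score(context, foaf_profile):
    """Fraction of the profile's property values whose tokens all occur in the context.

    `foaf_profile` maps FOAF property names to literal values; which properties
    to include is an assumption of this sketch.
    """
    context_tokens = tokenize(context)
    values = [v for v in foaf_profile.values() if v]
    if not values:
        return 0.0
    hits = sum(1 for v in values if tokenize(v) <= context_tokens)
    return hits / len(values)


def best_candidate(context, candidates):
    """Return the URI of the candidate profile that co-occurs most with the context."""
    return max(candidates, key=lambda uri: cooccurrence_score(context, candidates[uri]))


# Made-up profiles for two people sharing a family name.
candidates = {
    "http://example.org/people/john-smith": {
        "givenName": "John", "familyName": "Smith", "nick": "jsmith",
    },
    "http://example.org/people/jane-smith": {
        "givenName": "Jane", "familyName": "Smith", "nick": "janes",
    },
}
context = "Smith said John would present the results under his usual handle jsmith."
winner = best_candidate(context, candidates)
```

Here the mention "Smith" alone is ambiguous, but the given name and nickname found in the surrounding text favour the first profile. A real engine would also need to handle partial matches, name variants and the relationships in the FOAF graph, which this sketch ignores.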