Re: [Dbpedia-gsoc] Fwd: Contribute to DbPedia
Hi Abhishek, You are free to contribute :) I will try to keep reviewing PRs, if that is alright.

On Tue, Apr 28, 2015 at 7:47 AM, Abhishek Gupta a.gu...@gmail.com wrote:

Hi all, My proposal was not selected for GSoC, but I still want to continue with my project. So can someone give me some guidelines (if I can continue)? Thanks, Abhishek

On Thu, Apr 9, 2015 at 11:53 PM, Abhishek Gupta a.gu...@gmail.com wrote:

Hi Thiago, Thanks for your reply and reassurance. I have replied to your question about the extraction framework, and I have also created an issue about using bold instances as probable surface forms here: https://github.com/dbpedia-spotlight/dbpedia-spotlight/issues/353. Thanks, Abhishek

On Thu, Apr 9, 2015 at 1:19 AM, Thiago Galery tgal...@gmail.com wrote:

Hi Abhishek, sorry for taking so long to write to you. Things at work have been really busy. About the issue you raised concerning the originality of your proposal, rest assured that no one sent a proposal similar to yours. I'm happy that you sent a PR for the extraction framework; it seems that Dimitris is already taking a look at it. As for your suggestions for Spotlight, simply removing the stopword filter is something I don't really advise, because I remember it letting in a lot of junk once. Maybe it should be modified somehow. If you have a good idea and want to send a PR, it would be very welcome. I think discussing things on GitHub would be better.
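One way the filter could be "modified somehow" rather than removed is sketched below. This is a hypothetical Python illustration (the real filter is Scala code in ExtractCandidateMap.scala); the function name and the tiny stopword set are assumptions, not Spotlight's actual API.

```python
# Hypothetical sketch, not Spotlight's implementation.
STOPWORDS = {"am", "home", "the", "who", "can't"}  # toy subset of the real list

def keep_candidate(title: str) -> bool:
    """Decide whether a conceptURI title survives the stopword filter.

    Instead of dropping every title whose label is a stopword, keep:
      - disambiguated titles:  AM_(musician), Home_(2015_film)
      - multi-word titles:     The_Who
      - single words with more than one capital letter: AM
    A bare stopword title like "Home" is still dropped, mirroring the
    trade-off discussed in the thread.
    """
    if "_(" in title:      # disambiguation suffix => clearly an entity
        return True
    if "_" in title:       # multi-word title
        return True
    if title.lower() not in STOPWORDS:
        return True
    # stopword head word: require extra capitalisation (one capital is default)
    return sum(1 for c in title if c.isupper()) > 1
```

Under this sketch, AM_(musician), The_Who and AM are kept while a bare Home is still filtered, recovering most of the rejected entities without reopening the door to plain stopwords.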
All the best, Thiago

On Mon, Apr 6, 2015 at 6:15 AM, Abhishek Gupta a.gu...@gmail.com wrote:

Hi all, Recently I was looking into the indexing process of dbpedia-spotlight, and I observed a few things:

1) There is a missing constructor definition in the WikiPage object https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/sources/WikiPage.scala for an instance created in the function wikiPageCopy here https://github.com/dbpedia-spotlight/dbpedia-spotlight/blob/master/index/src/main/scala/org/dbpedia/spotlight/io/DisambiguationContextSource.scala#L67. For this I have created a PR: https://github.com/dbpedia/extraction-framework/pull/377

2) For the stopword filter defined here https://github.com/dbpedia-spotlight/dbpedia-spotlight/blob/master/index/src/main/scala/org/dbpedia/spotlight/util/ExtractCandidateMap.scala#L186, I ran an analysis of the conceptURI extraction against the stopword list here http://wifo5-04.informatik.uni-mannheim.de/downloads/release-0.4/stopwords.en.list. It turned out that we are discarding around 25,481 entities, almost all of them from important categories such as music, film, and bands, e.g. Am_(musician) http://en.wikipedia.org/wiki/AM_(musician), Home_(2015_film) http://en.wikipedia.org/wiki/Home_(2015_film), The_Who http://en.wikipedia.org/wiki/The_Who. And even with a case-sensitive check (keeping an entity only if it contains more than one capital letter, since one is the default), we would still reject single-word entities such as Am or Home. Moreover, the garbage (can't, etc.) we would let in by removing this filter would not be much, so I suggest we remove it.

3) I would like to suggest a surface form extraction: if we extract the bold text in the first line of the Wikipedia article, we can use it as a probable surface form for that entity. E.g.
Stanford_University http://en.wikipedia.org/wiki/Stanford_University, Aon_(company) http://en.wikipedia.org/wiki/Aon_%28company%29, Radio_Warwick http://en.wikipedia.org/wiki/Radio_Warwick, Phi_Gamma_Delta http://en.wikipedia.org/wiki/Phi_Gamma_Delta, etc. These are the best surface forms for the respective entities. Thanks, Abhishek

On Fri, Mar 27, 2015 at 11:56 AM, Abhishek Gupta a.gu...@gmail.com wrote:

Hi all, I would also like to report that in one of the recent mails my proposal went public, when Thiago accidentally sent a mail to both me and the dbpedia-gsoc mailing list. Details of the mails are below. The Google Docs link was in the quotes, and the doc could be seen and even edited by anyone with that link, though nobody has changed its content. Still, there is a chance that someone could copy my ideas, so I request that you take care of this issue, and I hope it will not affect my application. As of now I have changed the sharing settings, so please let me know if there is any access problem.

Mail details: from: Thiago Galery tgal...@gmail.com; to: Abhishek Gupta a.gu...@gmail.com, dbpedia-gsoc dbpedia-gsoc@lists.sourceforge.net; date: Tue, Mar 24, 2015 at 3:47 AM; subject: Re: [Dbpedia-gsoc] Fwd: Contribute to DbPedia

I have also modified my proposal in the Candidate Entity Scoring methodology. Please take a look at it. GSoC proposal link: https://www.google-melange.com/gsoc/proposal/review/student/google/gsoc2015/abhishek_g/5629499534213120 Google Docs Link: https://docs.google.com/document/d
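Point 3 above (using the bolded names in an article's first line as surface forms) can be sketched in a few lines over raw wikitext. The helper name and the abridged lead sentence below are illustrative, not taken from the extraction framework:

```python
import re

def bold_surface_forms(wikitext: str) -> list[str]:
    """Return the '''bold''' spans from the first paragraph of an
    article's wikitext. MediaWiki style bolds the subject's name(s)
    in the lead sentence, so these are strong surface-form candidates."""
    first_paragraph = wikitext.split("\n\n")[0]
    return re.findall(r"'''(.+?)'''", first_paragraph)

# Abridged lead sentence, for illustration only:
lead = ("'''Stanford University''', officially '''Leland Stanford Junior "
        "University''', is a private research university in California.")
print(bold_surface_forms(lead))
# prints ['Stanford University', 'Leland Stanford Junior University']
```

A production version would also have to handle bold-italic markup (five quotes) and templates in the lead, but the idea is the same.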
Re: [Dbpedia-gsoc] Fwd: Contribute to DbPedia
Hi Abhishek, I suggest you submit a proposal straight to GSoC and we can comment on it there. If you have done so already, could you send us the link? All the best, Thiago

On Mon, Mar 23, 2015 at 8:54 AM, Abhishek Gupta a.gu...@gmail.com wrote:

Hi all, Here are some comments on your response:

Hi Abhishek, thanks for the work, here are some answers:

On Tue, Mar 17, 2015 at 9:10 AM, Abhishek Gupta a.gu...@gmail.com wrote:

Hi Thiago, Sorry for the delay! I have set up the spotlight server and it is running perfectly fine, though with minimal settings. After the setup I played with the spotlight server, during which I came across some discrepancies, as follows.

Example taken: http://spotlight.dbpedia.org/rest/annotate?text=First documented in the 13th century, Berlin was the capital of the Kingdom of Prussia (1701–1918), the German Empire (1871–1918), the Weimar Republic (1919–33) and the Third Reich (1933–45). Berlin in the 1920s was the third largest municipality in the world. In 1990 German reunification took place in whole Germany in which the city regained its status as the capital of Germany.

1) If we run this, we annotate 13th Century to http://dbpedia.org/page/19th_century. This might be happening because the context is very much about the 19th century, and there is minimal surface difference between 13th Century and 19th Century (one character). But I am not sure whether this is good or bad.

This might be due either to 13th Century being wrongly linked to 19th century, or to the word century being linked to many different centuries, which then causes a disambiguation error due to the context. I think your example is a counter-example to the way we generate the data structures used for disambiguation.

In my opinion, if we have an entity in our store (http://dbpedia.org/page/13th_century) that perfectly matches a surface form in the raw text (13th Century), we should annotate the SF with that entity.
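The exact-match preference described here could be prototyped as a re-ranking step. This is a hypothetical scorer for illustration only, not how Spotlight's disambiguator is actually implemented:

```python
def pick_entity(surface_form, candidates, context_score):
    """Prefer a candidate whose page title exactly matches the surface
    form, falling back to the context score otherwise. context_score is
    any URI -> float function (e.g. similarity with the paragraph)."""
    def score(uri):
        title = uri.rsplit("/", 1)[-1].replace("_", " ")
        # hypothetical exact-title bonus
        bonus = 1.0 if title.lower() == surface_form.lower() else 0.0
        return context_score(uri) + bonus
    return max(candidates, key=score)

# Toy context scorer: the 19th-century page looks contextually stronger,
# but the exact title match still wins for "13th Century".
best = pick_entity(
    "13th Century",
    ["http://dbpedia.org/page/13th_century",
     "http://dbpedia.org/page/19th_century"],
    lambda uri: 0.9 if "19th" in uri else 0.2,
)
# best == "http://dbpedia.org/page/13th_century"
```

Whether such a hard bonus helps overall recall is exactly the kind of question the thread leaves open.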
And the same might be the case with Germany, which is associated with History of Germany http://dbpedia.org/page/History_of_Germany, not Germany http://dbpedia.org/page/Germany.

In this case other factors might have crept in; it could be that Germany has a bigger number of inlinks, or some other metric allows it to overtake the most natural candidate.

2) We are spotting place and associating it with Portland Place http://dbpedia.org/resource/Portland_Place, maybe due to stemming the SF. And even Location (geography) http://dbpedia.org/page/Location_(geography) is not the correct entity type for this. This is because we are not able to detect the sense of the word place itself, so for that we may have to use word senses, e.g. from WordNet.

The SF spotting pipeline works a bit like this: you get a candidate SF, like 'Portland Place', and see if there is a candidate for that, but you also consider n-gram subparts, so it could have retrieved the candidates associated with place instead.

I understand what you said, but here I wanted to point out that place is not even functioning as a noun, yet we are trying to associate it with a named entity, which is a noun.

3) We are detecting . Berlin as a surface form, but I could not figure out where this SF comes from, and I suspect it does not come from Wikipedia.

Although . Berlin is highlighted, the entity is matched on Berlin; the extra space and punctuation come from the way we tokenize sentences. We have chosen a language-independent tokenizer using a break iterator, for speed and language independence, but it has not been tested very well. This is the area that explains this mistake, and help with it is much appreciated.

Thanks for the clarification.

4) We spotted capital of Germany, but I did not get any candidates when running candidates instead of annotate. This might be due to a default confidence score.

If you pass the extra confidence param and set it to 0, you will probably see everything, e.g.
/candidates/?confidence=0&text= In fact, I suggest you look at all the candidates in the text you used, to confirm (or not) what I have been saying here.

I tried that, but I still did not get any entity candidates for capital of Germany.

5) We are able to spot 1920s as a surface form, but not 1920.

This is due to the generation/stemming of SFs we have discussed, but I am not sure that is a bad example: 1920 used as a year might not mean the same as 1920s.

That was my mistake.

A few more questions: 1) Are we trying to annotate every word, noun, or entity (e.g. proper noun) in the raw text? I ask because in the above link I found documented (a word, not a noun or entity) annotated to http://dbpedia.org/resource/Document.

There are two main spotters: the default one uses a finite state automaton generated from the surface form store to match incoming words as valid sequences of states (so in this sense everything goes through the pipeline); another uses an OpenNLP spotter that gets SFs from an NE extractor.
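The confidence workaround mentioned above can be scripted. This sketch only builds the request URL for the /candidates interface discussed in the thread; the host is the public endpoint Abhishek used, and actually fetching the result requires a running server:

```python
from urllib.parse import urlencode

# Endpoint as used in the thread; adjust host/port for a local server.
BASE = "http://spotlight.dbpedia.org/rest/candidates"

def candidates_url(text: str, confidence: float = 0.0) -> str:
    """Build a /candidates request with confidence lowered to 0 so that
    weak candidates (e.g. for "capital of Germany") are not filtered
    out before you get to inspect them."""
    return BASE + "?" + urlencode({"text": text, "confidence": confidence})

url = candidates_url("Berlin was the capital of the Kingdom of Prussia", 0)
# fetch with urllib.request.urlopen(url) against a running server
```

Setting confidence=0 disables the default threshold, so every scored candidate is returned for inspection.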
[Dbpedia-gsoc] Fwd: Contribute to DbPedia
-- Forwarded message --
From: Thiago Galery tgal...@gmail.com
Date: Tue, Mar 17, 2015 at 11:29 AM
Subject: Re: [Dbpedia-gsoc] Contribute to DbPedia
To: Abhishek Gupta a.gu...@gmail.com

Hi Abhishek, thanks for the work, here are some answers:

On Tue, Mar 17, 2015 at 9:10 AM, Abhishek Gupta a.gu...@gmail.com wrote:

Hi Thiago, Sorry for the delay! I have set up the spotlight server and it is running perfectly fine, though with minimal settings. After the setup I played with the spotlight server, during which I came across some discrepancies, as follows.

Example taken: http://spotlight.dbpedia.org/rest/annotate?text=First documented in the 13th century, Berlin was the capital of the Kingdom of Prussia (1701–1918), the German Empire (1871–1918), the Weimar Republic (1919–33) and the Third Reich (1933–45). Berlin in the 1920s was the third largest municipality in the world. In 1990 German reunification took place in whole Germany in which the city regained its status as the capital of Germany.

1) If we run this, we annotate 13th Century to http://dbpedia.org/page/19th_century. This might be happening because the context is very much about the 19th century, and there is minimal surface difference between 13th Century and 19th Century (one character). But I am not sure whether this is good or bad.

This might be due either to 13th Century being wrongly linked to 19th century, or to the word century being linked to many different centuries, which then causes a disambiguation error due to the context. I think your example is a counter-example to the way we generate the data structures used for disambiguation.

In my opinion, if we have an entity in our store (http://dbpedia.org/page/13th_century) that perfectly matches a surface form in the raw text (13th Century), we should annotate the SF with that entity. And the same might be the case with Germany, which is associated with History of Germany http://dbpedia.org/page/History_of_Germany, not Germany http://dbpedia.org/page/Germany.
In this case other factors might have crept in; it could be that Germany has a bigger number of inlinks, or some other metric allows it to overtake the most natural candidate.

2) We are spotting place and associating it with Portland Place http://dbpedia.org/resource/Portland_Place, maybe due to stemming the SF. And even Location (geography) http://dbpedia.org/page/Location_(geography) is not the correct entity type for this. This is because we are not able to detect the sense of the word place itself, so for that we may have to use word senses, e.g. from WordNet.

The SF spotting pipeline works a bit like this: you get a candidate SF, like 'Portland Place', and see if there is a candidate for that, but you also consider n-gram subparts, so it could have retrieved the candidates associated with place instead.

3) We are detecting . Berlin as a surface form, but I could not figure out where this SF comes from, and I suspect it does not come from Wikipedia.

Although . Berlin is highlighted, the entity is matched on Berlin; the extra space and punctuation come from the way we tokenize sentences. We have chosen a language-independent tokenizer using a break iterator, for speed and language independence, but it has not been tested very well. This is the area that explains this mistake, and help with it is much appreciated.

4) We spotted capital of Germany, but I did not get any candidates when running candidates instead of annotate. This might be due to a default confidence score.

If you pass the extra confidence param and set it to 0, you will probably see everything, e.g. /candidates/?confidence=0&text= In fact, I suggest you look at all the candidates in the text you used, to confirm (or not) what I have been saying here.

5) We are able to spot 1920s as a surface form, but not 1920.

This is due to the generation/stemming of SFs we have discussed, but I am not sure that is a bad example: 1920 used as a year might not mean the same as 1920s.
A few more questions: 1) Are we trying to annotate every word, noun, or entity (e.g. proper noun) in the raw text? I ask because in the above link I found documented (a word, not a noun or entity) annotated to http://dbpedia.org/resource/Document.

There are two main spotters: the default one uses a finite state automaton generated from the surface form store to match incoming words as valid sequences of states (so in this sense everything goes through the pipeline); another uses an OpenNLP spotter that gets SFs from an NE extractor. Both might generate single-noun n-grams. In this case, it could be that there is a link in Wikipedia from documented to Document, which might introduce documented as a valid state in the FSA.

2) Are we using surface forms to deal with only syntactic references (e.g. surface form municipality referring to Municipality http://dbpedia.org/page/Municipality or Metropolitan_municipality
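The default spotter described in the thread compiles the surface form store into a finite state automaton. A toy approximation (my own sketch, not Spotlight's implementation) is a token-level trie scanned with longest match; note how a single link anchor like documented -> Document becomes a one-token accepting state:

```python
def build_trie(surface_forms):
    """Compile surface forms into a token-level trie; a '$' key marks
    an accepting state (a complete surface form)."""
    root = {}
    for sf in surface_forms:
        node = root
        for tok in sf.split():
            node = node.setdefault(tok, {})
        node["$"] = sf
    return root

def spot(tokens, trie):
    """Scan tokens left to right, greedily taking the longest match --
    the gist of matching incoming words as valid sequences of states."""
    spots, i = [], 0
    while i < len(tokens):
        node, j, best = trie, i, None
        while j < len(tokens) and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
            if "$" in node:
                best = (j, node["$"])
        if best:
            spots.append(best[1])
            i = best[0]
        else:
            i += 1
    return spots

# "documented" enters the store because some article links it to Document
trie = build_trie(["Berlin", "Kingdom of Prussia", "documented"])
found = spot("First documented in the 13th century , Berlin was the capital "
             "of the Kingdom of Prussia".split(), trie)
# found == ['documented', 'Berlin', 'Kingdom of Prussia']
```

This also illustrates the ". Berlin" discussion: the match itself is on the token Berlin, and any stray punctuation in the highlighted span comes from how the input was tokenized, not from the store.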