Re: [Dbpedia-gsoc] Fwd: Contribute to DbPedia

2015-04-28 Thread David Przybilla
Hi Abhishek,

You are free to contribute :) I will try to keep on reviewing PRs
if that is alright.



On Tue, Apr 28, 2015 at 7:47 AM, Abhishek Gupta a.gu...@gmail.com wrote:

 Hi all,

 My proposal has not been selected for GSoC, but I still want to
 continue with my project. So could someone provide some guidelines
 (if I may continue)?

 Thanks,
 Abhishek

 On Thu, Apr 9, 2015 at 11:53 PM, Abhishek Gupta a.gu...@gmail.com wrote:

 Hi Thiago,

 Thanks for your reply and assurance.
 Moreover, I replied to your question about the extraction framework, and I
 have also created an issue about using bold instances as probable
 surface forms here:
 https://github.com/dbpedia-spotlight/dbpedia-spotlight/issues/353.

 Thanks,
 Abhishek

 On Thu, Apr 9, 2015 at 1:19 AM, Thiago Galery tgal...@gmail.com wrote:

 Hi Abhishek,
 sorry for taking so long to write to you. Things at work have been
 really busy. Regarding the issue you raised about the originality of your
 proposal, rest assured that no one has sent a proposal similar to yours.

 I'm happy that you sent a PR for the extraction framework. It seems that
 Dimitris is already taking a look at it.
 As for your suggestions for Spotlight, simply removing the stopword filter
 is something I wouldn't really advise, because I remember getting a lot
 of noise once. Maybe it should be modified somehow instead. If you have a
 good idea and want to send a PR, it would be very welcome. I think
 discussing things on GitHub would be better.

 All the best,
 Thiago

 On Mon, Apr 6, 2015 at 6:15 AM, Abhishek Gupta a.gu...@gmail.com
 wrote:

 Hi all,

 Recently I was checking out the indexing process of dbpedia-spotlight,
 and I observed a few things:

 1) There is a missing constructor definition in the WikiPage object (
 https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/sources/WikiPage.scala)
 for the instance created in the function wikiPageCopy here:
 https://github.com/dbpedia-spotlight/dbpedia-spotlight/blob/master/index/src/main/scala/org/dbpedia/spotlight/io/DisambiguationContextSource.scala#L67.
 For this I have created a PR:
 https://github.com/dbpedia/extraction-framework/pull/377 (a constructor
 sketch follows after this list).

 2) For the stopword filter defined here (
 https://github.com/dbpedia-spotlight/dbpedia-spotlight/blob/master/index/src/main/scala/org/dbpedia/spotlight/util/ExtractCandidateMap.scala#L186),
 I analysed the conceptURI extraction against the stopword list here:
 http://wifo5-04.informatik.uni-mannheim.de/downloads/release-0.4/stopwords.en.list.
 The analysis shows that we are discarding around 25,481 entities, almost
 all of them from important categories like music, film, band, etc., e.g.
 Am_(musician) (http://en.wikipedia.org/wiki/AM_(musician)),
 Home_(2015_film) (http://en.wikipedia.org/wiki/Home_(2015_film)) and
 The_Who (http://en.wikipedia.org/wiki/The_Who). Even if we do
 case-sensitive checking (i.e. requiring an entity to contain more than one
 capital letter, since the first one is capitalised by default), we will
 still reject single-word entities like Am, Home, etc. Moreover, the
 garbage ("can't", etc.) we would let in by removing this filter wouldn't
 amount to much. So I suggest we remove this filter (a filter sketch
 follows after this list).

 3) I would like to suggest an additional surface form extraction: if we
 extract the bold text in the first line of a Wikipedia article, we can use
 it as a probable surface form for that entity. E.g. Stanford_University
 (http://en.wikipedia.org/wiki/Stanford_University), Aon_(company)
 (http://en.wikipedia.org/wiki/Aon_%28company%29), Radio_Warwick
 (http://en.wikipedia.org/wiki/Radio_Warwick) and Phi_Gamma_Delta
 (http://en.wikipedia.org/wiki/Phi_Gamma_Delta). These are often the best
 surface forms for the respective entities (an extraction sketch follows
 after this list).
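
 A minimal sketch of the constructor gap described in 1). The field names
 and types here are illustrative assumptions, not the real WikiPage
 signature (see the file linked above for that):

    // Illustrative stand-in for the real WikiTitle class
    case class WikiTitle(decoded: String)

    // Hypothetical WikiPage: the primary constructor carries full revision
    // metadata; a copy helper like wikiPageCopy expects a shorter overload.
    class WikiPage(val title: WikiTitle,
                   val redirect: WikiTitle,
                   val id: Long,
                   val revision: Long,
                   val source: String) {

      // The kind of convenience constructor whose absence breaks the caller:
      // forwards to the primary constructor with neutral defaults.
      def this(title: WikiTitle, source: String) =
        this(title, null, -1L, -1L, source)
    }

    // new WikiPage(WikiTitle("Berlin"), "...markup...") now compiles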
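
 A hedged sketch of the "modify rather than remove" option for the stopword
 filter in 2): drop a candidate surface form only when it is a stopword AND
 lowercase, so title-cased entities like Am, Home or The_Who survive. The
 object name and the tiny stopword set are made up for the example:

    object StopwordFilterSketch {
      // a few entries standing in for stopwords.en.list
      val stopwords: Set[String] = Set("am", "home", "the", "can't")

      // keep the surface form unless its first token is a lowercase stopword
      def keepSurfaceForm(sf: String): Boolean = {
        val head = sf.split("[_ ]").head
        val isStopword = stopwords.contains(head.toLowerCase)
        val looksTitleCased = head.headOption.exists(_.isUpper)
        !isStopword || looksTitleCased
      }

      def main(args: Array[String]): Unit =
        Seq("Am", "Home", "The_Who", "can't").foreach { sf =>
          println(s"$sf -> keep = ${keepSurfaceForm(sf)}")
        }
    }

 With this rule, "Am", "Home" and "The_Who" are kept while lowercase
 garbage like "can't" is still filtered out.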
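
 And a sketch of the bold-text heuristic in 3): in wiki markup, the
 defining phrase of an article is conventionally wrapped in triple quotes
 ('''...''') in the first sentence, so extracting those spans yields
 high-precision surface forms. This is only an illustration of the idea,
 not the implementation proposed in the issue:

    object BoldSurfaceForms {
      private val Bold = """'''(.+?)'''""".r

      // bold spans from the first non-empty line of the article markup
      def fromFirstLine(wikiText: String): Seq[String] = {
        val firstLine = wikiText.split("\n").find(_.trim.nonEmpty).getOrElse("")
        Bold.findAllMatchIn(firstLine).map(_.group(1)).toSeq
      }

      def main(args: Array[String]): Unit = {
        val markup = "'''Stanford University''' is a private research university ..."
        println(fromFirstLine(markup)) // List(Stanford University)
      }
    }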

 Thanks,
 Abhishek

 On Fri, Mar 27, 2015 at 11:56 AM, Abhishek Gupta a.gu...@gmail.com
 wrote:

 Hi all,

 I would also like to mention that in one of the recent mails my proposal
 went public, when Thiago accidentally sent a mail to both me and the
 dbpedia-gsoc mailing list. Details of the mail are below. The Google Docs
 link was in the quoted text, and the doc could be seen and even edited by
 anyone with that link, although nobody has changed its content. Still, I
 believe there is a chance that someone could copy my ideas, so I request
 you to take care of this issue, and I hope it does not affect my
 application.
 As of now I have changed the sharing settings, so please let me know if
 there is any access problem.

 *Mail details:*
 from: Thiago Galery tgal...@gmail.com
 to: Abhishek Gupta a.gu...@gmail.com,
 dbpedia-gsoc dbpedia-gsoc@lists.sourceforge.net
 date: Tue, Mar 24, 2015 at 3:47 AM
 subject: Re: [Dbpedia-gsoc] Fwd: Contribute to DbPedia

 I have also modified the Candidate Entity Scoring methodology in my
 proposal. Please take a look at it.
 GSoC proposal link:
 https://www.google-melange.com/gsoc/proposal/review/student/google/gsoc2015/abhishek_g/5629499534213120
 Google Docs Link:
 https://docs.google.com/document/d

Re: [Dbpedia-gsoc] Fwd: Contribute to DbPedia

2015-04-28 Thread Abhishek Gupta
Hi all,

My proposal has not been selected for GSoC, but I still want to continue
with my project. So could someone provide some guidelines (if I may
continue)?

Thanks,
Abhishek

On Thu, Apr 9, 2015 at 11:53 PM, Abhishek Gupta a.gu...@gmail.com wrote:

 Hi Thiago,

 Thanks for your reply and assurance.
 Moreover, I replied to your question about the extraction framework, and I
 have also created an issue about using bold instances as probable
 surface forms here:
 https://github.com/dbpedia-spotlight/dbpedia-spotlight/issues/353.

 Thanks,
 Abhishek

 On Thu, Apr 9, 2015 at 1:19 AM, Thiago Galery tgal...@gmail.com wrote:

 Hi Abhishek,
 sorry for taking so long to write to you. Things at work have been really
 busy. Regarding the issue you raised about the originality of your
 proposal, rest assured that no one has sent a proposal similar to yours.

 I'm happy that you sent a PR for the extraction framework. It seems that
 Dimitris is already taking a look at it.
 As for your suggestions for Spotlight, simply removing the stopword filter
 is something I wouldn't really advise, because I remember getting a lot
 of noise once. Maybe it should be modified somehow instead. If you have a
 good idea and want to send a PR, it would be very welcome. I think
 discussing things on GitHub would be better.

 All the best,
 Thiago

 On Mon, Apr 6, 2015 at 6:15 AM, Abhishek Gupta a.gu...@gmail.com wrote:

 Hi all,

 Recently I was checking out the indexing process of dbpedia-spotlight,
 and I observed a few things:

 1) There is a missing constructor definition in the WikiPage object (
 https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/sources/WikiPage.scala)
 for the instance created in the function wikiPageCopy here:
 https://github.com/dbpedia-spotlight/dbpedia-spotlight/blob/master/index/src/main/scala/org/dbpedia/spotlight/io/DisambiguationContextSource.scala#L67.
 For this I have created a PR:
 https://github.com/dbpedia/extraction-framework/pull/377

 2) For the stopword filter defined here (
 https://github.com/dbpedia-spotlight/dbpedia-spotlight/blob/master/index/src/main/scala/org/dbpedia/spotlight/util/ExtractCandidateMap.scala#L186),
 I analysed the conceptURI extraction against the stopword list here:
 http://wifo5-04.informatik.uni-mannheim.de/downloads/release-0.4/stopwords.en.list.
 The analysis shows that we are discarding around 25,481 entities, almost
 all of them from important categories like music, film, band, etc., e.g.
 Am_(musician) (http://en.wikipedia.org/wiki/AM_(musician)),
 Home_(2015_film) (http://en.wikipedia.org/wiki/Home_(2015_film)) and
 The_Who (http://en.wikipedia.org/wiki/The_Who). Even if we do
 case-sensitive checking (i.e. requiring an entity to contain more than one
 capital letter, since the first one is capitalised by default), we will
 still reject single-word entities like Am, Home, etc. Moreover, the
 garbage ("can't", etc.) we would let in by removing this filter wouldn't
 amount to much. So I suggest we remove this filter.

 3) I would like to suggest an additional surface form extraction: if we
 extract the bold text in the first line of a Wikipedia article, we can use
 it as a probable surface form for that entity. E.g. Stanford_University
 (http://en.wikipedia.org/wiki/Stanford_University), Aon_(company)
 (http://en.wikipedia.org/wiki/Aon_%28company%29), Radio_Warwick
 (http://en.wikipedia.org/wiki/Radio_Warwick) and Phi_Gamma_Delta
 (http://en.wikipedia.org/wiki/Phi_Gamma_Delta). These are often the best
 surface forms for the respective entities.

 Thanks,
 Abhishek

 On Fri, Mar 27, 2015 at 11:56 AM, Abhishek Gupta a.gu...@gmail.com
 wrote:

 Hi all,

 I would also like to mention that in one of the recent mails my proposal
 went public, when Thiago accidentally sent a mail to both me and the
 dbpedia-gsoc mailing list. Details of the mail are below. The Google Docs
 link was in the quoted text, and the doc could be seen and even edited by
 anyone with that link, although nobody has changed its content. Still, I
 believe there is a chance that someone could copy my ideas, so I request
 you to take care of this issue, and I hope it does not affect my
 application.
 As of now I have changed the sharing settings, so please let me know if
 there is any access problem.

 *Mail details:*
 from: Thiago Galery tgal...@gmail.com
 to: Abhishek Gupta a.gu...@gmail.com,
 dbpedia-gsoc dbpedia-gsoc@lists.sourceforge.net
 date: Tue, Mar 24, 2015 at 3:47 AM
 subject: Re: [Dbpedia-gsoc] Fwd: Contribute to DbPedia

 I have also modified the Candidate Entity Scoring methodology in my
 proposal. Please take a look at it.
 GSoC proposal link:
 https://www.google-melange.com/gsoc/proposal/review/student/google/gsoc2015/abhishek_g/5629499534213120
 Google Docs Link:
 https://docs.google.com/document/d/1U4BvJpGUvL2odVA6VxnYggfEX_hmLSYP4yqhXB7dLQU/edit

 Moreover, I would like to ask one more question, which might help me in
 modelling the problem. In the example texts below, which entity would you

Re: [Dbpedia-gsoc] Fwd: Contribute to DbPedia

2015-03-23 Thread Thiago Galery
Hi Abhishek, I suggest you submit a proposal straight to GSoC and we
comment there. If you have done so already, could you send us the link?
All the best,
Thiago

On Mon, Mar 23, 2015 at 8:54 AM, Abhishek Gupta a.gu...@gmail.com wrote:

 Hi all,

 Here are some comments on your responses:


 Hi Abhishek, thanks for the work, here are some answers:

 On Tue, Mar 17, 2015 at 9:10 AM, Abhishek Gupta a.gu...@gmail.com
 wrote:

 Hi Thiago,

 Sorry for the delay!
 I have set up the Spotlight server and it is running perfectly fine,
 albeit with minimal settings. After this setup I played with the
 Spotlight server, during which I came across some discrepancies, as
 follows:

 Example taken:
 http://spotlight.dbpedia.org/rest/annotate?text=First documented in the
 13th century, Berlin was the capital of the Kingdom of Prussia (1701–1918),
 the German Empire (1871–1918), the Weimar Republic (1919–33) and the Third
 Reich (1933–45). Berlin in the 1920s was the third largest municipality in
 the world. In 1990 German reunification took place in whole Germany in
 which the city regained its status as the capital of Germany.

 1) If we run this, we annotate "13th Century" with
 http://dbpedia.org/page/19th_century. This might be happening because
 the context is very much about the 19th century, and moreover between
 "13th Century" and "19th Century" there is minimal surface difference
 (one character). But I am not sure whether this is good or bad.


 This might be due either to "13th Century" being wrongly linked to 19th
 century, or maybe to the word "century" being linked to many different
 centuries, which then causes a disambiguation error due to the context. I
 think your example is a counter-example to the way we generate the data
 structures used for disambiguation.


 In my opinion, if we have an entity in our store (
 http://dbpedia.org/page/13th_century) that perfectly matches a surface
 form in the raw text ("13th Century"), we should annotate the SF with
 that entity.
 And the same might be the case with "Germany", which is associated with
 History_of_Germany (http://dbpedia.org/page/History_of_Germany) rather
 than Germany (http://dbpedia.org/page/Germany).


 In this case other factors might have crept in; it could be that Germany
 has a bigger number of inlinks, or some other metric allows it to
 overtake the most natural candidate.



 2) We are spotting "place" and associating it with Portland Place
 (http://dbpedia.org/resource/Portland_Place), maybe due to stemming of
 the SF. And even Location (geography)
 (http://dbpedia.org/page/Location_(geography)) is not the correct entity
 type for this. This happens because we are not able to detect the sense
 of the word "place" itself. For that we may have to use word senses,
 e.g. from WordNet.


 The SF spotting pipeline works a bit like this: you get a candidate SF,
 like "Portland Place", and see if there is a candidate for it, but you
 also consider n-gram subparts, so it could have retrieved the candidates
 associated with "place" instead.
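
 A rough sketch of that n-gram fallback (the store contents and names are
 invented for the example): the full span is checked against the surface
 form store, but so are its sub-n-grams, so a match on "place" alone can
 surface when the longer span is missing or loses out:

    object NGramSpotterSketch {
      // toy stand-in for the surface form store
      val surfaceFormStore: Set[String] =
        Set("portland place", "place", "berlin")

      // all sub-n-grams of the span that exist in the store, longest first
      def candidates(span: String): Seq[String] = {
        val tokens = span.toLowerCase.split("\\s+").toSeq
        for {
          n    <- tokens.length to 1 by -1
          gram <- tokens.sliding(n).map(_.mkString(" "))
          if surfaceFormStore.contains(gram)
        } yield gram
      }

      def main(args: Array[String]): Unit =
        println(candidates("Portland Place")) // List(portland place, place)
    }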


 I understand what you said, but here I wanted to point out that "place"
 is not even used as a noun in this sentence (it occurs in the verb
 phrase "took place"), yet we are trying to associate it with a Named
 Entity, which is a noun.





 3) We are detecting ". Berlin" as a surface form, but I could not find
 out where this SF comes from, and I suspect it doesn't come from
 Wikipedia.


 Although ". Berlin" is highlighted, the entity is matched on "Berlin";
 the extra space and punctuation come from the way we tokenize sentences.
 We have chosen a language-independent tokenizer based on a break
 iterator, for speed and language independence, but it hasn't been tested
 very well. This is the area that explains this mistake, and help with it
 is much appreciated.
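
 For a concrete picture, here is a small demo of break-iterator
 tokenization using the standard java.text.BreakIterator; it only
 illustrates the mechanism and is not Spotlight's actual tokenizer class:

    import java.text.BreakIterator
    import java.util.Locale

    object BreakIteratorDemo {
      // every span between consecutive word boundaries, including
      // punctuation and whitespace runs such as ". "
      def spans(text: String): Seq[String] = {
        val it = BreakIterator.getWordInstance(Locale.ROOT)
        it.setText(text)
        val buf = Seq.newBuilder[String]
        var start = it.first()
        var end = it.next()
        while (end != BreakIterator.DONE) {
          buf += text.substring(start, end)
          start = end
          end = it.next()
        }
        buf.result()
      }

      def main(args: Array[String]): Unit =
        println(spans("in the world. Berlin in the 1920s"))
        // punctuation spans like ". " sit between the words, so naively
        // pairing boundaries back into surface forms can glue them to "Berlin"
    }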


 Thanks for clarification.





 4) We spotted "capital of Germany", but I didn't get any candidates when
 running /candidates instead of /annotate.


 This might be due to the default confidence score. If you pass the extra
 confidence param and set it to 0, you will probably see everything, e.g.
 /candidates/?confidence=0&text=...
 In fact, I suggest you look at all the candidates in the text you used,
 to confirm (or not) what I've been saying here.


 I tried that, but I still didn't get any entity candidate for "capital
 of Germany".




 5) We are able to spot 1920s as a surface form but not 1920.


 This is due to the generation/stemming of SFs we have been discussing,
 but I'm not sure that is a bad example: "1920", if used as a year, might
 not mean the same as "1920s".


 This was my mistake.





 A few more questions:
 1) Are we trying to annotate every word, noun, or entity (e.g. proper
 noun) in the raw text? I ask because in the above link I found
 "documented" (a word that is neither a noun nor an entity) annotated
 with http://dbpedia.org/resource/Document.


 There are two main spotters: the default one uses a finite state
 automaton generated from the surface form store to match incoming words
 as valid sequences of states (so in this sense everything goes through
 the pipeline), and another that uses
[Dbpedia-gsoc] Fwd: Contribute to DbPedia

2015-03-17 Thread Thiago Galery
-- Forwarded message --
From: Thiago Galery tgal...@gmail.com
Date: Tue, Mar 17, 2015 at 11:29 AM
Subject: Re: [Dbpedia-gsoc] Contribute to DbPedia
To: Abhishek Gupta a.gu...@gmail.com


Hi Abhishek, thanks for the work, here are some answers:

On Tue, Mar 17, 2015 at 9:10 AM, Abhishek Gupta a.gu...@gmail.com wrote:

 Hi Thiago,

 Sorry for the delay!
 I have set up the Spotlight server and it is running perfectly fine,
 albeit with minimal settings. After this setup I played with the
 Spotlight server, during which I came across some discrepancies, as
 follows:

 Example taken:
 http://spotlight.dbpedia.org/rest/annotate?text=First documented in the
 13th century, Berlin was the capital of the Kingdom of Prussia (1701–1918),
 the German Empire (1871–1918), the Weimar Republic (1919–33) and the Third
 Reich (1933–45). Berlin in the 1920s was the third largest municipality in
 the world. In 1990 German reunification took place in whole Germany in
 which the city regained its status as the capital of Germany.

 1) If we run this, we annotate "13th Century" with
 http://dbpedia.org/page/19th_century. This might be happening because
 the context is very much about the 19th century, and moreover between
 "13th Century" and "19th Century" there is minimal surface difference
 (one character). But I am not sure whether this is good or bad.


This might be due either to "13th Century" being wrongly linked to 19th
century, or maybe to the word "century" being linked to many different
centuries, which then causes a disambiguation error due to the context. I
think your example is a counter-example to the way we generate the data
structures used for disambiguation.


 In my opinion, if we have an entity in our store (
 http://dbpedia.org/page/13th_century) that perfectly matches a surface
 form in the raw text ("13th Century"), we should annotate the SF with
 that entity.
 And the same might be the case with "Germany", which is associated with
 History_of_Germany (http://dbpedia.org/page/History_of_Germany) rather
 than Germany (http://dbpedia.org/page/Germany).


In this case other factors might have crept in; it could be that Germany
has a bigger number of inlinks, or some other metric allows it to
overtake the most natural candidate.



 2) We are spotting "place" and associating it with Portland Place
 (http://dbpedia.org/resource/Portland_Place), maybe due to stemming of
 the SF. And even Location (geography)
 (http://dbpedia.org/page/Location_(geography)) is not the correct entity
 type for this. This happens because we are not able to detect the sense
 of the word "place" itself. For that we may have to use word senses,
 e.g. from WordNet.


The SF spotting pipeline works a bit like this: you get a candidate SF,
like "Portland Place", and see if there is a candidate for it, but you also
consider n-gram subparts, so it could have retrieved the candidates
associated with "place" instead.



 3) We are detecting ". Berlin" as a surface form, but I could not find
 out where this SF comes from, and I suspect it doesn't come from
 Wikipedia.


Although ". Berlin" is highlighted, the entity is matched on "Berlin"; the
extra space and punctuation come from the way we tokenize sentences. We
have chosen a language-independent tokenizer based on a break iterator,
for speed and language independence, but it hasn't been tested very well.
This is the area that explains this mistake, and help with it is much
appreciated.



 4) We spotted "capital of Germany", but I didn't get any candidates when
 running /candidates instead of /annotate.


This might be due to the default confidence score. If you pass the extra
confidence param and set it to 0, you will probably see everything, e.g.
/candidates/?confidence=0&text=...
In fact, I suggest you look at all the candidates in the text you used, to
confirm (or not) what I've been saying here.



 5) We are able to spot 1920s as a surface form but not 1920.


This is due to the generation/stemming of SFs we have been discussing, but
I'm not sure that is a bad example: "1920", if used as a year, might not
mean the same as "1920s".



 A few more questions:
 1) Are we trying to annotate every word, noun, or entity (e.g. proper
 noun) in the raw text? I ask because in the above link I found
 "documented" (a word that is neither a noun nor an entity) annotated
 with http://dbpedia.org/resource/Document.


There are two main spotters: the default one uses a finite state
automaton generated from the surface form store to match incoming words as
valid sequences of states (so in this sense everything goes through the
pipeline), and another that uses an OpenNLP spotter which gets SFs from an
NE extractor. Both might generate single-noun n-grams. In this case, it
could be that there is a link in Wikipedia from "documented" to Document,
which might introduce "documented" as a valid state in the FSA.
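
A toy illustration of that FSA idea (the data and names are invented; this
is not the Spotlight implementation): surface forms are compiled into a
token trie whose accepting states mark valid surface forms, and the spotter
scans the text emitting the longest match at each position. A wiki link
documented -> Document would make "documented" an accepting state, which is
exactly how a plain word gets spotted:

    import scala.collection.mutable

    object FsaSpotterSketch {
      final class Node {
        val children = mutable.Map.empty[String, Node]
        var accepting = false
      }

      private val root = new Node

      // compile a surface form into the token trie
      def add(surfaceForm: String): Unit = {
        var node = root
        for (tok <- surfaceForm.toLowerCase.split("\\s+"))
          node = node.children.getOrElseUpdate(tok, new Node)
        node.accepting = true
      }

      // longest accepting match starting at token i, as an end index
      private def matchAt(tokens: Array[String], i: Int): Option[Int] = {
        var node = root
        var best = -1
        var j = i
        while (j < tokens.length && node.children.contains(tokens(j).toLowerCase)) {
          node = node.children(tokens(j).toLowerCase)
          j += 1
          if (node.accepting) best = j
        }
        if (best >= 0) Some(best) else None
      }

      def spot(text: String): Seq[String] = {
        val tokens = text.split("\\s+")
        val out = Seq.newBuilder[String]
        var i = 0
        while (i < tokens.length) {
          matchAt(tokens, i) match {
            case Some(end) =>
              out += tokens.slice(i, end).mkString(" ")
              i = end
            case None =>
              i += 1
          }
        }
        out.result()
      }

      def main(args: Array[String]): Unit = {
        Seq("documented", "berlin", "kingdom of prussia").foreach(add)
        println(spot("First documented in the 13th century Berlin was capital of the Kingdom of Prussia"))
        // List(documented, berlin, kingdom of prussia)
      }
    }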


 2) Are we using surface forms to deal only with syntactic references
 (e.g. the surface form "municipality" referring to Municipality
 (http://dbpedia.org/page/Municipality) or Metropolitan_municipality