Re: [Dbpedia-gsoc] Fwd: Contribute to DbPedia

2015-04-28 Thread David Przybilla
Hi Abhishek,

You are free to contribute :) I will try to keep on reviewing PRs
if that is alright.



On Tue, Apr 28, 2015 at 7:47 AM, Abhishek Gupta a.gu...@gmail.com wrote:

 Hi all,

 My proposal has not been selected for GSoC, but I still want to
 continue with my project. So could someone provide some guidelines
 (if I may continue)?

 Thanks,
 Abhishek

 On Thu, Apr 9, 2015 at 11:53 PM, Abhishek Gupta a.gu...@gmail.com wrote:

 Hi Thiago,

 Thanks for your reply and assurance.
 Moreover, I replied to your question about the extraction framework, and I
 have also created an issue about using bold instances as probable
 surface forms here:
 https://github.com/dbpedia-spotlight/dbpedia-spotlight/issues/353.

 Thanks,
 Abhishek

 On Thu, Apr 9, 2015 at 1:19 AM, Thiago Galery tgal...@gmail.com wrote:

 Hi Abhishek,
 sorry for taking so long to write to you. Things at work have been
 really busy. Regarding the issue you raised about the originality of your
 proposal, rest assured that no one has sent a proposal similar to yours.

 I'm happy that you sent a PR for the extraction framework. It seems that
 Dimitris is already taking a look at it.
 As for your suggestions for Spotlight, simply removing the stopword filter
 is something I wouldn't really advise, because I remember getting a lot
 of noise once. Maybe it should be modified somehow instead. If you have a
 good idea and want to send a PR, it would be very welcome. I think
 discussing things on GitHub would be better.

 All the best,
 Thiago

 On Mon, Apr 6, 2015 at 6:15 AM, Abhishek Gupta a.gu...@gmail.com
 wrote:

 Hi all,

 Recently I was checking out the indexing process of dbpedia-spotlight,
 and I observed a few things:

 1) There is a missing constructor definition in the WikiPage object (
 https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/sources/WikiPage.scala)
 for the instance created in the function wikiPageCopy here:
 https://github.com/dbpedia-spotlight/dbpedia-spotlight/blob/master/index/src/main/scala/org/dbpedia/spotlight/io/DisambiguationContextSource.scala#L67.
 For this I have created a PR:
 https://github.com/dbpedia/extraction-framework/pull/377 (a constructor
 sketch follows after this list).

 2) For the stopword filter defined here (
 https://github.com/dbpedia-spotlight/dbpedia-spotlight/blob/master/index/src/main/scala/org/dbpedia/spotlight/util/ExtractCandidateMap.scala#L186),
 I analysed the conceptURI extraction against the stopword list here:
 http://wifo5-04.informatik.uni-mannheim.de/downloads/release-0.4/stopwords.en.list.
 The analysis shows that we are discarding around 25,481 entities, almost
 all of them from important categories like music, film, band, etc., e.g.
 Am_(musician) (http://en.wikipedia.org/wiki/AM_(musician)),
 Home_(2015_film) (http://en.wikipedia.org/wiki/Home_(2015_film)) and
 The_Who (http://en.wikipedia.org/wiki/The_Who). Even if we do
 case-sensitive checking (i.e. requiring an entity to contain more than one
 capital letter, since the first one is capitalised by default), we will
 still reject single-word entities like Am, Home, etc. Moreover, the
 garbage ("can't", etc.) we would let in by removing this filter wouldn't
 amount to much. So I suggest we remove this filter (a filter sketch
 follows after this list).

 3) I would like to suggest an additional surface form extraction: if we
 extract the bold text in the first line of a Wikipedia article, we can use
 it as a probable surface form for that entity. E.g. Stanford_University
 (http://en.wikipedia.org/wiki/Stanford_University), Aon_(company)
 (http://en.wikipedia.org/wiki/Aon_%28company%29), Radio_Warwick
 (http://en.wikipedia.org/wiki/Radio_Warwick) and Phi_Gamma_Delta
 (http://en.wikipedia.org/wiki/Phi_Gamma_Delta). These are often the best
 surface forms for the respective entities (an extraction sketch follows
 after this list).
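
 A minimal sketch of the constructor gap described in 1). The field names
 and types here are illustrative assumptions, not the real WikiPage
 signature (see the file linked above for that):

    // Illustrative stand-in for the real WikiTitle class
    case class WikiTitle(decoded: String)

    // Hypothetical WikiPage: the primary constructor carries full revision
    // metadata; a copy helper like wikiPageCopy expects a shorter overload.
    class WikiPage(val title: WikiTitle,
                   val redirect: WikiTitle,
                   val id: Long,
                   val revision: Long,
                   val source: String) {

      // The kind of convenience constructor whose absence breaks the caller:
      // forwards to the primary constructor with neutral defaults.
      def this(title: WikiTitle, source: String) =
        this(title, null, -1L, -1L, source)
    }

    // new WikiPage(WikiTitle("Berlin"), "...markup...") now compiles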
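
 A hedged sketch of the "modify rather than remove" option for the stopword
 filter in 2): drop a candidate surface form only when it is a stopword AND
 lowercase, so title-cased entities like Am, Home or The_Who survive. The
 object name and the tiny stopword set are made up for the example:

    object StopwordFilterSketch {
      // a few entries standing in for stopwords.en.list
      val stopwords: Set[String] = Set("am", "home", "the", "can't")

      // keep the surface form unless its first token is a lowercase stopword
      def keepSurfaceForm(sf: String): Boolean = {
        val head = sf.split("[_ ]").head
        val isStopword = stopwords.contains(head.toLowerCase)
        val looksTitleCased = head.headOption.exists(_.isUpper)
        !isStopword || looksTitleCased
      }

      def main(args: Array[String]): Unit =
        Seq("Am", "Home", "The_Who", "can't").foreach { sf =>
          println(s"$sf -> keep = ${keepSurfaceForm(sf)}")
        }
    }

 With this rule, "Am", "Home" and "The_Who" are kept while lowercase
 garbage like "can't" is still filtered out.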
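
 And a sketch of the bold-text heuristic in 3): in wiki markup, the
 defining phrase of an article is conventionally wrapped in triple quotes
 ('''...''') in the first sentence, so extracting those spans yields
 high-precision surface forms. This is only an illustration of the idea,
 not the implementation proposed in the issue:

    object BoldSurfaceForms {
      private val Bold = """'''(.+?)'''""".r

      // bold spans from the first non-empty line of the article markup
      def fromFirstLine(wikiText: String): Seq[String] = {
        val firstLine = wikiText.split("\n").find(_.trim.nonEmpty).getOrElse("")
        Bold.findAllMatchIn(firstLine).map(_.group(1)).toSeq
      }

      def main(args: Array[String]): Unit = {
        val markup = "'''Stanford University''' is a private research university ..."
        println(fromFirstLine(markup)) // List(Stanford University)
      }
    }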

 Thanks,
 Abhishek

 On Fri, Mar 27, 2015 at 11:56 AM, Abhishek Gupta a.gu...@gmail.com
 wrote:

 Hi all,

 I would also like to mention that in one of the recent mails my proposal
 went public, when Thiago accidentally sent a mail to both me and the
 dbpedia-gsoc mailing list. Details of the mail are below. The Google Docs
 link was in the quoted text, and the doc could be seen and even edited by
 anyone with that link, although nobody has changed its content. Still, I
 believe there is a chance that someone could copy my ideas, so I request
 you to take care of this issue, and I hope it does not affect my
 application.
 As of now I have changed the sharing settings, so please let me know if
 there is any access problem.

 *Mail details:*
 from: Thiago Galery tgal...@gmail.com
 to: Abhishek Gupta a.gu...@gmail.com,
 dbpedia-gsoc dbpedia-gsoc@lists.sourceforge.net
 date: Tue, Mar 24, 2015 at 3:47 AM
 subject: Re: [Dbpedia-gsoc] Fwd: Contribute to DbPedia

 I have also modified the Candidate Entity Scoring methodology in my
 proposal. Please take a look at it.
 GSoC proposal link:
 https://www.google-melange.com/gsoc/proposal/review/student/google/gsoc2015/abhishek_g/5629499534213120
 Google Docs Link:
 https://docs.google.com/document/d

Re: [Dbpedia-gsoc] Fwd: Contribute to DbPedia

2015-04-28 Thread Abhishek Gupta
Hi all,

My proposal has not been selected for GSoC, but I still want to continue
with my project. So could someone provide some guidelines (if I may
continue)?

Thanks,
Abhishek

On Thu, Apr 9, 2015 at 11:53 PM, Abhishek Gupta a.gu...@gmail.com wrote:

 Hi Thiago,

 Thanks for your reply and assurance.
 Moreover, I replied to your question about the extraction framework, and I
 have also created an issue about using bold instances as probable
 surface forms here:
 https://github.com/dbpedia-spotlight/dbpedia-spotlight/issues/353.

 Thanks,
 Abhishek

 On Thu, Apr 9, 2015 at 1:19 AM, Thiago Galery tgal...@gmail.com wrote:

 Hi Abhishek,
 sorry for taking so long to write to you. Things at work have been really
 busy. Regarding the issue you raised about the originality of your
 proposal, rest assured that no one has sent a proposal similar to yours.

 I'm happy that you sent a PR for the extraction framework. It seems that
 Dimitris is already taking a look at it.
 As for your suggestions for Spotlight, simply removing the stopword filter
 is something I wouldn't really advise, because I remember getting a lot
 of noise once. Maybe it should be modified somehow instead. If you have a
 good idea and want to send a PR, it would be very welcome. I think
 discussing things on GitHub would be better.

 All the best,
 Thiago

 On Mon, Apr 6, 2015 at 6:15 AM, Abhishek Gupta a.gu...@gmail.com wrote:

 Hi all,

 Recently I was checking out the indexing process of dbpedia-spotlight,
 and I observed a few things:

 1) There is a missing constructor definition in the WikiPage object (
 https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/sources/WikiPage.scala)
 for the instance created in the function wikiPageCopy here:
 https://github.com/dbpedia-spotlight/dbpedia-spotlight/blob/master/index/src/main/scala/org/dbpedia/spotlight/io/DisambiguationContextSource.scala#L67.
 For this I have created a PR:
 https://github.com/dbpedia/extraction-framework/pull/377

 2) For the stopword filter defined here (
 https://github.com/dbpedia-spotlight/dbpedia-spotlight/blob/master/index/src/main/scala/org/dbpedia/spotlight/util/ExtractCandidateMap.scala#L186),
 I analysed the conceptURI extraction against the stopword list here:
 http://wifo5-04.informatik.uni-mannheim.de/downloads/release-0.4/stopwords.en.list.
 The analysis shows that we are discarding around 25,481 entities, almost
 all of them from important categories like music, film, band, etc., e.g.
 Am_(musician) (http://en.wikipedia.org/wiki/AM_(musician)),
 Home_(2015_film) (http://en.wikipedia.org/wiki/Home_(2015_film)) and
 The_Who (http://en.wikipedia.org/wiki/The_Who). Even if we do
 case-sensitive checking (i.e. requiring an entity to contain more than one
 capital letter, since the first one is capitalised by default), we will
 still reject single-word entities like Am, Home, etc. Moreover, the
 garbage ("can't", etc.) we would let in by removing this filter wouldn't
 amount to much. So I suggest we remove this filter.

 3) I would like to suggest an additional surface form extraction: if we
 extract the bold text in the first line of a Wikipedia article, we can use
 it as a probable surface form for that entity. E.g. Stanford_University
 (http://en.wikipedia.org/wiki/Stanford_University), Aon_(company)
 (http://en.wikipedia.org/wiki/Aon_%28company%29), Radio_Warwick
 (http://en.wikipedia.org/wiki/Radio_Warwick) and Phi_Gamma_Delta
 (http://en.wikipedia.org/wiki/Phi_Gamma_Delta). These are often the best
 surface forms for the respective entities.

 Thanks,
 Abhishek

 On Fri, Mar 27, 2015 at 11:56 AM, Abhishek Gupta a.gu...@gmail.com
 wrote:

 Hi all,

 I would also like to mention that in one of the recent mails my proposal
 went public, when Thiago accidentally sent a mail to both me and the
 dbpedia-gsoc mailing list. Details of the mail are below. The Google Docs
 link was in the quoted text, and the doc could be seen and even edited by
 anyone with that link, although nobody has changed its content. Still, I
 believe there is a chance that someone could copy my ideas, so I request
 you to take care of this issue, and I hope it does not affect my
 application.
 As of now I have changed the sharing settings, so please let me know if
 there is any access problem.

 *Mail details:*
 from: Thiago Galery tgal...@gmail.com
 to: Abhishek Gupta a.gu...@gmail.com,
 dbpedia-gsoc dbpedia-gsoc@lists.sourceforge.net
 date: Tue, Mar 24, 2015 at 3:47 AM
 subject: Re: [Dbpedia-gsoc] Fwd: Contribute to DbPedia

 I have also modified the Candidate Entity Scoring methodology in my
 proposal. Please take a look at it.
 GSoC proposal link:
 https://www.google-melange.com/gsoc/proposal/review/student/google/gsoc2015/abhishek_g/5629499534213120
 Google Docs Link:
 https://docs.google.com/document/d/1U4BvJpGUvL2odVA6VxnYggfEX_hmLSYP4yqhXB7dLQU/edit

 Moreover, I would like to ask one more question, which might help me in
 modelling the problem. In the example texts below, which entity would you

Re: [Dbpedia-gsoc] Fwd: Contribute to DbPedia

2015-03-23 Thread Thiago Galery
Hi Abhishek, I suggest you submit a proposal straight to GSoC and we
comment there. If you have done so already, could you send us the link?
All the best,
Thiago

On Mon, Mar 23, 2015 at 8:54 AM, Abhishek Gupta a.gu...@gmail.com wrote:

 Hi all,

 Here are some comments on your responses:


 Hi Abhishek, thanks for the work, here are some answers:

 On Tue, Mar 17, 2015 at 9:10 AM, Abhishek Gupta a.gu...@gmail.com
 wrote:

 Hi Thiago,

 Sorry for the delay!
 I have set up the Spotlight server and it is running perfectly fine,
 albeit with minimal settings. After this setup I played with the
 Spotlight server, during which I came across some discrepancies, as
 follows:

 Example taken:
 http://spotlight.dbpedia.org/rest/annotate?text=First documented in the
 13th century, Berlin was the capital of the Kingdom of Prussia (1701–1918),
 the German Empire (1871–1918), the Weimar Republic (1919–33) and the Third
 Reich (1933–45). Berlin in the 1920s was the third largest municipality in
 the world. In 1990 German reunification took place in whole Germany in
 which the city regained its status as the capital of Germany.

 1) If we run this, we annotate "13th Century" with
 http://dbpedia.org/page/19th_century. This might be happening because
 the context is very much about the 19th century, and moreover between
 "13th Century" and "19th Century" there is minimal surface difference
 (one character). But I am not sure whether this is good or bad.


 This might be due either to "13th Century" being wrongly linked to 19th
 century, or maybe to the word "century" being linked to many different
 centuries, which then causes a disambiguation error due to the context. I
 think your example is a counter-example to the way we generate the data
 structures used for disambiguation.


 In my opinion, if we have an entity in our store (
 http://dbpedia.org/page/13th_century) that perfectly matches a surface
 form in the raw text ("13th Century"), we should annotate the SF with
 that entity.
 And the same might be the case with "Germany", which is associated with
 History_of_Germany (http://dbpedia.org/page/History_of_Germany) rather
 than Germany (http://dbpedia.org/page/Germany).


 In this case other factors might have crept in; it could be that Germany
 has a bigger number of inlinks, or some other metric allows it to
 overtake the most natural candidate.



 2) We are spotting "place" and associating it with Portland Place
 (http://dbpedia.org/resource/Portland_Place), maybe due to stemming of
 the SF. And even Location (geography)
 (http://dbpedia.org/page/Location_(geography)) is not the correct entity
 type for this. This happens because we are not able to detect the sense
 of the word "place" itself. For that we may have to use word senses,
 e.g. from WordNet.


 The SF spotting pipeline works a bit like this: you get a candidate SF,
 like "Portland Place", and see if there is a candidate for it, but you
 also consider n-gram subparts, so it could have retrieved the candidates
 associated with "place" instead.
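
 A rough sketch of that n-gram fallback (the store contents and names are
 invented for the example): the full span is checked against the surface
 form store, but so are its sub-n-grams, so a match on "place" alone can
 surface when the longer span is missing or loses out:

    object NGramSpotterSketch {
      // toy stand-in for the surface form store
      val surfaceFormStore: Set[String] =
        Set("portland place", "place", "berlin")

      // all sub-n-grams of the span that exist in the store, longest first
      def candidates(span: String): Seq[String] = {
        val tokens = span.toLowerCase.split("\\s+").toSeq
        for {
          n    <- tokens.length to 1 by -1
          gram <- tokens.sliding(n).map(_.mkString(" "))
          if surfaceFormStore.contains(gram)
        } yield gram
      }

      def main(args: Array[String]): Unit =
        println(candidates("Portland Place")) // List(portland place, place)
    }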


 I understand what you said, but here I wanted to point out that "place"
 is not even used as a noun in this sentence (it occurs in the verb
 phrase "took place"), yet we are trying to associate it with a Named
 Entity, which is a noun.





 3) We are detecting ". Berlin" as a surface form, but I could not find
 out where this SF comes from, and I suspect it doesn't come from
 Wikipedia.


 Although ". Berlin" is highlighted, the entity is matched on "Berlin";
 the extra space and punctuation come from the way we tokenize sentences.
 We have chosen a language-independent tokenizer based on a break
 iterator, for speed and language independence, but it hasn't been tested
 very well. This is the area that explains this mistake, and help with it
 is much appreciated.
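
 For a concrete picture, here is a small demo of break-iterator
 tokenization using the standard java.text.BreakIterator; it only
 illustrates the mechanism and is not Spotlight's actual tokenizer class:

    import java.text.BreakIterator
    import java.util.Locale

    object BreakIteratorDemo {
      // every span between consecutive word boundaries, including
      // punctuation and whitespace runs such as ". "
      def spans(text: String): Seq[String] = {
        val it = BreakIterator.getWordInstance(Locale.ROOT)
        it.setText(text)
        val buf = Seq.newBuilder[String]
        var start = it.first()
        var end = it.next()
        while (end != BreakIterator.DONE) {
          buf += text.substring(start, end)
          start = end
          end = it.next()
        }
        buf.result()
      }

      def main(args: Array[String]): Unit =
        println(spans("in the world. Berlin in the 1920s"))
        // punctuation spans like ". " sit between the words, so naively
        // pairing boundaries back into surface forms can glue them to "Berlin"
    }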


 Thanks for clarification.





 4) We spotted "capital of Germany", but I didn't get any candidates when
 running /candidates instead of /annotate.


 This might be due to the default confidence score. If you pass the extra
 confidence param and set it to 0, you will probably see everything, e.g.
 /candidates/?confidence=0&text=...
 In fact, I suggest you look at all the candidates in the text you used,
 to confirm (or not) what I've been saying here.


 I tried that, but I still didn't get any entity candidate for "capital
 of Germany".




 5) We are able to spot 1920s as a surface form but not 1920.


 This is due to the generation/stemming of SFs we have been discussing,
 but I'm not sure that is a bad example: "1920", if used as a year, might
 not mean the same as "1920s".


 This was my mistake.





 A few more questions:
 1) Are we trying to annotate every word, noun, or entity (e.g. proper
 noun) in the raw text? I ask because in the above link I found
 "documented" (a word that is neither a noun nor an entity) annotated
 with http://dbpedia.org/resource/Document.


 There are two main spotters: the default one uses a finite state
 automaton generated from the surface form store to match incoming words
 as valid sequences of states (so in this sense everything goes through
 the pipeline), and another that uses
[Dbpedia-gsoc] Fwd: Contribute to DbPedia

2015-03-17 Thread Thiago Galery
-- Forwarded message --
From: Thiago Galery tgal...@gmail.com
Date: Tue, Mar 17, 2015 at 11:29 AM
Subject: Re: [Dbpedia-gsoc] Contribute to DbPedia
To: Abhishek Gupta a.gu...@gmail.com


Hi Abhishek, thanks for the work, here are some answers:

On Tue, Mar 17, 2015 at 9:10 AM, Abhishek Gupta a.gu...@gmail.com wrote:

 Hi Thiago,

 Sorry for the delay!
 I have set up the Spotlight server and it is running perfectly fine,
 albeit with minimal settings. After this setup I played with the
 Spotlight server, during which I came across some discrepancies, as
 follows:

 Example taken:
 http://spotlight.dbpedia.org/rest/annotate?text=First documented in the
 13th century, Berlin was the capital of the Kingdom of Prussia (1701–1918),
 the German Empire (1871–1918), the Weimar Republic (1919–33) and the Third
 Reich (1933–45). Berlin in the 1920s was the third largest municipality in
 the world. In 1990 German reunification took place in whole Germany in
 which the city regained its status as the capital of Germany.

 1) If we run this, we annotate "13th Century" with
 http://dbpedia.org/page/19th_century. This might be happening because
 the context is very much about the 19th century, and moreover between
 "13th Century" and "19th Century" there is minimal surface difference
 (one character). But I am not sure whether this is good or bad.


This might be due either to "13th Century" being wrongly linked to 19th
century, or maybe to the word "century" being linked to many different
centuries, which then causes a disambiguation error due to the context. I
think your example is a counter-example to the way we generate the data
structures used for disambiguation.


 In my opinion, if we have an entity in our store (
 http://dbpedia.org/page/13th_century) that perfectly matches a surface
 form in the raw text ("13th Century"), we should annotate the SF with
 that entity.
 And the same might be the case with "Germany", which is associated with
 History_of_Germany (http://dbpedia.org/page/History_of_Germany) rather
 than Germany (http://dbpedia.org/page/Germany).


In this case other factors might have crept in; it could be that Germany
has a bigger number of inlinks, or some other metric allows it to
overtake the most natural candidate.



 2) We are spotting "place" and associating it with Portland Place
 (http://dbpedia.org/resource/Portland_Place), maybe due to stemming of
 the SF. And even Location (geography)
 (http://dbpedia.org/page/Location_(geography)) is not the correct entity
 type for this. This happens because we are not able to detect the sense
 of the word "place" itself. For that we may have to use word senses,
 e.g. from WordNet.


The SF spotting pipeline works a bit like this: you get a candidate SF,
like "Portland Place", and see if there is a candidate for it, but you also
consider n-gram subparts, so it could have retrieved the candidates
associated with "place" instead.



 3) We are detecting ". Berlin" as a surface form, but I could not find
 out where this SF comes from, and I suspect it doesn't come from
 Wikipedia.


Although ". Berlin" is highlighted, the entity is matched on "Berlin"; the
extra space and punctuation come from the way we tokenize sentences. We
have chosen a language-independent tokenizer based on a break iterator,
for speed and language independence, but it hasn't been tested very well.
This is the area that explains this mistake, and help with it is much
appreciated.



 4) We spotted "capital of Germany", but I didn't get any candidates when
 running /candidates instead of /annotate.


This might be due to the default confidence score. If you pass the extra
confidence param and set it to 0, you will probably see everything, e.g.
/candidates/?confidence=0&text=...
In fact, I suggest you look at all the candidates in the text you used, to
confirm (or not) what I've been saying here.



 5) We are able to spot 1920s as a surface form but not 1920.


This is due to the generation/stemming of SFs we have been discussing, but
I'm not sure that is a bad example: "1920", if used as a year, might not
mean the same as "1920s".



 A few more questions:
 1) Are we trying to annotate every word, noun, or entity (e.g. proper
 noun) in the raw text? I ask because in the above link I found
 "documented" (a word that is neither a noun nor an entity) annotated
 with http://dbpedia.org/resource/Document.


There are two main spotters: the default one uses a finite state
automaton generated from the surface form store to match incoming words as
valid sequences of states (so in this sense everything goes through the
pipeline), and another that uses an OpenNLP spotter which gets SFs from an
NE extractor. Both might generate single-noun n-grams. In this case, it
could be that there is a link in Wikipedia from "documented" to Document,
which might introduce "documented" as a valid state in the FSA.
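
A toy illustration of that FSA idea (the data and names are invented; this
is not the Spotlight implementation): surface forms are compiled into a
token trie whose accepting states mark valid surface forms, and the spotter
scans the text emitting the longest match at each position. A wiki link
documented -> Document would make "documented" an accepting state, which is
exactly how a plain word gets spotted:

    import scala.collection.mutable

    object FsaSpotterSketch {
      final class Node {
        val children = mutable.Map.empty[String, Node]
        var accepting = false
      }

      private val root = new Node

      // compile a surface form into the token trie
      def add(surfaceForm: String): Unit = {
        var node = root
        for (tok <- surfaceForm.toLowerCase.split("\\s+"))
          node = node.children.getOrElseUpdate(tok, new Node)
        node.accepting = true
      }

      // longest accepting match starting at token i, as an end index
      private def matchAt(tokens: Array[String], i: Int): Option[Int] = {
        var node = root
        var best = -1
        var j = i
        while (j < tokens.length && node.children.contains(tokens(j).toLowerCase)) {
          node = node.children(tokens(j).toLowerCase)
          j += 1
          if (node.accepting) best = j
        }
        if (best >= 0) Some(best) else None
      }

      def spot(text: String): Seq[String] = {
        val tokens = text.split("\\s+")
        val out = Seq.newBuilder[String]
        var i = 0
        while (i < tokens.length) {
          matchAt(tokens, i) match {
            case Some(end) =>
              out += tokens.slice(i, end).mkString(" ")
              i = end
            case None =>
              i += 1
          }
        }
        out.result()
      }

      def main(args: Array[String]): Unit = {
        Seq("documented", "berlin", "kingdom of prussia").foreach(add)
        println(spot("First documented in the 13th century Berlin was capital of the Kingdom of Prussia"))
        // List(documented, berlin, kingdom of prussia)
      }
    }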


 2) Are we using surface forms to deal only with syntactic references
 (e.g. the surface form "municipality" referring to Municipality
 (http://dbpedia.org/page/Municipality) or Metropolitan_municipality