Re: [Dbpedia-gsoc] GSoC introduction

2015-03-17 Thread Dimitris Kontokostas
Hi Prateek,

On Wed, Mar 11, 2015 at 10:44 PM, Prateek Saxena <
prateeksaxena2...@gmail.com> wrote:

> Hello Jens,
>
> Sorry for the delay in getting back to you. Thanks for your clarifications.
>
> From what I understand, the first task is to port the DL-Learner plugin
> for Protege to webprotege. While researching methods to create plugins for
> webprotege, I stumbled upon the webprotege github page. The readme mentions
> that in order to create a plugin for webprotege, you might need to build
> webprotege on a local machine. However, there is a dbpedia ontology project
> on the stanford hosted solution. Is the future plan to continue using the
> stanford hosted solution or to host webprotege on dbpedia's own servers?
>

The plan is to use our own servers, and we have developed some add-on plugins
to keep the mappings wiki in sync with web-protege.


>
> I was able to find detailed instructions for building plugins in protege
> but could not find an equivalent resource for webprotege. Could you suggest
> a resource or a warm-up task to begin with?
>
> Regards,
> Prateek
>
>
> On Mon, Mar 9, 2015 at 4:18 PM, Jens Lehmann <
> lehm...@informatik.uni-leipzig.de> wrote:
>
>>
>> Dear Prateek,
>>
>> Am 03.03.2015 um 22:14 schrieb Prateek Saxena:
>> > Hello,
>> >
>> > My name is Prateek Saxena and I am pursuing my M.S. in the domain of
>> > 'Natural Language Processing' and 'semantic web' from IIIT Hyderabad. I
>> > have been working in the domain of ontologies and feature extraction and
>> > am familiar with creation of knowledge bases.
>> >
>> > While going through the list of GSoC 2015 DBpedia ideas, I found the
>> > following projects interesting and also the best use of my skillset.
>> >
>> > 5.11- DBpedia Schema Enrichment on Web Protege
>> > 5.13- Aligning Life-Science Ontologies to Dbpedia
>>
>> Thanks for your interest in the projects.
>>
>> > On the basis of my understanding of the projects from the description of
>> > the project ideas (and the provided resources), I have a few queries.
>> > Could you kindly answer these queries or bring to light any
>> > discrepancies in my understanding?
>> >
>> > 1. Parsing - The paper mentions that parsing has been done using wiki
>> > parser. I am unsure whether the parsing results in creating only
>> > syntactic links or does it result in extracting some shallow semantic
>> > dependencies(eg. nsubj) as well.
>> >
>> > 2. Feature Extraction:- The paper does have a few examples of features
>> > from raw infobox but does Dbpedia already have a closed list of features
>> > or does the project entail only creating groups based on semantic
>> > similarity. Because if the latter is true, the list shall be an open
>> > class and the usability shall account for non-completeness.
>>
>> Could you be more specific on the paper you are referring to? I assume
>> you mean http://jens-lehmann.org/files/2014/swj_dbpedia.pdf.
>>
>> For the enrichment topic (5.11) you would actually not need parsing and
>> feature extraction, as the task would be to develop a DL-Learner plugin
>> for Web Protégé, not to interact directly with the extraction part of the
>> framework.
>>
>> For the life science extraction, we would still have to see how and
>> whether extraction framework modifications are necessary. A large part
>> of the alignment to life science ontologies can be done via the mappings
>> wiki and in postprocessing via link discovery (using LIMES -
>> http://aksw.org/Projects/LIMES.html). You can also ask general DBpedia
>> developer questions (and those seem to be general) in the developers
>> list: https://sourceforge.net/p/dbpedia/mailman/.
>>
>> > 3. DL-Learner: The DL-Learner helps in creating a framework following
>> > description logic. However, Protege also provides first-order logic
>> > through the use of the SWRL tab. Is the use of DL a hard-lined
>> > specification of the project, or is it all right to use both
>> > FOL and DL?
>>
>> First, you have to distinguish between "Protégé" which is a desktop
>> application and "Web Protégé" which is a web application and thus more
>> suitable for collaborating on the DBpedia ontology. In order to work
>> with SWRL, you would need the following: 1.) SWRL support in Web
>> Protégé, 2.) SWRL-Support in DL-Learner and 3.) some consensus in the
>> DBpedia ontology group to use SWRL (there is a dbpedia-ontology list to
>> which such a question could be posted - see the above link). So while it
>> is not impossible to do, it would be a larger project, whereas the use
>> of OWL (which builds on description logics) is already established in
>> DBpedia.
>>
>> > I would be grateful if you could tell me more about the project and
>>

[Dbpedia-gsoc] Fwd: Welcome to the "Dbpedia-gsoc" mailing list (Digest mode)

2015-03-17 Thread ashish surve
Hi all,

I am Ashish.
I'm from Symbiosis institute of computer studies and research, Pune, India.

I worked as an intern on NLP, and that's where I developed my interest
in NLP and Machine Learning.
During my internship I worked in Python, but I'm also familiar with C/C++ and Java.
I came to know about DBpedia while I was working as an intern on NER.

I'm quite interested in the project 5.1. Fact Extraction from Wikipedia Text.

I am going to start by downloading its source from github and trying out
some of the warm-up challenges.

Thanks and Regards,

Ashish


On Tue, Mar 17, 2015 at 12:30 PM, <
dbpedia-gsoc-requ...@lists.sourceforge.net> wrote:

> Welcome to the Dbpedia-gsoc@lists.sourceforge.net mailing list!
>
> To post to this list, send your email to:
>
>   dbpedia-gsoc@lists.sourceforge.net
>
> General information about the mailing list is at:
>
>   https://lists.sourceforge.net/lists/listinfo/dbpedia-gsoc
>
> If you ever want to unsubscribe or change your options (eg, switch to
> or from digest mode, change your password, etc.), visit your
> subscription page at:
>
>
> https://lists.sourceforge.net/lists/options/dbpedia-gsoc/surve.ashish21%40gmail.com
>
>
> You can also make such adjustments via email by sending a message to:
>
>   dbpedia-gsoc-requ...@lists.sourceforge.net
>
> with the word `help' in the subject or body (don't include the
> quotes), and you will get back a message with instructions.
>
> You must know your password to change your options (including changing
> the password, itself) or to unsubscribe.  It is:
>
>   soccerdude
>
> Normally, Mailman will remind you of your lists.sourceforge.net
> mailing list passwords once every month, although you can disable this
> if you prefer.  This reminder will also include instructions on how to
> unsubscribe or change your account options.  There is also a button on
> your options page that will email your current password to you.
>
--
Dive into the World of Parallel Programming The Go Parallel Website, sponsored
by Intel and developed in partnership with Slashdot Media, is your hub for all
things parallel software development, from weekly thought leadership blogs to
news, videos, case studies, tutorials and more. Take a look and join the 
conversation now. http://goparallel.sourceforge.net/
___
Dbpedia-gsoc mailing list
Dbpedia-gsoc@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dbpedia-gsoc


Re: [Dbpedia-gsoc] Newer to comtribute

2015-03-17 Thread Dimitris Kontokostas
Hi Hzshuai & welcome to DBpedia,

Please read our student guide
http://dbpedia.org/gsoc2015/ideas#h460-3

and you are more than welcome to ask questions on project ideas you like.

Cheers,
Dimitris

On Tue, Mar 17, 2015 at 5:08 AM, 黄章帅  wrote:

> Hi all,
>
> I am Hzshuai.
> I'm from EECS of Peking University, Beijing, China. My research field
> includes NLP and machine learning.
>
> I have experience in information extraction and machine translation (from
> Chinese to English and the reverse) from a 6-month internship at a start-up
> in Beijing.
> I am familiar with C/C++, Python and Java, but I'm a beginner in Scala, which
> (as far as I know) is a cool functional language.
>
> I learned about DBpedia from GSoC 2015. After browsing the project, I realized
> it is a great choice for me to get involved with the open source community.
> Now I'm eager to contribute to the DBpedia codebase, not only to practise
> what I have learned, but also to get my hands dirty and solve real,
> challenging problems.
>
> These days I am trying to figure out the tasks in this project, such as the
> ideas for GSoC 2015, the programming language and the environment.
> Since I am in the first year of my MS, I can spend a lot of time on
> open source, so I'd like to participate in a long-term task.
> Any help or easy-to-hard directions will be highly appreciated.
>
> Thanks.
>


-- 
Kontokostas Dimitris


Re: [Dbpedia-gsoc] GSOC_2015 Fact Extraction from Wikipedia Text

2015-03-17 Thread Marco Fossati
Hi Kasun,

On 3/16/15 7:28 AM, kasun perera wrote:
> Hi Marco
>
> After going through the warm-up tasks I have started writing the GSoC
> proposal. Going through the code repo and warm-up tasks, I see that the
> current code takes a Wikipedia corpus as input and performs the following
> steps:
>
>  1. Verb extraction and ranking
>  2. Frame Classifier Training
>  3. Frame Extraction
>
> But project idea 5.1 says these steps are to be implemented
> during the GSoC period, and I see that the above steps are already
> implemented in the current fact-extractor code, right?
The implementation is far from complete; the current codebase is more of a
first step from ideas to code.
> So what are the project
> expectations during the GSOC period? please clarify.
To implement a text extractor to be included in the DBpedia Extraction 
Framework.
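
As a rough illustration of step 1 above (verb extraction and ranking), here is
a minimal Python sketch. It assumes the sentences have already been POS-tagged
into (token, tag) pairs with Penn Treebank-style tags, and it ranks verbs by
raw frequency. Both the input format and the scoring are assumptions made for
illustration; this is not the code in the fact-extractor repository.

```python
from collections import Counter


def rank_verbs(tagged_sentences, top_n=5):
    """Rank verbs by frequency in a POS-tagged corpus.

    tagged_sentences: iterable of sentences, each a list of (token, tag)
    pairs. Tags starting with "VB" are treated as verbs (an assumption
    that matches Penn Treebank-style tag sets).
    """
    counts = Counter(
        token.lower()
        for sentence in tagged_sentences
        for token, tag in sentence
        if tag.startswith("VB")
    )
    # most_common() sorts by descending count.
    return counts.most_common(top_n)


# Hypothetical, hand-tagged toy corpus for a "Soccer" sub-domain.
corpus = [
    [("Messi", "NNP"), ("scored", "VBD"), ("twice", "RB")],
    [("Ronaldo", "NNP"), ("scored", "VBD"), ("and", "CC"), ("assisted", "VBD")],
    [("Buffon", "NNP"), ("saved", "VBD"), ("a", "DT"), ("penalty", "NN")],
]
print(rank_verbs(corpus, top_n=1))
```

The highest-ranked verbs would then be the lexical unit (LU) candidates; a real
ranking would likely weight verbs against a background corpus instead of using
raw counts.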
>
>
> On Wed, Mar 4, 2015 at 1:59 PM, hell.j@gmail.com wrote:
>
> Hi Kasun,
>
> The sentence substrings identified via entity linking will be the FE
> candidates.
> Then I think you got the idea behind the verb ranking step.
>
> Cheers!
>
> - Reply message -
> From: "kasun perera"
> To: "Marco Fossati" <hell.j@gmail.com>
> Cc: "dbpedia-gsoc"
> Subject: GSOC_2015 Fact Extraction from Wikipedia Text
> Date: Wed, Mar 4, 2015 07:08
>
>
> Hi Marco
>
> On Mon, Mar 2, 2015 at 5:23 PM, Marco Fossati wrote:
>
> 2- Also it mentioned the use of NLP techniques to process Wikipedia
> text. Does this mean extraction of dependency relationships to get the
> frame elements (FE) and lexical units (LU)?
>
> Dependency parsing may not be needed, since entity linking can
> be applied to fulfill the task.
>
>
> I'm not clear what you mean by the use of entity linking to identify FE
> candidates. In general, named entity linking (NEL) means linking the
> mentions of entities in text to a central knowledge base (e.g.
> Wikipedia). Do you mean to use the above concept to find FEs? Can
> you please clarify a bit more on the use of entity linking to identify FEs?
>
> This is my understanding of step 1 of the idea, i.e. verb
> extraction and ranking.
> We use a list of domains (e.g. Sports), then dig into more specific
> sub-domains (e.g. Soccer, Cricket, Rugby, etc.) of Wikipedia. Then we
> navigate to specific wiki pages under the sub-domain. For each wiki
> page we extract and rank the verbs based on the sub-domain, and
> higher-ranked verbs are used as LUs.
> What are your comments about this idea?
>
> Thanks
>
>
>
>
> --
> Regards
>
> Kasun Perera
>

-- 
Marco Fossati
http://about.me/marco.fossati
Twitter: @hjfocs
Skype: hell_j



Re: [Dbpedia-gsoc] Fwd: Welcome to the "Dbpedia-gsoc" mailing list (Digest mode)

2015-03-17 Thread Marco Fossati
Hi Ashish,

Cool, have a look at the active pull requests and try to pick a task 
that is still not taken.
Cheers!


-- 
Marco Fossati
http://about.me/marco.fossati
Twitter: @hjfocs
Skype: hell_j



Re: [Dbpedia-gsoc] Contribute to DbPedia

2015-03-17 Thread Abhishek Gupta
Hi Thiago,

Sorry for the delay!
I have set up the Spotlight server and it is running perfectly fine, but
with minimal settings. After this setup I played with the Spotlight server,
during which I came across the following discrepancies:

Example taken:
http://spotlight.dbpedia.org/rest/annotate?text=First documented in the
13th century, Berlin was the capital of the Kingdom of Prussia (1701–1918),
the German Empire (1871–1918), the Weimar Republic (1919–33) and the Third
Reich (1933–45). Berlin in the 1920s was the third largest municipality in
the world. In 1990 German reunification took place in whole Germany in which
 the city regained its status as the capital of Germany.

1) If we run this, "13th Century" is annotated as
"http://dbpedia.org/page/19th_century". This might be happening because the
context is very much from the 19th century and, moreover, "13th Century" and
"19th Century" differ only minimally on the surface (one character). But I am
not sure whether this is good or bad.
In my opinion, if we have an entity in our store
(http://dbpedia.org/page/13th_century) which perfectly matches the
surface form in the raw text ("13th Century"), we should have annotated the SF
with that entity.
The same might be the case with "Germany", which is associated with "History
of Germany", not "Germany".

2) We are spotting "place" and associating it with "Portland Place",
maybe due to stemming of the SF.
And even "Location (geography)" is not the correct entity
type for this. This is because we are not able to detect the sense of the
word "place" itself. For that we may have to use word senses, e.g. from
WordNet.

3) We are detecting ". Berlin" as a surface form, but I don't know
where this SF comes from. I suspect it doesn't come from
Wikipedia.

4) We spotted "capital of Germany" but I didn't get any candidates if we
run for "candidates" instead of "annotate".

5) We are able to spot "1920s" as a surface form but not "1920".

A few more questions:
1) Are we trying to annotate every word, noun or entity (e.g. proper noun)
in the raw text? In the above example I found "documented" (a word that is
not a noun or entity) annotated as "http://dbpedia.org/resource/Document".

2) Are we using surface forms to deal only with syntactic references (e.g.
surface form "municipality" referring to "Municipality",
"Metropolitan_municipality" or "Municipalities_of_Mexico"),
or with both syntactic and semantic references (e.g. aliases like "Third Reich"
referring to "Nazi Germany")?

I am working on generating extra possible surface forms from
a canonical surface form or the entity itself to deal with unseen SF
association problems.
I have also started working on my proposal and will submit it soon.
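
To make the idea of generating extra surface forms concrete (e.g. 'Michael
Jordan' -> 'M. Jordan', 'Jordan', as discussed in this thread), here is a
minimal Python sketch. The function name and the three rules it applies are
hypothetical and chosen only for illustration; they are not taken from the
Spotlight codebase.

```python
def surface_form_variants(canonical):
    """Generate simple surface-form variants of a multi-token name.

    Illustrative rules only: abbreviate the first token to an initial
    (with and without a period) and keep the last token on its own.
    """
    tokens = canonical.split()
    if len(tokens) < 2:
        return []
    first, rest, last = tokens[0], " ".join(tokens[1:]), tokens[-1]
    variants = [
        f"{first[0]}. {rest}",
        f"{first[0]} {rest}",
        last,
    ]
    # Drop duplicates while preserving order; never return the input itself.
    seen = set()
    unique = []
    for variant in variants:
        if variant != canonical and variant not in seen:
            seen.add(variant)
            unique.append(variant)
    return unique


print(surface_form_variants("Michael Jordan"))
```

Rules like these over-generate ('Jordan' is also a country), so each variant
would still need a score before being added to the surface form store.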

Thanks,
Abhishek

On Thu, Mar 12, 2015 at 8:20 PM, Thiago Galery  wrote:

> Hi Abhishek, thanks for the contribution. Your suggestions are pretty much
> aligned with what we were thinking in any event, and the initial plan
> seems good.
> On the assumption that there's some code that generates extra possible
> surface forms from a canonical surface form, as in your example 'Michael
> Jordan' -> 'M. Jordan', 'Jordan' and so on, it would be worth looking in the
> literature on Machine Translation on how to establish some score for the
> surface form. That is, if you spot 'M Jordan' in the text, what is the
> probability of it being a translation of the canonical name 'Michael
> Jordan'? If there's a simple way to implement this, we could try to get
> the raw data with counts, generate some extra SFs in a principled manner and
> use that to calculate probabilities. Still, for the moment, I'd focus on
> setting up the Spotlight server and playing with the warm-up tasks.
> Thanks for the good work,
> Thiago
>
>
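
Thiago's suggestion above, using raw counts to score a surface form against a
canonical entity, can be illustrated with a relative-frequency estimate of
P(entity | surface form). This is a minimal Python sketch under stated
assumptions: the counts are invented, the function name is hypothetical, and
a machine-translation-style model would be more elaborate than this.

```python
from collections import defaultdict


def entity_probabilities(pair_counts):
    """Estimate P(entity | surface form) by relative frequency.

    pair_counts maps (surface_form, entity) to how often that surface
    form was observed referring to that entity.
    """
    totals = defaultdict(int)
    for (surface_form, _entity), count in pair_counts.items():
        totals[surface_form] += count
    return {
        (surface_form, entity): count / totals[surface_form]
        for (surface_form, entity), count in pair_counts.items()
    }


# Hypothetical counts, invented for illustration.
counts = {
    ("M Jordan", "Michael_Jordan"): 3,
    ("M Jordan", "Michael_B._Jordan"): 1,
    ("Jordan", "Jordan"): 6,
    ("Jordan", "Michael_Jordan"): 2,
}
probs = entity_probabilities(counts)
print(probs[("M Jordan", "Michael_Jordan")])
```

Generated surface forms that never occur in the raw data get no estimate here,
which is exactly the unseen-SF problem; some form of smoothing would be needed
for those.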


[Dbpedia-gsoc] Fwd: Contribute to DbPedia

2015-03-17 Thread Thiago Galery
-- Forwarded message --
From: Thiago Galery 
Date: Tue, Mar 17, 2015 at 11:29 AM
Subject: Re: [Dbpedia-gsoc] Contribute to DbPedia
To: Abhishek Gupta 


Hi Abhishek, thanks for the work, here are some answers:

On Tue, Mar 17, 2015 at 9:10 AM, Abhishek Gupta  wrote:

> Hi Thiago,
>
> Sorry for the delay!
> I have set up the Spotlight server and it is running perfectly fine, but
> with minimal settings. After this setup I played with the Spotlight server,
> during which I came across the following discrepancies:
>
> Example taken:
> http://spotlight.dbpedia.org/rest/annotate?text=First documented in the
> 13th century, Berlin was the capital of the Kingdom of Prussia (1701–1918),
> the German Empire (1871–1918), the Weimar Republic (1919–33) and the Third
> Reich (1933–45). Berlin in the 1920s was the third largest municipality in
> the world. In 1990 German reunification took place in whole Germany in
> which the city regained its status as the capital of Germany.
>
> 1) If we run this, "13th Century" is annotated as
> "http://dbpedia.org/page/19th_century". This might be happening because
> the context is very much from the 19th century and, moreover, "13th Century"
> and "19th Century" differ only minimally on the surface (one character).
> But I am not sure whether this is good or bad.
>

This might be due to either "13th Century" being wrongly linked to 19th
century, or maybe the word "century" being linked to many different
centuries which then causes a disambiguation error due to the context. I
think your example is a counter-example to the way we generate the data
structures used for disambiguation.


> In my opinion, if we have an entity in our store
> (http://dbpedia.org/page/13th_century) which perfectly matches the
> surface form in the raw text ("13th Century"), we should have annotated the
> SF with that entity.
> The same might be the case with "Germany", which is associated with "History
> of Germany", not "Germany".
>

In this case other factors might have crept in; it could be that Germany
has a bigger number of inlinks or some other metric that allows it to
overtake the most natural candidate.


>
> 2) We are spotting "place" and associating it with "Portland Place",
> maybe due to stemming of the SF.
> And even "Location (geography)" is not the correct entity
> type for this. This is because we are not able to detect the sense of the
> word "place" itself. For that we may have to use word senses, e.g. from
> WordNet.
>

The SF spotting pipeline works a bit like this: you get a candidate SF,
like 'Portland Place', and see if there's a candidate for that, but you also
consider n-gram subparts, so it could have retrieved the candidates
associated with "place" instead.
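
The n-gram lookup described above can be sketched in a few lines of Python.
This is only an illustration of the idea: the function names and the toy
candidate store are hypothetical, and the real pipeline also scores and
filters what it retrieves.

```python
def ngram_subparts(surface_form):
    """Return all contiguous token n-grams of a surface form, longest first."""
    tokens = surface_form.split()
    n = len(tokens)
    return [
        " ".join(tokens[start:start + size])
        for size in range(n, 0, -1)
        for start in range(n - size + 1)
    ]


def lookup_candidates(surface_form, store):
    """Collect candidates for the full surface form and each sub-n-gram."""
    return {
        gram: store[gram.lower()]
        for gram in ngram_subparts(surface_form)
        if gram.lower() in store
    }


# Hypothetical candidate store keyed by lower-cased surface form.
store = {
    "portland place": ["Portland_Place"],
    "place": ["Location_(geography)"],
}
print(lookup_candidates("Portland Place", store))
```

Because the sub-n-gram "Place" has its own entry, its candidates are retrieved
alongside those of the full surface form, which is one way an unexpected
entity can end up attached to a common word.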


>
> 3) We are detecting ". Berlin" as a surface form, but I don't know
> where this SF comes from. I suspect it doesn't come from
> Wikipedia.
>

Although ". Berlin" is highlighted, the entity is matched on "Berlin"; the
extra space and punctuation come from the way we tokenize sentences. We
have chosen a language-independent tokenizer based on a break iterator
for speed and language independence, but it hasn't been tested very well.
This is the area that explains this mistake, and help with it is much
appreciated.

>
> 4) We spotted "capital of Germany" but I didn't get any candidates if we
> run for "candidates" instead of "annotate".
>

This might be due to a default confidence score. If you pass the extra
confidence param and set it to 0, you will probably see everything, e.g.
/candidates/?confidence=0&text=
In fact, I suggest you look at all the candidates in the text you used to
confirm (or not) what I've been saying here.
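
For reference, a request URL with the confidence parameter set to 0 can be
built as in the Python sketch below. The base URL is an assumption derived
from the annotate example earlier in the thread; only the `confidence` and
`text` parameters come from the message above, and no request is sent here.

```python
from urllib.parse import urlencode

# Assumed base URL: host and /rest/ prefix are taken from the annotate
# example in this thread; the exact candidates path may differ.
BASE_URL = "http://spotlight.dbpedia.org/rest/candidates"


def candidates_url(text, confidence=0):
    """Build a candidates request URL with an explicit confidence threshold."""
    query = urlencode({"confidence": confidence, "text": text})
    return f"{BASE_URL}?{query}"


print(candidates_url("capital of Germany", confidence=0))
```

`urlencode` also takes care of escaping spaces and non-ASCII characters such
as the en dashes in the Berlin example text.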


>
> 5) We are able to spot "1920s" as a surface form but not "1920".
>

This is due to the generation/stemming of SFs we have been discussing, but
I'm not sure that is a bad example. 1920, if used as a year, might not mean
the same as 1920s.


>
> A few more questions:
> 1) Are we trying to annotate every word, noun or entity (e.g. proper noun)
> in the raw text? In the above example I found "documented" (a word that is
> not a noun or entity) annotated as "http://dbpedia.org/resource/Document".
>
>
There are two main spotters: the default one, which uses a finite state
automaton generated from the surface form store to match incoming words as
valid sequences of states (so in this sense everything goes through the
pipeline), and another that uses an OpenNLP spotter that gets SFs from an NE
extractor. Both might generate single-noun n-grams. In this case, it could
be that there is a link in Wikipedia "documented" -> Document, which might
introduce "documented" as a valid state in the FSA.
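
To show how a surface form store can act as an automaton over tokens, here is
a minimal Python sketch that uses a token-level trie as a stand-in for the
FSA. The data structure, function names and toy store are assumptions for
illustration and do not reflect Spotlight's actual implementation.

```python
END = None  # accepting-state marker; tokens are strings, so it cannot clash


def build_trie(surface_forms):
    """Build a token-level trie; the END key marks an accepting state."""
    root = {}
    for surface_form in surface_forms:
        node = root
        for token in surface_form.lower().split():
            node = node.setdefault(token, {})
        node[END] = surface_form
    return root


def spot(tokens, trie):
    """Return (start, end, surface_form) for every accepted token sequence."""
    matches = []
    for start in range(len(tokens)):
        node = trie
        for end in range(start, len(tokens)):
            node = node.get(tokens[end].lower())
            if node is None:
                break  # no transition for this token
            if END in node:
                matches.append((start, end + 1, node[END]))
    return matches


# Hypothetical surface form store.
trie = build_trie(["documented", "Berlin", "Weimar Republic"])
tokens = "First documented in Berlin during the Weimar Republic".split()
print(spot(tokens, trie))
```

Once "documented" is in the store, for instance because of a wiki link, it is
matched like any other surface form; the spotter itself has no notion of part
of speech.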


> 2) Are we using surface forms to deal with only syntactic references (e.g.
> surface form "municipality" referring to "Municipality
> <

Re: [Dbpedia-gsoc] Contribute to DbPedia

2015-03-17 Thread Joachim Daiber
Hi Abhishek, Thiago,

please also note that

http://spotlight.dbpedia.org/rest/annotate


does not run the current statistical version of Spotlight but the old
Lucene version. You can check the current statistical version via the demo
[1] or the endpoint URL at the bottom of that page.

We should change that but I don't have access to that server, unfortunately.

Best,
Jo

[1] http://dbpedia-spotlight.github.io/demo/




Re: [Dbpedia-gsoc] GSOC Introduction

2015-03-17 Thread Thiago Galery
Hi Abhishek, thanks for joining. There's quite a lot of discussion going on
about these topics. I suggest taking a look at previous threads and
searching for the names Thiago and David. There's another student with the
same first name as you who has been asking many questions, so if you
search for him in the mailing list threads, you might see questions that
you would have asked yourself. Once you have a better idea of what you want
to understand better, ping us and we'll do our best to help.
All the best,
Thiago

On Mon, Mar 16, 2015 at 4:14 PM, Marco Fossati  wrote:

> Hi Abhishek,
>
> We are already working on your pull request, thanks!
> Feel free to share any thoughts on this mailing list (except those
> specific to the repo code).
> Cheers!
>
> On 16 March 2015 at 14:28, Abhishek Tiwari  wrote:
>
>> Hi all,
>>
>> My name is Abhishek Tiwari. I am a fourth-year undergraduate student at
>> IIT (BHU), Varanasi. I have been working on my semester project,
>> "Identification of causal relation in natural language text with the help
>> of graph patterns". This project gave me experience with the Stanford
>> parser (for chunking and obtaining parse trees) and SenseLearner (word
>> sense disambiguation).
>> I have also learned a number of Python libraries such as lxml, nltk,
>> multiprocessing, networkx (for graph representation) and graph-tool. I also
>> had to use the Hadoop streaming API to write MapReduce jobs in Python in
>> order to manage a large number of computations.
>>
>> Currently I am trying the warm-up tasks listed for 5.1 Fact
>> extraction from Wikipedia text.
>> I am also interested in the NLP topics from DBpedia Spotlight:
>> 5.15 Better Context Vectors
>> 5.16 Better Surface Form Matching
>> 5.19 Confidence/Relevance Scores
>>
>> I am going to try the warm-up tasks for these topics. Please guide me on
>> how best to understand them.
>>
>> Regards,
>> Abhishek Tiwari
>>
>>
>> --
>> Dive into the World of Parallel Programming The Go Parallel Website,
>> sponsored
>> by Intel and developed in partnership with Slashdot Media, is your hub
>> for all
>> things parallel software development, from weekly thought leadership
>> blogs to
>> news, videos, case studies, tutorials and more. Take a look and join the
>> conversation now. http://goparallel.sourceforge.net/
>> ___
>> Dbpedia-gsoc mailing list
>> Dbpedia-gsoc@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/dbpedia-gsoc
>>
>>
>
>
> --
> Marco Fossati
> http://about.me/marco.fossati
> Twitter: @hjfocs
> Skype: hell_j
>
>


[Dbpedia-gsoc] self-introduction

2015-03-17 Thread Jerry King
Hi all,
My name is Zerui Wang, or just call me Jerry. I am a graduate student
majoring in CS at Peking University, China.
I am good at Python and use it a lot in my daily work. I am also familiar
with classification algorithms, such as SVM. I was an intern at Sina, a
Chinese IT corporation, for 3 months, building a classification system with
my colleagues.
I am quite interested in 5.1, 5.2 and 5.9. I have read the papers
referenced in 5.1, downloaded the code of both 5.1 and 5.2, and had a glance
at the code base. It seems that I don't have enough time to investigate
5.9.
I am going to try the warm-up tasks.

I am not very clear about what "dynamic" means in idea 5.2. Could you please
clarify the expectations of this idea?

Best wishes.


Re: [Dbpedia-gsoc] self-introduction

2015-03-17 Thread Marco Fossati
Hi Jerry,

On 3/17/15 3:40 PM, Jerry King wrote:
> Hi all,
> My name is Zerui Wang, or just call me Jerry. I am a graduate student
> majoring in CS at Peking University, China.
> I am good at Python and use it a lot in my daily work. I am also
> familiar with classification algorithms, such as SVM. I was an intern at
> Sina, a Chinese IT corporation, for 3 months, building a classification
> system with my colleagues.
> I am quite interested in 5.1, 5.2 and 5.9. I have read the papers
> referenced in 5.1, downloaded the code of both 5.1 and 5.2, and had a glance
> at the code base.
> It seems that I don't have enough time to
> investigate 5.9.
> I am going to try the warm-up tasks.
Cool! For idea 5.1, have a look at the submitted pull requests and try 
to pick an issue that is not already in progress.
Cheers!
>
> I am not very clear about what "dynamic" means in idea 5.2. Could you please
> clarify the expectations of this idea?
>
> Best wishes.

-- 
Marco Fossati
http://about.me/marco.fossati
Twitter: @hjfocs
Skype: hell_j



Re: [Dbpedia-gsoc] Another GSoC Introduction

2015-03-17 Thread Marco Fossati
Hi Philipp,

Sounds good! Feel free to post your thoughts here and to submit pull
requests to the repos you want to be involved in.
Cheers,

On 3/17/15 4:31 PM, Philipp Dowling wrote:
> Hey everyone,
>
> My name is Philipp, I'm from Germany and I'm happy to meet you guys! I
> mostly do work in computational linguistics and NLP, so DBpedia was one
> of the most interesting projects in GSoC this year for me. My main
> strengths and/or interests are continuous space vector models, neural
> networks and information retrieval.
>
> A little bit about me: I'm just about finished with my undergrad studies
> in Munich, and will start my masters next. Most recently, I was in Hong
> Kong for my B.Sc. thesis, conducting research on semantic MT evaluation.
> I also work at a local startup, building data mining and knowledge
> discovery systems.
>
> To be more specific about my interests for GSoC: I'm most interested in
> tasks 5.15, 5.1, 5.9 and 5.12 (roughly in that order).
> 5.15 especially overlaps a lot with research I've been doing for my
> thesis, where I investigated the performance of continuous space models
> such as Word2Vec as a replacement for discrete context vectors, with
> very positive results. I got very familiar with different vector models
> from this, and would love to now continue working on something like this
> in a knowledge mining context.
> I also got exposed to frame semantics a little in the same context, and
> I'm currently working on knowledge mining, so 5.1 would also be a very
> interesting project.
> I'll come back with more specific questions when I've gotten a chance to
> look at everything else in detail, but overall I'm very excited to start
> getting to work!
>
> I'll get into some of the warm up tasks as soon as I get a chance. I
> haven't worked with DBpedia much before, so it'll be interesting to dive
> into the code base.
>
> Cheers,
> Philipp Dowling

-- 
Marco Fossati
http://about.me/marco.fossati
Twitter: @hjfocs
Skype: hell_j



[Dbpedia-gsoc] Another GSoC Introduction

2015-03-17 Thread Philipp Dowling
Hey everyone,

My name is Philipp, I'm from Germany and I'm happy to meet you guys! I
mostly do work in computational linguistics and NLP, so DBpedia was one of
the most interesting projects in GSoC this year for me. My main strengths
and/or interests are continuous space vector models, neural networks and
information retrieval.

A little bit about me: I'm just about finished with my undergrad studies in
Munich, and will start my masters next. Most recently, I was in Hong Kong
for my B.Sc. thesis, conducting research on semantic MT evaluation. I also
work at a local startup, building data mining and knowledge discovery
systems.

To be more specific about my interests for GSoC: I'm most interested in
tasks 5.15, 5.1, 5.9 and 5.12 (roughly in that order).
5.15 especially overlaps a lot with research I've been doing for my thesis,
where I investigated the performance of continuous space models such as
Word2Vec as a replacement for discrete context vectors, with very positive
results. I got very familiar with different vector models from this, and
would love to now continue working on something like this in a knowledge
mining context.
I also got exposed to frame semantics a little in the same context, and I'm
currently working on knowledge mining, so 5.1 would also be a very
interesting project.
I'll come back with more specific questions when I've gotten a chance to
look at everything else in detail, but overall I'm very excited to start
getting to work!

I'll get into some of the warm up tasks as soon as I get a chance. I
haven't worked with DBpedia much before, so it'll be interesting to dive
into the code base.

Cheers,
Philipp Dowling