Re: [Dbpedia-gsoc] Contribute to DbPedia

2015-03-02 Thread David Przybilla
Hi Abhishek,

There is a lot of experimentation which can be done with both 5.16 and
5.17.

In my opinion the current problem is that the Surface Form (SF) matching is
a bit poor.
Mixing the Babelfy superstring matching with other ideas to make SF
spotting better could be a great start.
You can also bring in ideas from papers such as [1] in order to handle more
linguistic variation.

It's hard to say which one is better; however, you can mix ideas, e.g.
use superstring matching to greedily match more surface forms with more
linguistic variations, while using word2vec in the disambiguation stage.
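
To make the spotting side of this concrete, here is a minimal sketch of greedy matching over normalised surface forms. It is plain left-to-right longest-match, not Babelfy's superstring matching, and the names `normalize` and `spot`, the determiner list and the `max_len` parameter are all assumptions made for illustration; none of this is Spotlight's actual code.

```python
import unicodedata

DETERMINERS = {"the", "a", "an"}  # assumed, English-only list


def normalize(text):
    """Lowercase, strip accents and drop a leading determiner (a crude, assumed normalisation)."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    tokens = no_accents.lower().split()
    # Only drop the determiner when something is left afterwards.
    if len(tokens) > 1 and tokens[0] in DETERMINERS:
        tokens = tokens[1:]
    return " ".join(tokens)


def spot(tokens, surface_forms, max_len=4):
    """Greedy left-to-right longest match of token spans against known surface forms."""
    known = {normalize(sf) for sf in surface_forms}
    spots, i = [], 0
    while i < len(tokens):
        # Try the longest span first, then shrink it.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if normalize(candidate) in known:
                spots.append(candidate)
                i += n
                break
        else:
            i += 1  # no span starting here matched
    return spots
```

Calling `spot("I visited the Eiffel Tower in París".split(), ["Eiffel Tower", "Paris"])` matches both the determiner-prefixed and the accented variant. The disambiguation stage (e.g. with word2vec) would then run over the spotted candidates.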

Feel free to poke me if you would like to discuss in more detail :)


[1] https://aclweb.org/anthology/P/P11/P11-1095.pdf





On Mon, Mar 2, 2015 at 7:21 PM, Abhishek Gupta  wrote:

> Hi all,
>
> Recently I checked out the DBpedia ideas list for GSoC 2015, and I
> must admit that every idea is more interesting than the
> previous one. While looking for ideas that interest me, I found the
> following most fascinating. I wish I could work on all of them, but
> unfortunately I can't:
>
> 1) 5.1 Fact Extraction from Wikipedia Text
>
> 2) 5.9 Keyword Search on DBpedia Data
>
> 3) 5.16 DBpedia Spotlight - Better Context Vectors
>
> 4) 5.17 DBpedia Spotlight - Better Surface form Matching
>
> 5) 5.19 DBpedia Spotlight - Confidence / Relevance Scores
>
> But among these I found a couple of ideas interlinked; in other words, one
> solution might lead to another. In 5.1, 5.16 and 5.17, the primary
> problems are Entity Linking (EL) and Word Sense Disambiguation (WSD) from
> raw text to DBpedia entities, so as to understand raw text and disambiguate
> senses or entities. So if we can address these two tasks efficiently, then
> we can solve the problems associated with these three ideas.
>
> The following are some methods from the research papers
> referenced by these ideas.
>
> 1) FrameNet: Identify frames (indicating a particular type of situation
> along with its participants, i.e. task, doer and props), and then identify
> Lexical Units and their associated Frame Elements by using models trained
> primarily on crowd-sourced data. Primarily used for Automatic Semantic Role
> Labeling.
>
> 2) Babelfy: Using a wide semantic network that encodes structural and lexical
> information, both encyclopedic and lexicographic (from Wikipedia and
> WordNet respectively), we can also accomplish our tasks (EL and WSD). Here a
> graph-based method along with some heuristics is used to extract the most
> relevant meaning from the text.
>
> 3) Word2vec / Glove - Methods for designing word vectors based on the
> context. These are primarily employed for WSD.
>
> Moreover if those problems are solved then we can address keyword search
> (5.9) and Confidence Scoring (5.19) effectively as both require association
> of entities to the raw text which will provide concerned entity and its
> attributes to search with and the confidence score.
>
> So I would like to work on 5.16 or 5.17 which will encompass those two
> tasks (EL and WSD) and for this I would like to ask which method will be
> the best for these two tasks? According to me it is the babelfy method
> which will be appropriate for both of these tasks.
>
> Thanks,
> Abhishek Gupta
> On Feb 23, 2015 5:46 PM, "Thiago Galery"  wrote:
>
>> Hi Abishek, if you are interested in contributing to any DBpedia project
>> or participating in Gsoc this year it might be a good idea to take a look
>> at this page http://wiki.dbpedia.org/gsoc2015/ideas . This might help
>> you to specify how/where you can contribute. Hope this helps,
>> Thiago
>>
>> On Sun, Feb 22, 2015 at 2:09 PM, Abhishek Gupta 
>> wrote:
>>
>>> Hi all,
>>>
>>> I am Abhishek Gupta. I am a student of Electrical Engineering from IIT
>>> Delhi. Recently I have worked on the projects related to Machine Learning
>>> and Natural Language Processing (i.e. Information Extraction) in which I
>>> extracted Named Entities from raw text to populate knowledge base with new
>>> entities. Hence I am inclined to work in this area. Besides this I am also
>>> familiar with programming languages like C, C++ and Java primarily.
>>>
>>> So I presume that I can contribute a lot towards extracting structured
>>> data from wikipedia which is one of the primary step towards Dbpedia's
>>> primary goal.
>>>
>>> So can anyone please help me out where to start from so as to contribute
>>> towards this?
>>>
>>> Regards
>>> Abhishek Gupta
>>>
>>>
>>> --
>>> Download BIRT iHub F-Type - The Free Enterprise-Grade BIRT Server
>>> from Actuate! Instantly Supercharge Your Business Reports and Dashboards
>>> with Interactivity, Sharing, Native Excel Exports, App Integration & more
>>> Get technology previously reserved for billion-dollar corporations, FREE
>>>
>>> http://pubads.g.doubleclick.net/gampad/clk?id=190641631&iu=/4140/ostg.clktrk
>>> __

Re: [Dbpedia-gsoc] GSOC Aspirant

2015-03-04 Thread David Przybilla
Hi Juhi,

Do you have a particular interest in either of those two?
In the descriptions we referenced a few papers which could be used as
sources of ideas.

DBpedia Spotlight – Better Surface Form Matching:

We are currently having some problems with linguistic variations. One of
the ideas (but it is not fixed, so feel free to bring yours as well :] )
could be to use the superstring matching from the Babelfy paper. It
would also be nice to mix it with some of the methods described in [1]


DBpedia Spotlight – Better Context Vectors:

The quality of the context vectors seems a bit weird for some entities.
Values are not normalised, among other issues.
The aim here is to improve context vectors/disambiguation.
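
As an illustration of the normalisation point, here is a minimal sketch that L2-normalises sparse context vectors before comparing them with cosine similarity. It assumes a context vector is a mapping from token to raw weight; the function names are hypothetical and this is not how the Spotlight stores represent vectors internally.

```python
import math


def l2_normalize(vector):
    """Scale a sparse vector (token -> weight) to unit length."""
    norm = math.sqrt(sum(w * w for w in vector.values()))
    if norm == 0.0:
        return dict(vector)  # nothing to scale
    return {token: w / norm for token, w in vector.items()}


def cosine(a, b):
    """Cosine similarity of two sparse vectors, normalising both first."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(w * b.get(token, 0.0) for token, w in a.items())
```

With normalised vectors, an entity with a very long Wikipedia article no longer dominates the similarity just because its raw counts are larger.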

Some of the ideas could be: use GloVe/word2vec entity vectors,
but also maybe bring in some ideas from discourse parsing, like the ones
described in [2]


A good start if you are interested would probably be:

 - take a quick look at the mentioned papers to get some ideas
 - play a bit with Spotlight ( demo
http://dbpedia-spotlight.github.io/demo/ )
 - try to set it up locally (compile it, run it)
 - try to understand how the Spotlight stores work. The paper
describing how Spotlight works should give a good overview [3]

[1] https://aclweb.org/anthology/P/P11/P11-1095.pdf
[2] http://www.aclweb.org/anthology/C14-1213
[3] http://blog.semantic-web.at/wp-content/uploads/2011/09/p1_mendes.pdf

On Wed, Mar 4, 2015 at 9:23 AM, Marco Fossati  wrote:

> Hi Juhi,
>
> you should send all your inquiries in the mailing list (in CC).
> Please check out the following page with our project ideas:
> http://wiki.dbpedia.org/gsoc2015/ideas
>
> Cheers!
>
> On 3/3/15 2:58 PM, JUHI TANDON wrote:
> > Hello Marco,
> >
> > I am Juhi Tandon pursuing my major in Computational Linguistics from
> > IIIT Hyderabad. I am an NLP enthusiast and as such I found
> > these projects particularly interesting:
> >
> >
> >   DBpedia Spotlight – Better Context Vectors and
> >
> >
> >   DBpedia Spotlight – Better Surface Form Matching
> >
> > I would like to contribute to one of these projects as a part of the GSoC
> > 2015 program. Could the mentors please provide some insights on where
> > to begin?
> >
> > Thanks and Regards,
> >
> > Juhi
>
> --
> Marco Fossati
> http://about.me/marco.fossati
> Twitter: @hjfocs
> Skype: hell_j
>
>
> --
> Dive into the World of Parallel Programming The Go Parallel Website,
> sponsored
> by Intel and developed in partnership with Slashdot Media, is your hub for
> all
> things parallel software development, from weekly thought leadership blogs
> to
> news, videos, case studies, tutorials and more. Take a look and join the
> conversation now. http://goparallel.sourceforge.net/
> ___
> Dbpedia-gsoc mailing list
> Dbpedia-gsoc@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/dbpedia-gsoc
>


Re: [Dbpedia-gsoc] DBpedia Spotlight – Better Tools for Model Creation - Ideas

2015-03-06 Thread David Przybilla
Hi Naveen,

In order to create a Spotlight model we have to massage the Wikipedia dump
in order to get some statistics out of it.
Those statistics include the probability of seeing a surface form, a
context vector for each entity, etc.
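
For a feel of what such a statistic looks like, here is a minimal sketch that estimates, for each surface form, how often it appears as a link relative to how often it appears at all. The inputs are assumed to be flat lists of surface-form strings already extracted from the dump; the function name is hypothetical and the real model-creation scripts compute more than this.

```python
from collections import Counter


def surface_form_probabilities(linked_mentions, all_mentions):
    """Estimate P(linked | surface form) = linked count / total count."""
    linked = Counter(linked_mentions)
    total = Counter(all_mentions)
    return {sf: linked[sf] / total[sf] for sf in total}
```

For example, a surface form seen 4 times in text but linked only once gets a probability of 0.25, which a spotter could use to skip unlikely mentions.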

- It seems that the current script that generates those models, written in
Pig, is broken. Check the issues below

- It seems there are other projects that could benefit from this framework
if done properly

- Spark is a good alternative, given that it makes map/reduce problems
easier to model and is also fast.
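
To show why this kind of statistic fits the map/reduce shape, here is a small pure-Python sketch that counts context tokens per entity. It does not use Spark or Pig; the record layout `(entity, paragraph_text)` and all function names are assumptions for illustration. In Spark the two phases would correspond roughly to `flatMap` and `reduceByKey` on an RDD.

```python
from collections import defaultdict


def map_phase(record):
    """Emit ((entity, token), 1) for every token in the paragraph around a link."""
    entity, text = record
    return [((entity, token.lower()), 1) for token in text.split()]


def reduce_by_key(pairs):
    """Sum the values of all pairs that share a key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def context_counts(records):
    """Run the map phase over all records, then reduce by key."""
    pairs = (pair for record in records for pair in map_phase(record))
    return reduce_by_key(pairs)
```

Because each record is mapped independently and the reduction is a plain sum, the same job can be split across many workers over a full Wikipedia dump.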

A good start would be :

 1. Play a bit with spotlight ( http://dbpedia-spotlight.github.io/demo/ )
 2. Check the warm up tasks
https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Warm-up-tasks
 3. Compile and run spotlight locally
 4. Understand what the stores are and what values they hold
 5. Take a look at the script generating the current model:
https://github.com/dbpedia-spotlight/pignlproc  .

Issues:
https://github.com/dbpedia-spotlight/dbpedia-spotlight/issues/329
https://github.com/dbpedia-spotlight/dbpedia-spotlight/issues/321


On Fri, Mar 6, 2015 at 8:59 PM, Naveen Madhire 
wrote:

> Hi Team,
>
>
> I am currently pursuing a Master's in Data Science at Indiana University. I
> am very much interested in participating in this year's GSoC, and the idea
> listed on DBpedia's website caught my eye as I am comfortable with Apache
> Spark, entity linking and Java.
>
> DBpedia Spotlight – Better Tools for Model Creation
>
> I don't see any discussion happening in the archives.
>
> If possible can anyone share any references to look into and any details
> which will help me to understand the current project in detail.
>
> Please let me know.
>
>
> Thanks,
> Naveen M
>
>


Re: [Dbpedia-gsoc] GSoC 2015 - Introduction

2015-03-07 Thread David Przybilla
Hi Shashank,

On DBpedia Spotlight – Better Context Vectors:

Here are the DBpedia Spotlight warm-up tasks:
https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Warm-up-tasks

If you take a look at the GitHub issues page you should find some of the
problems we are dealing with. One of the ideas could be experimenting with
word2vec.

Have a nice weekend :)

On Sat, Mar 7, 2015 at 11:46 AM, shashank juyal  wrote:

> Hi,
>
> I am a Masters student in International Institute of Information
> technology, Hyderabad (IIIT-H). I am interested in taking part in this
> year's GSOC. Many of the projects in DBPedia sounds very familiar and
> interesting to me as I have worked closely with many of the concepts and
> technologies used in the project.
>
> I have worked previously with Wikipedia data and built a small search over
> it based on tf-idf score and my own parser. Also currently I am working in
> a project "Question Answer techniques using NLP" which uses concepts like
> wordtovec, CBOW, NL Processing and translation to query language, which are
> mentioned in some of the projects in DBPedia-Spotlight.
>
> Based on this, I would like to work on the following projects:
>
> 1) Fact Extraction from Wikipedia Text
> 2) Keyword Search on DBpedia
> 3) Deploying a DBpedia Question Answering Engine
> 4) DBpedia Spotlight – Better Context Vectors
>
> Please let me know the warm-up tasks in the above projects.
>
> Linked Profile: in.linkedin.com/in/shajuyal
> Github Profile: https://github.com/sjuyal
>
> Thanks and Regards,
> Shashank Juyal
>
>


Re: [Dbpedia-gsoc] GSoC 2015

2015-03-09 Thread David Przybilla
Hi Oleksandr,

5.16 and 5.17 both involve Scala plus a bit of natural language processing.
5.17 is more about being able to massage a Wikipedia dump and getting
numbers out of it for named entity recognition.



On Mon, Mar 9, 2015 at 9:27 AM, Dimitris Kontokostas 
wrote:

> Hi Oleksandr & welcome
>
> I'd suggest you narrow down your topics to very few 1-2 in order to be
> able to better focus on your final proposal.
> Let us know if you have any questions
>
> Cheers,
> DImitris
>
> On Sun, Mar 8, 2015 at 11:59 PM, Oleksandr Olgashko <
> alexandrolg...@gmail.com> wrote:
>
>> Hello,
>>
>> I'd like to investigate possibilities to participate in GSoC as part of
>> DBpedia organizations. Since I never participated in GSoC before, some
>> questions may sound naive.
>>
>> My name is Oleksandr Olgashko, I'm a first year master's student in Taras
>> Shevchenko National University of Kyiv (Ukraine). Some links about me:
>> https://github.com/dveim
>> https://www.linkedin.com/in/olgashko
>> https://www.coursera.org/user/i/d5878dc26bfe6cbe456d0e119d96e551
>>
>> My primary interests are machine learning (particularly, natural language
>> processing, what I was doing on previous project) and data analysis, also
>> I'm a fan of Scala programming language. DBpedia has most natural
>> combination of those skills.
>>
>> On your ideas page I've found several interesting projects, like 5.3,
>> 5.7, 5.14, 5.16, 5.17. Which of them are more relevant, so I can start
>> research deeper?
>>
>> Thanks for answers
>>
>>
>>
>>
>>
>
>
> --
> Kontokostas Dimitris
>
>


Re: [Dbpedia-gsoc] Contribute to DbPedia

2015-03-09 Thread David Przybilla
Hi Abhishek,

I guess you could try to implement spotting and disambiguation in the same
step, as the Babelfy paper suggests.

Warm up tasks:
https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Warm-up-tasks


On Mon, Mar 9, 2015 at 8:04 AM, Axel Ngonga <
ngo...@informatik.uni-leipzig.de> wrote:

>  Hallo Abhishek,
>
> Cool to have you here! For the keyword search topic, please check out
> * http://goo.gl/dPbP3F
> * http://dl.acm.org/citation.cfm?id=2488488
>
> Feel free to contact me for questions and/or a warm-up task.
>
> Best regards,
> Axel
>
> Hi all,
>
> Recently I checked out the ideas list of DBpedia for GSoC 2015 and I
> should admit one thing that every idea is more interesting than the
> previous one. While I was looking out for ideas that interests me I found
> following ideas most fascinating and I wish I could work on all of them but
> unfortunately I couldn't:
>
> 1) 5.1 Fact Extraction from Wikipedia Text
>
> 2) 5.9 Keyword Search on DBpedia Data
>
> 3) 5.16 DBpedia Spotlight - Better Context Vectors
>
> 4) 5.17 DBpedia Spotlight - Better Surface form Matching
>
> 5) 5.19 DBpedia Spotlight - Confidence / Relevance Scores
>
> But in all these I found a couple of ideas interlinked, in other words one
> solution might leads to another. Like in 5.1, 5.16, 5.17 our primary
> problems are Entity Linking (EL) and Word Sense Disambiguation (WSD) from
> raw text to DBpedia entities so as to understand raw text and disambiguate
> senses or entities. So if we can address these two tasks efficiently then
> we can solve problems associated with these three ideas.
>
> Following are some methods which were there in the research papers
> mentioned in references of these ideas.
>
> 1) FrameNet: Identify frames (indicating a particular type of situation
> along with its participants, i.e. task, doer and props), and then identify
> Logical Units, and their associated Frame Elements by using models trained
> primarily on crowd-sourced data. Primarily used for Automatic Semantic Role
> Labeling.
>
> 2) Babelfy: Using a wide semantic network, encoding structural and lexical
> information of both type encyclopedic and lexicographic like Wikipedia and
> WordNet resp., we can also accomplish our tasks (EL and WSD). In this a
> graphical method along with some heuristics is used to extract out the most
> relevant meaning from the text.
>
> 3) Word2vec / Glove - Methods for designing word vectors based on the
> context. These are primarily employed for WSD.
>
> Moreover if those problems are solved then we can address keyword search
> (5.9) and Confidence Scoring (5.19) effectively as both require association
> of entities to the raw text which will provide concerned entity and its
> attributes to search with and the confidence score.
>
> So I would like to work on 5.16 or 5.17 which will encompass those two
> tasks (EL and WSD) and for this I would like to ask which method will be
> the best for these two tasks? According to me it is the babelfy method
> which will be appropriate for both of these tasks.
>
> Thanks,
> Abhishek Gupta
> On Feb 23, 2015 5:46 PM, "Thiago Galery"  wrote:
>
>>  Hi Abishek, if you are interested in contributing to any DBpedia
>> project or participating in Gsoc this year it might be a good idea to take
>> a look at this page http://wiki.dbpedia.org/gsoc2015/ideas . This might
>> help you to specify how/where you can contribute. Hope this helps,
>>  Thiago
>>
>> On Sun, Feb 22, 2015 at 2:09 PM, Abhishek Gupta 
>> wrote:
>>
>>> Hi all,
>>>
>>>  I am Abhishek Gupta. I am a student of Electrical Engineering from IIT
>>> Delhi. Recently I have worked on the projects related to Machine Learning
>>> and Natural Language Processing (i.e. Information Extraction) in which I
>>> extracted Named Entities from raw text to populate knowledge base with new
>>> entities. Hence I am inclined to work in this area. Besides this I am also
>>> familiar with programming languages like C, C++ and Java primarily.
>>>
>>>  So I presume that I can contribute a lot towards extracting structured
>>> data from wikipedia which is one of the primary step towards Dbpedia's
>>> primary goal.
>>>
>>>  So can anyone please help me out where to start from so as to
>>> contribute towards this?
>>>
>>>  Regards
>>>  Abhishek Gupta
>>>
>>>
>>>
>>>
>>
>
> --

Re: [Dbpedia-gsoc] DBpedia Spotlight – Better Tools for Model Creation - Ideas

2015-03-09 Thread David Przybilla
Hi Naveen,

If you are playing with the model editor,
this post will be useful both for understanding how the editor works and
how the stores interact with each other:
http://engineering.idioplatform.com/2015/02/23/spotlight-model-editor.html

On Sun, Mar 8, 2015 at 5:59 PM, Thiago Galery  wrote:

> Hi Naveen, in the wiki you can find the papers from the spotlight
> community that might shed light into that. In general terms, spotlight
> implements an entity mention model for entity linking and a very helpful
> paper for understanding that is this one
> https://aclweb.org/anthology/P/P11/P11-1095.pdf .
> All the best,
> Thiago
>
> On Sun, Mar 8, 2015 at 11:37 AM, Naveen Madhire 
> wrote:
>
>> Hi David,
>>
>>
>> I cloned the latest code from dbpedia-spotlight and did a build using IntelliJ.
>> I ran a few of the examples listed here
>> https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Run-from-a-JAR
>>
>> However, I still didn't get the gist of how Spotlight works.
>>
>> I am going to play around with the Spotlight editor and see how it works.
>>
>> Do you have any more examples related to Spotlight?
>>
>>
>>
>> Thanks,
>> Naveen
>>
>> On Fri, Mar 6, 2015 at 7:05 PM, David Przybilla 
>> wrote:
>>
>>> Hi Naveen,
>>>
>>> In order to create a Spotlight Model we have to massage the Wikipedia
>>> dump in order get some statistics out of it.
>>> Those statistics include the probability of seeing a surface form,
>>> creating a context vector for each entity..etc.
>>>
>>> - It seems that the current script to do generate those models made on
>>> pig is broken.  Check the issues below
>>>
>>> - It seems there are other projects who could benefit from this
>>> framework if done properly
>>>
>>> - Spark is a good alternative, given that is easier to model map/reduce
>>> problems but also because it is fast.
>>>
>>> A good start would be :
>>>
>>>  1. Play a bit with spotlight ( http://dbpedia-spotlight.github.io/demo/
>>> )
>>>  2. Check the warm up tasks
>>> https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Warm-up-tasks
>>>
>>>  3. Compile and run spotlight locally
>>>  4. Understand what are  stores  and what values are inside them
>>>  5. Take a look at the script generating the current model:
>>> https://github.com/dbpedia-spotlight/pignlproc  .
>>>
>>> Issues:
>>> https://github.com/dbpedia-spotlight/dbpedia-spotlight/issues/329
>>> https://github.com/dbpedia-spotlight/dbpedia-spotlight/issues/321
>>>
>>>
>>> On Fri, Mar 6, 2015 at 8:59 PM, Naveen Madhire wrote:
>>>
>>>> Hi Team,
>>>>
>>>>
>>>> I am currently pursuing a Master's in Data Science at Indiana
>>>> University. I am very much interested in participating in this year's GSoC,
>>>> and the idea listed on DBpedia's website caught my eye as I am comfortable
>>>> with Apache Spark, entity linking and Java.
>>>>
>>>> DBpedia Spotlight – Better Tools for Model Creation
>>>>
>>>> I don't see any discussion happening in the archives.
>>>>
>>>> If possible can anyone share any references to look into and any
>>>> details which will help me to understand the current project in detail.
>>>>
>>>> Please let me know.
>>>>
>>>>
>>>> Thanks,
>>>> Naveen M
>>>>
>>>>
>>>>
>>>>
>>>
>>
>>

Re: [Dbpedia-gsoc] Contribute to DbPedia

2015-03-12 Thread David Przybilla
es, hence a case where smoothing might be a
>>> good call.
>>>
>>>
>>>
>>>>
>>>>
>>>> Regarding Project 5.17 (DBpedia Spotlight - Better Surface Form
>>>> Matching):
>>>>
>>>> *How to deal with linguistic variation: lowercase/uppercase surface
>>>> forms, determiners, accents, unicode, in a way such that the right
>>>> generalizations can be made and some form of probabilistic structure can
>>>> be determined in a principled way?*
>>>> For dealing with linguistic variations we can calculate lexical
>>>> translation probability from all probable name mentions to entities in KB
>>>> as shown in Entity Name Model in [2].
>>>>
>>>> *Improve the memory footprint of the stores that hold the surface forms
>>>> and their associated entities.*
>>>> In what respect are we planning to improve the footprint: in terms
>>>> of space, of association, or something else?
>>>>
>>>> For this project I have a couple of questions in mind:
>>>> 1) Are we planning to improve the same model that we are using in
>>>> dbpedia-spotlight for entity linking?
>>>>
>>>
>>> Yes
>>>
>>>
>>>> 2) If not we can change the whole model itself to something else like:
>>>> a) Generative Model [2]
>>>> b) Discriminative Model [3]
>>>> c) Graph Based [4] - Babelfy
>>>> d) Probabilistic Graph Based
>>>>
>>>
>>> Incorporating something like (c) or (d) might be a good call, but might
>>> be way bigger than one summer.
>>>
>>>
>>>
>>>> 3) Why are we planning to store surface forms with associated entities
>>>> instead of finding associated entities during disambiguation itself?
>>>>
>>>
>>> Not sure what you mean by that.
>>>
>>>
>>>> Besides this I would also like to know regarding warm-up task I have to
>>>> do.
>>>>
>>>
>>> If you check the pull request page in Spotlight, @dav009 has a PR which
>>> he claims to be a mere *idea*, but which forces surface forms to be stemmed before
>>> storing. Pulling from that branch, recompiling, running Spotlight and
>>> looking at some of the results would be a good start. You can also nag us on
>>> that issue about ideas you might have after you understand the code.
>>>
>>>
>>>
>>>>
>>>> Thanks,
>>>> Abhishek Gupta
>>>>
>>>> [1]
>>>> https://drive.google.com/file/d/0B7XkCwpI5KDYRWRnd1RzWXQ2TWc/edit?usp=sharing
>>>> [2] https://aclweb.org/anthology/P/P11/P11-1095.pdf
>>>> [3]
>>>> http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/38389.pdf
>>>> [4]
>>>> http://wwwusers.di.uniroma1.it/~moro/MoroRaganatoNavigli_TACL2014.pdf
>>>> [5] http://www.aclweb.org/anthology/D11-1072
>>>>
>>>> On Tue, Mar 3, 2015 at 2:01 AM, David Przybilla <
>>>> dav.alejan...@gmail.com> wrote:
>>>>
>>>>> Hi Abhisek,
>>>>>
>>>>> There is a lot of experimentation which can be done with both 5.16
>>>>>  and 5.17.
>>>>>
>>>>> In my opinion the current problem is that the Surface Form(SF)
>>>>> matching is a bit poor.
>>>>> Mixing the Babelfy Superstring matching with other ideas to make SF
>>>>> spotting better could be a great start.
>>>>> You can also bring ideas from papers such as [1] in order to address
>>>>> more linguistic variations.
>>>>>
>>>>> It's hard to debate which one is better, however you can mix ideas
>>>>> i.e: use superstring matching to greedy match more Surface forms with
>>>>> more linguistic variations, while using word2vec in the disambiguation
>>>>> stage.
>>>>>
>>>>> Feel free to poke me if you would like to discuss in more detail :)
>>>>>
>>>>>
>>>>> [1] https://aclweb.org/anthology/P/P11/P11-1095.pdf
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Mar 2, 2015 at 7:21 PM, Abhishek Gupta 
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> Recently I checked o

Re: [Dbpedia-gsoc] GSoC 2015 - Introduction

2015-03-12 Thread David Przybilla
Hi Shashank,

It looks alright.
I think you can skip the Spark part, as you are not interested in the
project concerning the model building.

As for the specific project you selected, I think the best would be to:

- Understand how a Spotlight model is divided (Surface Form Store, Context
Store, Candidate Store). This blog entry [1] can probably help you, as well
as playing with [2]

- Also read the main paper on which Spotlight is based (I previously
mentioned it, but it is also listed in the literature on GitHub)
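
As a rough mental model of that three-way split, here is a toy sketch of the stores as plain dictionaries, with a lookup that ranks candidate entities for a surface form by their relative frequency. The class name `ToyModel`, its field layout and `ranked_candidates` are assumptions made for illustration only; the real stores are serialized structures with their own classes, so treat the blog entry and the model editor as the reference.

```python
from dataclasses import dataclass, field


@dataclass
class ToyModel:
    # surface form -> (annotated count, total count)
    surface_forms: dict = field(default_factory=dict)
    # surface form -> {entity: co-occurrence count}
    candidates: dict = field(default_factory=dict)
    # entity -> {context token: count}
    contexts: dict = field(default_factory=dict)

    def ranked_candidates(self, surface_form):
        """Return (entity, P(entity | surface form)) pairs, most likely first."""
        counts = self.candidates.get(surface_form, {})
        total = sum(counts.values())
        if total == 0:
            return []
        ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
        return [(entity, count / total) for entity, count in ranked]
```

The context store would come into play after this step, to re-rank the candidates against the text surrounding the mention.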

[1]
http://engineering.idioplatform.com/2015/02/23/spotlight-model-editor.html
[2] https://github.com/idio/spotlight-model-editor

On Thu, Mar 12, 2015 at 1:35 PM, shashank juyal  wrote:

> Hi David,
>
> Please find attached the warm up tasks I have done.
> I am still involved in some of the issues and documentation. I have also
> mentioned those in the pdf.
> Please let me know if any other warm up task has to be done.
>
> Thanks and Regards,
> Shashank Juyal
>
>
>
> On Sun, Mar 8, 2015 at 12:36 AM, David Przybilla 
> wrote:
>
>> Hi Shashank,
>>
>> On DBpedia Spotlight – Better Context Vectors:
>>
>> Here are the DBPedia Spotlight warm tasks:
>> https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Warm-up-tasks
>>
>> if you take a look at the github issue page you should find some of the
>> problems we are dealing with. One of the ideas could be experimenting with
>> word2vec.
>>
>> Have a nice weekend :)
>>
>> On Sat, Mar 7, 2015 at 11:46 AM, shashank juyal 
>> wrote:
>>
>>> Hi,
>>>
>>> I am a Masters student in International Institute of Information
>>> technology, Hyderabad (IIIT-H). I am interested in taking part in this
>>> year's GSOC. Many of the projects in DBPedia sounds very familiar and
>>> interesting to me as I have worked closely with many of the concepts and
>>> technologies used in the project.
>>>
>>> I have worked previously with Wikipedia data and built a small search
>>> over it based on tf-idf score and my own parser. Also currently I am
>>> working in a project "Question Answer techniques using NLP" which uses
>>> concepts like wordtovec, CBOW, NL Processing and translation to query
>>> language, which are mentioned in some of the projects in DBPedia-Spotlight.
>>>
>>> Based on this, I would like to work on the following projects:
>>>
>>> 1) Fact Extraction from Wikipedia Text
>>> 2) Keyword Search on DBpedia
>>> 3) Deploying a DBpedia Question Answering Engine
>>> 4) DBpedia Spotlight – Better Context Vectors
>>>
>>> Please let me know the warm-up tasks for the above projects.
>>>
>>> LinkedIn Profile: in.linkedin.com/in/shajuyal
>>> Github Profile: https://github.com/sjuyal
>>>
>>> Thanks and Regards,
>>> Shashank Juyal
>>>
>>>
>>> --
>>> Dive into the World of Parallel Programming The Go Parallel Website,
>>> sponsored
>>> by Intel and developed in partnership with Slashdot Media, is your hub
>>> for all
>>> things parallel software development, from weekly thought leadership
>>> blogs to
>>> news, videos, case studies, tutorials and more. Take a look and join the
>>> conversation now. http://goparallel.sourceforge.net/
>>> ___
>>> Dbpedia-gsoc mailing list
>>> Dbpedia-gsoc@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/dbpedia-gsoc
>>>
>>>
>>
>


Re: [Dbpedia-gsoc] GSoC

2015-03-13 Thread David Przybilla
Hi Denis,

On the 1st and 3rd there have been some comments which you can take a look
at. On the 2nd one there is currently a vanilla idea, which can be
improved in several ways. If you want to know more details, please let us
know.

Regardless of the project, a good start would be to take a look at the
warm-up tasks [1]

[1]
https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Warm-up-tasks



On Fri, Mar 13, 2015 at 1:12 PM, Thiago Galery  wrote:

> Hi Denis, thanks for your interest. The subjects you picked have been
> discussed a lot in a thread by a prospective student named Abhishek Gupta.
> I suggest you take a look at some threads in the mailing list here
> http://sourceforge.net/p/dbpedia/mailman/dbpedia-gsoc/?viewmonth=201503
> looking for replies with his name, mine (Thiago Galery), or David
> Przybilla's.
> If you have any questions, feel free to email the mailing list and we'll
> get back to you as soon as we can.
> All the best,
> Thiago
>
> On Fri, Mar 13, 2015 at 4:52 AM, Денис Тисов 
> wrote:
>
>> Hello!
>>
>> My name is Denis. I am a Java developer and also a fourth-year
>> student. I live in Moscow, Russia.
>>
>> I currently work part-time at NetCracker as a Software Engineer (Java
>> Developer). I am interested in Java technologies. In autumn 2014 I
>> studied at the NetCracker Learning Centre. As part of the course, my team
>> and I had to develop a study web project. I developed an Aspect-Oriented
>> Data structure in Oracle DB and integrated it with JPA (learning about
>> composite primary keys etc.). I also developed authorisation for this
>> project via Spring Security. For MVC we used Spring MVC, JSP, DAO.
>>
>> Also, for fun, I made some mini-projects with JMS, Hibernate, and
>> Stateless and Stateful Session Beans.
>>
>> By the way, in the latest Mathematical Olympiad for future Masters at
>> MIPT I took one of the second places.
>>
>> Skills: Java Core (Collections, Concurrency, reflections, XML, RegExp
>> etc), Spring Dependency Injection, Spring Security, Spring MVC, JPA, JDBC,
>> JSP, Servlets, JQuery, Ajax, OOP, Design Patterns.
>>
>> If I don't know some technology, it's not a problem. I learn fast.
>>
>> I would like to ask you to give me some tasks now (maybe bug fixes or
>> developing something) so that I can prove my interest in your project. I
>> read about your projects, and I liked these ones,
>> in order of priority for me:
>> 1. DBpedia Spotlight - Better Surface Form Matching
>> 2. DBpedia Spotlight - Confidence/Relevance Scores
>> 3. DBpedia Spotlight - Better Context Vectors
>>
>> Sorry about my English.
>>
>> Very truly yours Denis Tisov.
>>
>
>
>


Re: [Dbpedia-gsoc] GSoC 2015

2015-03-16 Thread David Przybilla
Hi Olek,



On Sat, Mar 14, 2015 at 10:50 PM, Oleksandr Olgashko <
alexandrolg...@gmail.com> wrote:

> So far, I have done the warm-up tasks, and I am now going to dig into the
> source code.
> Could you please review my thoughts about the 5.19 task?
>
> (i) Is it interesting to have a unifying value for the selected candidate?
> How would you combine the values from the filters that are already in
> place?
> If I am not missing anything, the route is as follows: 1) create an
> annotated set of entities 2) design a combining function for those three
> values, e.g. the mean 3) play with the function and its coefficients to
> find the most suitable ones.
>

So yes, the current pipeline is:

  1. Get some surface forms
  2. Match those surface forms to candidate topics
  3. Get the contexts of the candidate topics
  4. Use a disambiguation function to calculate some scores (finalScore,
     percentageOfSecondRank)
  5. Filter out topics below certain thresholds
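
As a rough illustration of the five steps above, here is a minimal sketch in
Python (not Spotlight's Scala code). The toy stores, the cosine scoring, the
normalisation and all names are assumptions made up for the example; the
real pipeline differs in the details.

```python
import math
from collections import Counter

# Hypothetical toy stand-ins for the candidate store and the context store.
CANDIDATES = {"Berlin": ["Berlin", "Berlin_(band)"]}
CONTEXTS = {
    "Berlin": Counter({"capital": 5, "germany": 4, "city": 3}),
    "Berlin_(band)": Counter({"band": 5, "album": 3, "song": 2}),
}


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def annotate(text, threshold=0.1):
    """Return (surface form, best topic, normalised score) triples."""
    tokens = [t.strip(".,") for t in text.split()]
    text_context = Counter(t.lower() for t in tokens)
    results = []
    for sf in tokens:                          # 1. get surface forms
        if sf not in CANDIDATES:
            continue
        scored = []
        for topic in CANDIDATES[sf]:           # 2. match to candidate topics
            context = CONTEXTS[topic]          # 3. get the topic's context
            scored.append((cosine(text_context, context), topic))  # 4. score
        total = sum(score for score, _ in scored)
        if total == 0:
            continue
        best_score, best_topic = max(scored)
        final_score = best_score / total       # normalise across candidates
        if final_score >= threshold:           # 5. filter by threshold
            results.append((sf, best_topic, final_score))
    return results


print(annotate("Berlin is the capital city of Germany."))
```

The point of the sketch is only that the quality of step 5 depends entirely
on the vectors used in steps 3 and 4.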

There are some tools for finding the best set of parameters (confidence,
support, etc.) for a given set of annotated data, e.g.:
https://github.com/diegoceccarelli/dexter-eval

In our case we have seen that the vectors used during disambiguation are of
somewhat dodgy quality, which makes things hard no matter how good a
function you design.
There could be other methods of disambiguation which do not necessarily
rely on context vectors, or which combine them with other information, e.g.
graph information.
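
To make the parameter search concrete, here is a minimal Python sketch of
picking a score threshold against annotated data. It is not dexter-eval's
API; the function names, the F1 criterion and the toy data are all
assumptions for illustration.

```python
def f1(predicted, gold):
    """F1 between a set of predicted annotations and the gold set."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scored, gold, thresholds):
    """Try each threshold and return (threshold, f1) with the highest F1.

    scored maps an annotation to its score; ties keep the earlier threshold.
    """
    best = (None, -1.0)
    for threshold in thresholds:
        kept = {a for a, score in scored.items() if score >= threshold}
        value = f1(kept, gold)
        if value > best[1]:
            best = (threshold, value)
    return best


# Hypothetical annotated data: A and B are correct, C is noise.
scored = {"A": 0.9, "B": 0.6, "C": 0.2}
gold = {"A", "B"}
print(best_threshold(scored, gold, [0.1, 0.5, 0.8]))
```

The same loop extends to several parameters at once (e.g. confidence and
support) by iterating over their combinations.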


>
> (ii) Can the notion of entity relevance be equated with that of
> confidence?
> In general, no; that depends on how both are calculated. However, in the
> case of entity recognition, the relevance of a guess is derived from the
> features (e.g. if a word ends with "-er", that gives several points in
> favor of it being a profession) + the algorithm for context, so these
> concepts are the same.
> One possible way (specific to DBpedia) to increase the precision
> of the algorithm is to find the "number of transitions in Wikipedia"
> between words in context. Am I thinking in the right direction?
>

I agree that confidence != relevance.
I'm not sure what you mean by:
""" "number of transitions in Wikipedia" between words in context. """
Do you mean the distance between topics in the DBpedia graph?


>
>

> By the way, if I set `Confidence` to 0 in the online demo, select `n-best`
> and press `Annotate`, what do the numbers in the dropdown list mean? For
> example, for the word `First` the first two are World War I (1.00) and
> Football League First Division (1.45e-7)
>

This corresponds to a score named `finalScore`; it is based on the context
vectors and a value called `percentageOfSecondRank`, which estimates the
finalScore of the next-best entity as a percentage of the finalScore of the
current one.
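
A small Python sketch of the second-rank value as described above, using the
two scores from the question. This is a reading of the description, not
Spotlight's actual code: the function name is hypothetical, the value is
computed as a plain ratio, and returning 0.0 for the last candidate is an
assumption.

```python
def percentage_of_second_rank(final_scores):
    """Map each topic to (finalScore of the next-best topic) / (its own finalScore)."""
    ranked = sorted(final_scores.items(), key=lambda item: item[1], reverse=True)
    ratios = {}
    for i, (topic, score) in enumerate(ranked):
        # Assumption: the last-ranked topic has no next-best, so use 0.0.
        next_best = ranked[i + 1][1] if i + 1 < len(ranked) else 0.0
        ratios[topic] = next_best / score if score > 0 else 0.0
    return ratios


scores = {"World War I": 1.00, "Football League First Division": 1.45e-7}
print(percentage_of_second_rank(scores))
```

A ratio close to 0 for the top candidate means it is far ahead of the
runner-up; a ratio close to 1 means the two are nearly tied.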

If you hit the candidates endpoint you can get all of these scores. Here is
an example:

http://spotlight.sztaki.hu:/rest/candidates?confidence=0.0&text=First%20documented%20in%20the%2013th%20century,%20Berlin%20was%20the%20capital%20of%20the%20Kingdom%20of%20Prussia%20(1701%E2%80%931918),%20the%20German%20Empire%20(1871%E2%80%931918),%20the%20Weimar%20Republic%20(1919%E2%80%9333)%20and%20the%20Third%20Reich%20(1933%E2%80%9345).%20Berlin%20in%20the%201920s%20was%20the%20third%20largest%20municipality%20in%20the%20world.%20After%20World%20War%20II,%20the%20city%20became%20divided%20into%20East%20Berlin%20--%20the%20capital%20of%20East%20Germany%20--%20and%20West%20Berlin,%20a%20West%20German%20exclave%20surrounded%20by%20the%20Berlin%20Wall%20from%201961%E2%80%9389.%20Following%20German%20reunification%20in%201990,%20the%20city%20regained%20its%20status%20as%20the%20capital%20of%20Germany,%20hosting%20147%20foreign%20embassies
.
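
For reference, a small Python sketch of how such a request URL can be built.
The base URL below is a hypothetical placeholder for wherever a Spotlight
server is running; only the `/rest/candidates` path and the `confidence` and
`text` parameters are taken from the example above.

```python
from urllib.parse import quote, urlencode


def candidates_url(base_url, text, confidence=0.0):
    """Build a candidates request URL, percent-encoding the text (spaces as %20)."""
    query = urlencode({"confidence": confidence, "text": text}, quote_via=quote)
    return f"{base_url}/rest/candidates?{query}"


# Hypothetical server address; replace with a real Spotlight endpoint.
print(candidates_url("http://localhost:2222", "Berlin was the capital"))
```

The resulting URL can then be fetched with any HTTP client.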


>
> 2015-03-09 14:52 GMT+02:00 Oleksandr Olgashko :
>
>> Found the warm-up tasks for DBpedia Spotlight, sorry for the inconvenience
>>
>> 2015-03-09 13:06 GMT+02:00 Oleksandr Olgashko :
>>
>>> Thanks for the answers,
>>>
>>> On a previous project I worked on several named entity recognition
>>> classifiers (naive Bayes and conditional random field based; we used
>>> OntoNotes corpus data). I also have brief experience with Apache Spark.
>>> So 5.16 and 5.17 would probably be most suitable for me, and 5.14 is
>>> worth thinking about.
>>> Could you please give me some warm-up tasks for these ideas?
>>> Also, is it possible to use Stanford NLP (GPL license?)
>>>
>>> 2015-03-09 12:42 GMT+02:00 David Przybilla :
>>>
>>>> Hi Oleksandr,
>>>>
>>>> 5.16 and 5.17 both involve Scala + a bit of Natural Language Processing.
>>>> 5.17 is more about being able to massage a Wikipedia dump and getting
>>>> numbers out of it for named entity recognition.

Re: [Dbpedia-gsoc] Fwd: Contribute to DbPedia

2015-04-28 Thread David Przybilla
Hi Abhishek,

You are free to contribute :) I will try to keep on reviewing PRs
if that is alright.



On Tue, Apr 28, 2015 at 7:47 AM, Abhishek Gupta  wrote:

> Hi all,
>
> My proposal has not been selected for GSoC, but I still want to
> continue with my project. Can someone give me some guidelines (if I
> can continue)?
>
> Thanks,
> Abhishek
>
> On Thu, Apr 9, 2015 at 11:53 PM, Abhishek Gupta  wrote:
>
>> Hi Thiago,
>>
>> Thanks for your reply and assurance.
>> Moreover, I replied to your question about the extraction framework, and I
>> have also created an issue about using bold instances as probable
>> surface forms here.
>>
>> Thanks,
>> Abhishek
>>
>> On Thu, Apr 9, 2015 at 1:19 AM, Thiago Galery  wrote:
>>
>>> Hi Abhishek,
>>> sorry for taking so long to write to you. Things at work have been
>>> really busy. About the issue you raised about the originality of your
>>> proposal, rest assured that no one sent a proposal similar to yours.
>>>
>>> I'm happy that you sent a PR for the extraction framework. It seems that
>>> Dimitris is already taking a look at it.
>>> As for your suggestions for Spotlight, I don't really advise simply
>>> removing the stopword filter, because I remember getting a lot of crap
>>> once. Maybe it should be modified somehow. If you have a good idea
>>> and want to send a PR, it would be very welcome. I think discussing things
>>> on GitHub would be better.
>>>
>>> All the best,
>>> Thiago
>>>
>>> On Mon, Apr 6, 2015 at 6:15 AM, Abhishek Gupta 
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> Recently I was checking out the indexing process of dbpedia-spotlight
>>>> and I observed a few things:
>>>>
>>>> 1) There is a missing constructor definition in the wikiPage object, for
>>>> instance as defined in the function wikiPageCopy here.
>>>> For this I have created a PR:
>>>> https://github.com/dbpedia/extraction-framework/pull/377
>>>>
>>>> 2) For the stopword filter defined here, I did an analysis of the
>>>> conceptURI extraction with the stopword list here.
>>>> The analysis showed that we are dropping around 25481 entities, almost
>>>> all of them from important categories such as music, film, band, etc.,
>>>> e.g. Am_(musician), Home_(2015_film), The_Who. Even with case-sensitive
>>>> checking (checking whether an entity contains more than one capital
>>>> letter, since one is the default), we would still reject some
>>>> single-word entities such as Am and Home. Moreover, the garbage (can't,
>>>> etc.) we would incur by removing this filter won't be much. So I suggest
>>>> that we remove this filter.
>>>>
>>>> 3) I would like to suggest an additional surface form extraction. If we
>>>> extract the bold text in the first line of a Wikipedia article, we can
>>>> use it as a probable surface form for that entity, e.g.
>>>> Stanford_University, Aon_(company), Radio_Warwick, Phi_Gamma_Delta, etc.
>>>> These are the best surface forms for the respective entities.
>>>>
>>>> Thanks,
>>>> Abhishek

>>>> On Fri, Mar 27, 2015 at 11:56 AM, Abhishek Gupta 
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I would also like to inform you that my proposal went public in one of
>>>>> the recent mails, when Thiago accidentally sent a mail to me and the
>>>>> dbpedia-gsoc mailing list. Details of the mail are below. The Google
>>>>> Docs link was in the quoted text, and the doc could be seen and even
>>>>> edited by anyone with that link, but nobody has changed its content.
>>>>> I believe there is a chance that someone will copy my ideas, so
>>>>> I request you to take care of this issue. I hope this will not
>>>>> affect my application.
>>>>> I have now changed the sharing settings, so please inform me if
>>>>> there is any access problem.
>>>>>
>>>>> *Mail details:*
>>>>> from: Thiago Galery
>>>>> to: Abhishek Gupta <a.gu...@gmail.com>, dbpedia-gsoc
>>>>> date: Tue, Mar 24, 2015 at 3:47 AM
>>>>> subject: Re: [Dbpedia-gsoc] Fwd: Contribute to DbPedia
>>>>>
>>>>> I have also modified my

Re: [Dbpedia-gsoc] Interested in contributing

2016-02-04 Thread David Przybilla
Hi Shweta,

I am not sure what the initial idea sketches for this year are.

As a rule of thumb, I suggest:

 - Checking the DBpedia and DBpedia Spotlight mailing lists
 - Checking the GitHub repos to see what is going on [1] [2]
 - Joining the Slack chats (I'm not sure if these are meant to be public; I
   will ask around and send you a link if you are interested)
 - Checking last year's GSoC ideas [3]

As far as Spotlight is concerned, there is probably a lot to do regarding
architecture and CI, but also a lot of room to try NLP technology. There are
lots of issues regarding evaluation as well.
Regardless, a good start would be to take a look at the open issues on
GitHub.


[1] https://github.com/dbpedia-spotlight
[2] https://github.com/dbpedia
[3] http://wiki.dbpedia.org/gsoc2015/ideas



On Thu, Feb 4, 2016 at 4:03 PM, Shweta Oak  wrote:

> Hi,
>
> I am interested in working on a project for DBPedia and DBPedia Spotlight
> for GSoC '16. I would like to contribute to a project.
>
> I am currently interning at Mozilla. I work on the Kinto project. My work
> mainly focuses on the automatic discovery of Kinto servers.
>
> I have worked on a skin disease detection project, in which skin diseases
> are detected from images sent by a user. This was an android application.
> It involved machine learning - we used neural networks.
>
> I am also interested in natural language processing and data analytics. I
> plan to pursue these subjects for further studies too.
>
> It would be great if you could suggest somewhere I can start working, or
> could direct me to the kind of projects or domains the organization will
> include in this year's GSoC session, so that I can focus more on solving
> issues, understanding how the organization works in those areas, and be
> able to deliver better results.
>
> --
> Regards,
>
> Shweta Oak
>
>
> --
> Site24x7 APM Insight: Get Deep Visibility into Application Performance
> APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
> Monitor end-to-end web transactions and take corrective actions now
> Troubleshoot faster and improve end-user experience. Signup Now!
> http://pubads.g.doubleclick.net/gampad/clk?id=272487151&iu=/4140
> ___
> Dbpedia-gsoc mailing list
> Dbpedia-gsoc@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/dbpedia-gsoc
>
>