Re: [Dbpedia-gsoc] Re-Introduction

2016-03-01 Thread Thiago Galery
Hi Felix, welcome back. We might add a few warm up tasks soon. Phillip's
project might be merged on the dev branch soon and will provide a good
basis for future additions. If you are interested in Spotlight, are there
any ideas on which aspects of it you'd like to concentrate on?

On Tue, Mar 1, 2016 at 12:36 PM, Marco Fossati 
wrote:

> Hey Felix, welcome back!
>
> Marco
>
> On 3/1/16 02:13, Felix Sonntag wrote:
> > Hi everyone,
> >
> > I’m Felix, I already introduced myself last year, but I guess I’ll
> shortly reintroduce myself. I’m a Master student in Informatics at TUM in
> Munich. I’m pretty excited about the DBpedia project: I’m using Spotlight
> for a project about analyzing artist data at the moment, and I’ve used
> DBpedia data for an app last year. I already tried to participate in GSOC
> with you last year, unfortunately it didn’t work out (apparently it was
> really close :P).
> >
> > I’ve just finished my first Master's semester, with a focus on ML,
> Data Analytics and NLP in my studies.
> >
> > There are some project ideas I’m keen on working on, but I’ll post directly
> on the project sites.
> >
> > One general question: for the Spotlight project there exists only a
> rough idea by Philipp, and there are also no warm up tasks. Can we expect
> more from that? :)
> >
> > Best,
> > Felix
> >
> --
> > Site24x7 APM Insight: Get Deep Visibility into Application Performance
> > APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
> > Monitor end-to-end web transactions and take corrective actions now
> > Troubleshoot faster and improve end-user experience. Signup Now!
> > http://pubads.g.doubleclick.net/gampad/clk?id=272487151&iu=/4140
> > ___
> > Dbpedia-gsoc mailing list
> > Dbpedia-gsoc@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/dbpedia-gsoc
> >
>
>
>


Re: [Dbpedia-gsoc] Fwd: Contribute to DbPedia

2015-03-23 Thread Thiago Galery
Hi Abhishek, I suggest you submit a proposal straight to GSoC so that we can
comment there. If you have done so already, could you send us the link?
All the best,
Thiago

On Mon, Mar 23, 2015 at 8:54 AM, Abhishek Gupta  wrote:

> Hi all,
>
> Here are some comments for your response:
>
>
>> Hi Abhishek, thanks for the work, here are some answers:
>>
>> On Tue, Mar 17, 2015 at 9:10 AM, Abhishek Gupta 
>> wrote:
>>
>>> Hi Thiago,
>>>
>>> Sorry for the delay!
>>> I have set up the Spotlight server and it is running perfectly fine, but
>>> with minimal settings. After this setup I played with the Spotlight server,
>>> during which I came across some discrepancies, as follows:
>>>
>>> Example taken:
>>> http://spotlight.dbpedia.org/rest/annotate?text=First documented in the
>>> 13th century, Berlin was the capital of the Kingdom of Prussia (1701–1918),
>>> the German Empire (1871–1918), the Weimar Republic (1919–33) and the Third
>>> Reich (1933–45). Berlin in the 1920s was the third largest municipality in
>>> the world. In 1990 German reunification took place in whole Germany in
>>> which the city regained its status as the capital of Germany.
>>>
>>> 1) If we run this we annotate "13th Century" to "
>>> http://dbpedia.org/page/19th_century". This might be happening because
>>> the context is very much from 19th century and moreover in "13th Century"
>>> and "19th Century" there is minimal syntactic difference (one letter).
>>> But I am not sure whether this is good or bad.
>>>
>>
>> This might be due to either "13th Century" being wrongly linked to 19th
>> century, or maybe the word "century" being linked to many different
>> centuries which then causes a disambiguation error due to the context. I
>> think your example is a counter-example to the way we generate the data
>> structures used for disambiguation.
>>
>>
>>> In my opinion if we have an entity in our store (
>>> http://dbpedia.org/page/13th_century) which is perfectly matching with
>>> surface form in raw text ("13th Century") we should have annotated SF
>>> to the entity.
>>> And same might be the case with "Germany" which is associated to "History
>>> of Germany " not "Germany
>>> ".
>>>
>>
>> In this case other factors might have crept in; it could be that Germany
>> has a bigger number of inlinks or some other metric that allows it to
>> overtake the most natural candidate.
>>
>>
>>>
>>> 2) We are spotting "place" and associating it with "Portland Place
>>> ", maybe due to stemming
>>> SF. And even "Location (geography)
>>> " is not the correct
>>> entity type for this. This is because we are not able to detect the sense
>>> of the word "place" itself. So for that we may have to use word senses
>>> like from Wordnet etc.
>>>
>>
>> The SF spotting pipeline works a bit like this: you get a candidate SF,
>> like 'Portland Place' and see if there's a candidate for that, but you also
>> consider n-gram subparts, so it could have retrieved the candidates
>> associated with "place" instead.
>>
>
> I understand what you said, but here I wanted to point out that "place"
> is not even a noun, and we are trying to associate it with a Named
> Entity, which is a noun.
>
>
>>
>>
>>>
>>> 3) We are detecting ". Berlin" as a surface form, but I could not work
>>> out where this SF comes from. I suspect this SF doesn't come from
>>> Wikipedia.
>>>
>>
>> Although ". Berlin" is highlighted, the entity is matched on "Berlin";
>> the extra space and punctuation come from the way we tokenize sentences.
>> We have chosen a language-independent tokenizer based on a break
>> iterator for speed, but it hasn't been tested
>> very well. This area explains the mistake, and help with it is
>> much appreciated.
>>
>
> Thanks for clarification.
>
>
>>
>>
>>>
>>> 4) We spotted "capital of Germany" but I didn't get any candidates if
>>> we run for "candidates" instead of "annotate".
>>>
>>
>> This might be due to a default confidence score. If you pass the extra
>> confidence param and set it to 0, you will probably see everything, e.g.
>> /candidates/?confidence=0&text=
>> In fact, I suggest you look at all the candidates in the text you used to
>> confirm (or not) what I've been saying here.
>>
>
> I tried to do that but I still didn't get any Entity Candidate for "capital
> of Germany".
>
>
>>
>>>
>>> 5) We are able to spot "1920s" as a surface form but not "1920".
>>>
>>
>> This is due to the generation/stemming of SFs we have been discussing,
>> but I'm not sure that is a bad example. 1920, if used as a year, might not
>> mean the same as 1920s.
>>
>
> This was my mistake.
>
>
>>
>>
>
>>> Few more questions:
>>> 1) Are we trying to annotate every word, noun or entity(e.g. proper
>>> noun) in raw text? Because in the above link I found "documented" (a word
>>> not a no

Re: [Dbpedia-gsoc] Introduction

2015-03-23 Thread Thiago Galery
Hi Minjeong, I suggest you take a look at the previous messages in the
mailing list archives and check out the discussion there, so you have a
better idea of what to do. Bear in mind that the submission date is really
close, so you'd need to look into this asap.
All the best,
Thiago

On Mon, Mar 23, 2015 at 2:04 PM, Minjeong Kim  wrote:

> Hi all!
>
> I'm MJ from South Korea. I'm a CSE student at Kyungpook National
> University.
> I recently learned about the GSoC program and found the DBpedia project,
> which I wish to contribute to.
>
> Since last semester, our team has been developing a Question Answering
> System that solves quizzes, like IBM's Watson.
> While developing it, I became more interested in Natural Language Processing,
> Information Retrieval and Machine Learning.
>
> I looked through all the ideas and I would like to participate in 5.1, 5.7,
> 5.9 or the DBpedia Spotlight idea.
> I'm going to work on the warm-up tasks right away.
>
> Thanks,
> MJ
>
> --
> Dive into the World of Parallel Programming The Go Parallel Website,
> sponsored
> by Intel and developed in partnership with Slashdot Media, is your hub for
> all
> things parallel software development, from weekly thought leadership blogs
> to
> news, videos, case studies, tutorials and more. Take a look and join the
> conversation now. http://goparallel.sourceforge.net/
> ___
> Dbpedia-gsoc mailing list
> Dbpedia-gsoc@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/dbpedia-gsoc
>


Re: [Dbpedia-gsoc] GSOC 2015 - Introduction

2015-03-23 Thread Thiago Galery
Hi Vasanth, I suggest you take a look at the previous messages in the
mailing list archives and check out the discussion there, so you have a
better idea of what to do. Bear in mind that the submission date is really
close, so you'd need to look into this asap.
All the best,
Thiago

On Mon, Mar 23, 2015 at 5:07 PM, Vasanth Kalingeri <
vasanth.kaling...@gmail.com> wrote:

> Hi,
> My name is Vasanth Kalingeri. I am a 3rd year undergrad in
> computer science, pursuing my engineering degree at SJCE Mysore. I have
> completed a course on machine learning on Coursera, which led me to an
> interest in NLP. I have also been freelancing for 2 years.
> My interest in NLP grew primarily when I wanted a knowledge base
> from a given corpus of text, so that it could answer questions on the
> corpus. This led me to DBpedia and further into topic 5.1.
> I am extremely interested in building such a system to extract
> facts from a corpus. I will get working on the warm-up tasks soon.
> Regards,
> Vasanth
>
>
>
>


Re: [Dbpedia-gsoc] GSOC Introduction

2015-03-17 Thread Thiago Galery
Hi Abhishek, thanks for joining. There's quite a lot of discussion going on
about these topics. I suggest you take a look at previous threads,
searching for the names Thiago and David. There's another student with the
same first name as you who has been asking many questions, so if you
search for him in the mailing list threads, you might see questions that
you would have asked yourself. Once you have a better idea of what you want
to understand better, ping us and we'll do our best to help.
All the best,
Thiago

On Mon, Mar 16, 2015 at 4:14 PM, Marco Fossati  wrote:

> Hi Abhishek,
>
> We are already working on your pull request, thanks!
> Feel free to share any thoughts on this mailing list (except those
> specific to the repo code).
> Cheers!
>
> On 16 March 2015 at 14:28, Abhishek Tiwari  wrote:
>
>> Hi all,
>>
>> My name is Abhishek Tiwari. I am a fourth year undergraduate student at
>> IIT(BHU),Varanasi. I have been working on my semester project
>> "Identification of causal relation in natural language text with the help
>> of graph patterns". This project gave me experience of handling Stanford
>> parser(for chunking and obtaining parse tree format) and  SenseLearner(word
>> sense disambiguation).
>> I have also learnt a wide range of Python libraries such as lxml, nltk,
>> multiprocessing, networkx (for graph representation) and graph-tool. I also
>> had to use the streaming API in Hadoop while writing MapReduce jobs in
>> Python in order to manage a large number of computations.
>>
>> Currently I am trying the warm-up tasks listed for 5.1 Fact
>> extraction from wikipedia text.
>> I am also interested in the NLP topics from dbpedia-spotlight:
>> 5.15 Better Context Vectors
>> 5.16 Better Surface Form Matching
>> 5.19 Confidence/Relevance Scores
>>
>> I am going to try the warm-up tasks for these topics. Please guide me on
>> how best to understand the above topics.
>>
>> Regards,
>> Abhishek Tiwari
>>
>>
>>
>>
>
>
> --
> Marco Fossati
> http://about.me/marco.fossati
> Twitter: @hjfocs
> Skype: hell_j
>
>
>
>


[Dbpedia-gsoc] Fwd: Contribute to DbPedia

2015-03-17 Thread Thiago Galery
-- Forwarded message --
From: Thiago Galery 
Date: Tue, Mar 17, 2015 at 11:29 AM
Subject: Re: [Dbpedia-gsoc] Contribute to DbPedia
To: Abhishek Gupta 


Hi Abhishek, thanks for the work, here are some answers:

On Tue, Mar 17, 2015 at 9:10 AM, Abhishek Gupta  wrote:

> Hi Thiago,
>
> Sorry for the delay!
> I have set up the Spotlight server and it is running perfectly fine, but
> with minimal settings. After this setup I played with the Spotlight server,
> during which I came across some discrepancies, as follows:
>
> Example taken:
> http://spotlight.dbpedia.org/rest/annotate?text=First documented in the
> 13th century, Berlin was the capital of the Kingdom of Prussia (1701–1918),
> the German Empire (1871–1918), the Weimar Republic (1919–33) and the Third
> Reich (1933–45). Berlin in the 1920s was the third largest municipality in
> the world. In 1990 German reunification took place in whole Germany in
> which the city regained its status as the capital of Germany.
>
> 1) If we run this we annotate "13th Century" to "
> http://dbpedia.org/page/19th_century". This might be happening because
> the context is very much from 19th century and moreover in "13th Century"
> and "19th Century" there is minimal syntactic difference (one letter).
> But I am not sure whether this is good or bad.
>

This might be due to either "13th Century" being wrongly linked to 19th
century, or maybe the word "century" being linked to many different
centuries which then causes a disambiguation error due to the context. I
think your example is a counter-example to the way we generate the data
structures used for disambiguation.


> In my opinion if we have an entity in our store (
> http://dbpedia.org/page/13th_century) which is perfectly matching with
> surface form in raw text ("13th Century") we should have annotated SF to
> the entity.
> And same might be the case with "Germany" which is associated to "History
> of Germany <http://dbpedia.org/page/History_of_Germany>" not "Germany
> <http://dbpedia.org/page/Germany>".
>

In this case other factors might have crept in; it could be that Germany
has a bigger number of inlinks or some other metric that allows it to
overtake the most natural candidate.


>
> 2) We are spotting "place" and associating it with "Portland Place
> <http://dbpedia.org/resource/Portland_Place>", maybe due to stemming SF.
> And even "Location (geography)
> <http://dbpedia.org/page/Location_(geography)>" is not the correct entity
> type for this. This is because we are not able to detect the sense of the
> word "place" itself. So for that we may have to use word senses like from
> Wordnet etc.
>

The SF spotting pipeline works a bit like this: you get a candidate SF,
like 'Portland Place' and see if there's a candidate for that, but you also
consider n-gram subparts, so it could have retrieved the candidates
associated with "place" instead.
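The lookup described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Spotlight's actual code: the CANDIDATES dictionary, its contents, and the function name ngram_candidates are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical toy surface-form store: surface form -> candidate entities.
CANDIDATES: Dict[str, List[str]] = {
    "portland place": ["Portland_Place"],
    "place": ["Location_(geography)", "Portland_Place"],
}


def ngram_candidates(tokens: List[str], max_n: int = 3) -> List[Tuple[str, List[str]]]:
    """Return (surface form, candidates) for every n-gram found in the store."""
    found = []
    for n in range(max_n, 0, -1):  # try the longest n-grams first
        for i in range(len(tokens) - n + 1):
            sf = " ".join(tokens[i:i + n]).lower()
            if sf in CANDIDATES:
                found.append((sf, CANDIDATES[sf]))
    return found
```

With a store like this, the single token "place" in "took place" still retrieves candidates, which would explain the annotation reported in point 2.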


>
> 3) We are detecting ". Berlin" as a surface form, but I could not work
> out where this SF comes from. I suspect this SF doesn't come from
> Wikipedia.
>

Although ". Berlin" is highlighted, the entity is matched on "Berlin"; the
extra space and punctuation come from the way we tokenize sentences. We
have chosen a language-independent tokenizer based on a break iterator
for speed, but it hasn't been tested very well.
This area explains the mistake, and help with it is much
appreciated.


>
> 4) We spotted "capital of Germany" but I didn't get any candidates if we
> run for "candidates" instead of "annotate".
>

This might be due to a default confidence score. If you pass the extra
confidence param and set it to 0, you will probably see everything, e.g.
/candidates/?confidence=0&text=
In fact, I suggest you look at all the candidates in the text you used to
confirm (or not) what I've been saying here.
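A request of that shape could be built like this. It is a minimal sketch: the base URL is taken from the annotate example earlier in this message and the /candidates/ path from the fragment above, and the helper name candidates_url is made up for the example. The service address may differ on a local install.

```python
from urllib.parse import urlencode

# Assumption: same host and /rest prefix as the annotate example in this thread.
BASE = "http://spotlight.dbpedia.org/rest"


def candidates_url(text: str, confidence: int = 0) -> str:
    """Build a candidates request URL with an explicit confidence threshold."""
    query = urlencode({"confidence": confidence, "text": text})
    return f"{BASE}/candidates/?{query}"
```

urlencode takes care of escaping spaces and punctuation in the text, which matters for long example sentences like the one above.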


>
> 5) We are able to spot "1920s" as a surface form but not "1920".
>

This is due to the generation/stemming of SFs we have been discussing, but
I'm not sure that is a bad example. 1920, if used as a year, might not mean
the same as 1920s.


>
> Few more questions:
> 1) Are we trying to annotate every word, noun or entity(e.g. proper noun)
> in raw text? Because in the above link I found "documented" (a word not a
> noun or entity) annotated to "http://dbpedia.org/resource/Document".
>
>
There are two main spotters, the default one that uses a finite state
automaton generated from the surface form store to match inc

Re: [Dbpedia-gsoc] GSoC

2015-03-13 Thread Thiago Galery
Hi Denis, thanks for your interest. The subjects you picked have been
discussed a lot in a thread by a prospective student named Abhishek Gupta.
I suggest you take a look at some threads in the mailing list here
http://sourceforge.net/p/dbpedia/mailman/dbpedia-gsoc/?viewmonth=201503
looking for replies with his name, mine (Thiago Galery), or David
Przybilla's.
If you have any questions, feel free to email the mailing list and we'll
get back to you as soon as we can.
All the best,
Thiago

On Fri, Mar 13, 2015 at 4:52 AM, Денис Тисов  wrote:

> Hello!
>
> My name is Denis. I am Java Developer and simultaneously fourth-year
> student. I live in Moscow, Russia.
>
> I currently work part time at NetCracker as a Software Engineer (Java
> Developer). I am interested in Java technologies. In autumn 2014 I
> studied at the NetCracker Learning Centre. As part of the course, my team
> and I had to develop a study web project. I developed an Aspect-Oriented
> Data structure in Oracle DB and integrated it with JPA (learned about
> composite primary keys, etc.). I also developed authorisation for this
> project via Spring Security. For MVC we used Spring MVC, JSP, DAO.
>
> Also for fun I made some mini-projects with JMS, hibernate, Stateless and
> Stateful Session Beans.
>
> By the way on the last I took one of the second places in the Mathematical
> Olympiad for future Masters at MIPT.
>
> Skills: Java Core (Collections, Concurrency, reflections, XML, RegExp
> etc), Spring Dependency Injection, Spring Security, Spring MVC, JPA, JDBC,
> JSP, Servlets, JQuery, Ajax, OOP, Design Patterns.
>
> If I don't know some technologies, it's not a problem. I learn fast.
>
> I would like to ask you to give me some tasks now (maybe bug fixes or
> developing something) so that I can prove my interest in your project. I
> read about your projects, and I liked these:
> In order of priority for me:
> 1. DBpedia Spotlight - Better Surface Form Matching
> 2. DBpedia Spotlight - Confidence/Relevance Scores
> 3. DBpedia Spotlight - Better Context Vector.
>
> Sorry about my English.
>
> Very truly yours Denis Tisov.
>


Re: [Dbpedia-gsoc] Contribute to DbPedia

2015-03-12 Thread Thiago Galery
Hi Abhishek, thanks for the contribution. Your suggestions are pretty much
aligned with what we were thinking in any event, and the initial plan
seems good.
On the assumption that there's some code that generates extra possible
surface forms from a canonical surface form, like your 'Michael Jordan' ->
'M. Jordan', 'Jordan' and so on example, it would be worth looking in the
literature on Machine Translation on how to establish some score for the
surface form. That is, if you spot 'M Jordan' in the text, what is the
probability of it being a translation of the canonical name 'Michael
Jordan'? If there's a simple way to implement this, we could try to get
the raw data with counts, generate some extra SFs in a principled manner and
use that to calculate probabilities. Still, for the moment, I'd focus on
setting the Spotlight server up and playing with the warm-up tasks.
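As a rough illustration of the "raw data with counts" idea, a relative-frequency estimate could look like the sketch below. The counts are invented for the example, and sf_given_entity is a hypothetical helper; a real translation-style model would likely need something more refined than plain relative frequency.

```python
from typing import Dict


def sf_given_entity(counts: Dict[str, int]) -> Dict[str, float]:
    """Relative-frequency estimate of P(surface form | canonical name)."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {sf: c / total for sf, c in counts.items()}


# Invented counts of how often each form is used as a link to Michael_Jordan.
counts = {"Michael Jordan": 80, "M. Jordan": 5, "Jordan": 15}
probs = sf_given_entity(counts)
```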
Thanks for the good work,
Thiago


Re: [Dbpedia-gsoc] Contribute to DbPedia

2015-03-11 Thread Thiago Galery
need to do some improvements on the context store as well. This is why
there's another gsoc Idea related to that.


> *Approach 2: This approach might have high time complexity*
>
> Instead of finding candidate entities without using context we can use our
> context to some extent.
> 1) We locate our context in a connected entities graph using the context
> of sequence of tokens.
> 2) Find all entities linked to our context and they will be our candidate
> entities using Babelfy approach.
> 3) Pass all the candidate entities to the function mentioned in step 2 of
> Approach 1.
> 4) Pass SoT from the same function (part (a) and (b))
> 5) Score candidates using Levenshtein Distance
>
> Actually in Approach 2 we are doing a bit of disambiguation in Step 1
> itself which will reduce our count of sFFs.
>
> Please review these ideas and provide your feedback.
>
>
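Step 5 of the approach quoted above (scoring candidates by Levenshtein distance) can be sketched as below. This is a plain dynamic-programming edit distance; the function names and the lowercasing step are choices made for the example, not something taken from Spotlight.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit cost for insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def rank_candidates(sf: str, candidates: List[str]) -> List[str]:
    """Order candidate names by edit distance to the surface form."""
    return sorted(candidates, key=lambda c: levenshtein(sf.lower(), c.lower()))
```

Note that string distance alone separates "13th century" from "19th century" by a single edit, so it would probably need to be combined with the context score rather than replace it.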
I'm not sure whether I understand this entirely, but I'm very interested in
other ways to conceptualise context. Spotlight just uses a simple
distributional method, but you can definitely use the link structure within
wikipedia to find candidates that are more related to themselves. In your
example above the pair Motorcycle - Harley Davidson would be much more
related than Motorcycle - Hard Drive for example. However, this would
require coding from scratch, so bear in mind that it might be too much
work.
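To make the link-structure idea concrete, here is a toy sketch that scores relatedness as the Jaccard overlap of inlink sets. The inlink sets are invented for illustration and Jaccard is just one possible measure chosen for the example; nothing here is existing Spotlight code.

```python
from typing import Dict, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap of two inlink sets, between 0.0 and 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Invented inlink sets: pages that link to each entity.
INLINKS: Dict[str, Set[str]] = {
    "Motorcycle": {"Vehicle", "Engine", "Milwaukee", "Road"},
    "Harley-Davidson": {"Engine", "Milwaukee", "Road", "Brand"},
    "Hard_disk_drive": {"Computer", "Storage", "Brand"},
}

related = jaccard(INLINKS["Motorcycle"], INLINKS["Harley-Davidson"])
unrelated = jaccard(INLINKS["Motorcycle"], INLINKS["Hard_disk_drive"])
```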


> Moreover, I am trying to set up the server on my own PC, which is
> taking some time due to a 10 GB file. I will report back as soon as
> I have some results. Until then I might follow up with some other warm-up
> task related to project ideas 5.15 and 5.16.
>
> Regards,
> Abhishek
>

Let us know if you need any help.

All the best,

Thiago Galery

>
> On Sun, Mar 8, 2015 at 11:47 PM, Thiago Galery  wrote:
>
>> Hi Abhishek, here are some thoughts about some of your questions:
>>
>> I would like to ask a few questions:
>>> 1) Are we designing these vectors to use in the disambiguation step of
>>> Entity Linking (matching raw text entity to KB entity) or Is there any
>>> other task we have in mind where these vectors can be employed?
>>>
>>
>>
>> The main focus would be disambiguation, but one could reuse the
>> contextual score of the entity to determine how relevant that entity is for
>> the text.
>>
>>
>>
>>> 2) At present which model is used for disambiguation in
>>> dbpedia-spotlight?
>>>
>>
>>
>> Correct me if I am wrong, but I think that disambiguation is done by
>> cosine similarity (on term frequency) between the context surrounding the
>> extracted surface form and the context associated with each candidate
>> entity associated with that surface form.
>>
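The cosine-similarity disambiguation described above can be sketched as follows. It is a simplified illustration over raw term frequencies; the function names and the shape of entity_contexts are assumptions for the example, and Spotlight's real stores and weighting may differ.

```python
import math
from collections import Counter
from typing import Dict, List


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def disambiguate(context_tokens: List[str],
                 entity_contexts: Dict[str, Counter]) -> str:
    """Pick the candidate whose stored context is closest to the text context."""
    ctx = Counter(context_tokens)
    return max(entity_contexts, key=lambda e: cosine(ctx, entity_contexts[e]))
```

Under this scheme a surface form surrounded by words typical of another candidate's context can be pulled towards that candidate, which is one possible reading of the "13th century" case discussed elsewhere in this archive.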
>>
>>> 3) Are we trying to focus on modelling context vectors for infrequent
>>> words primarily as there might not have enough information hence difficult
>>> to model?
>>>
>>
>> The problem is not related to frequent words per se, but more about how
>> the context for each entity is determined. The map reduce job that
>> generates the stats used by spotlight extracts the surrounding words
>> (according to a window and other constraints) of each link to an entity and
>> counts them, which means that heavily linked entities have a larger context
>> than not so frequently linked ones. This creates a heavy bias for
>> disambiguating certain entities, hence a case where smoothing might be a
>> good call.
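One simple form of smoothing, shown here purely as an example, is additive (Laplace) smoothing of the context counts. The thread does not say which smoothing method is intended, so both the method and the names smoothed_prob, vocab_size and alpha are assumptions.

```python
from typing import Dict


def smoothed_prob(token: str, context_counts: Dict[str, int],
                  vocab_size: int, alpha: float = 1.0) -> float:
    """Additively smoothed P(token | entity context)."""
    total = sum(context_counts.values())
    return (context_counts.get(token, 0) + alpha) / (total + alpha * vocab_size)
```

With this, a rarely linked entity with an empty or tiny context still assigns non-zero probability to unseen words instead of being ruled out outright.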
>>
>>
>>
>>>
>>>
>>> Regarding Project 5.16 (DBpedia Spotlight - Better Surface form Matching
>>> ):
>>>
>>> *How to deal with linguistic variation: lowercase/uppercase surface
>>> forms, determiners, accents, unicode, in a way such that the right
>>> generalizations can be made and some form of probabilistic structure can
>>> be determined in a principled way?*
>>> For dealing with linguistic variations we can calculate lexical
>>> translation probability from all probable name mentions to entities in KB
>>> as shown in Entity Name Model in [2].
>>>
>>> *Improve the memory footprint of the stores that hold the surface forms
>>> and their associated entities.*
>>> In what respect are we planning to improve the footprint: in terms
>>> of space, association, or something else?
>>>
>>> For this project I have a couple of questions in mind:
>>> 1) Are we planning to improve the same model that we are using in
>>> dbpedia-spotlight for entity linking?
>>>
>>
>> Yes
>>
>>
>

Re: [Dbpedia-gsoc] Contribute to DbPedia

2015-03-08 Thread Thiago Galery
>>> Hi all,
>>>
>>> Recently I checked out the DBpedia ideas list for GSoC 2015, and I
>>> must admit that every idea is more interesting than the
>>> previous one. While looking for ideas that interest me, I found the
>>> following ideas most fascinating. I wish I could work on all of them, but
>>> unfortunately I can't:
>>>
>>> 1) 5.1 Fact Extraction from Wikipedia Text
>>>
>>> 2) 5.9 Keyword Search on DBpedia Data
>>>
>>> 3) 5.16 DBpedia Spotlight - Better Context Vectors
>>>
>>> 4) 5.17 DBpedia Spotlight - Better Surface form Matching
>>>
>>> 5) 5.19 DBpedia Spotlight - Confidence / Relevance Scores
>>>
>>> But in all these I found a couple of ideas interlinked, in other words
>>> one solution might lead to another. In 5.1, 5.16 and 5.17, our primary
>>> problems are Entity Linking (EL) and Word Sense Disambiguation (WSD) from
>>> raw text to DBpedia entities so as to understand raw text and disambiguate
>>> senses or entities. So if we can address these two tasks efficiently then
>>> we can solve problems associated with these three ideas.
>>>
>>> Following are some methods which were there in the research papers
>>> mentioned in references of these ideas.
>>>
>>> 1) FrameNet: Identify frames (indicating a particular type of situation
>>> along with its participants, i.e. task, doer and props), and then identify
>>> Logical Units, and their associated Frame Elements by using models trained
>>> primarily on crowd-sourced data. Primarily used for Automatic Semantic Role
>>> Labeling.
>>>
>>> 2) Babelfy: Using a wide semantic network, encoding structural and
>>> lexical information of both type encyclopedic and lexicographic like
>>> Wikipedia and WordNet resp., we can also accomplish our tasks (EL and WSD).
>>> In this a graphical method along with some heuristics is used to extract
>>> out the most relevant meaning from the text.
>>>
>>> 3) Word2vec / Glove - Methods for designing word vectors based on the
>>> context. These are primarily employed for WSD.
>>>
>>> Moreover if those problems are solved then we can address keyword search
>>> (5.9) and Confidence Scoring (5.19) effectively as both require association
>>> of entities to the raw text which will provide concerned entity and its
>>> attributes to search with and the confidence score.
>>>
>>> So I would like to work on 5.16 or 5.17 which will encompass those two
>>> tasks (EL and WSD) and for this I would like to ask which method will be
>>> the best for these two tasks? In my opinion the Babelfy method
>>> would be appropriate for both of these tasks.
>>>
>>> Thanks,
>>> Abhishek Gupta
>>> On Feb 23, 2015 5:46 PM, "Thiago Galery"  wrote:
>>>
>>>> Hi Abhishek, if you are interested in contributing to any DBpedia
>>>> project or participating in GSoC this year it might be a good idea to take
>>>> a look at this page http://wiki.dbpedia.org/gsoc2015/ideas . This
>>>> might help you to specify how/where you can contribute. Hope this helps,
>>>> Thiago
>>>>
>>>> On Sun, Feb 22, 2015 at 2:09 PM, Abhishek Gupta 
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I am Abhishek Gupta, a student of Electrical Engineering at IIT
>>>>> Delhi. Recently I have worked on projects related to Machine Learning
>>>>> and Natural Language Processing (i.e. Information Extraction), in which I
>>>>> extracted Named Entities from raw text to populate a knowledge base with new
>>>>> entities. Hence I am inclined to work in this area. Besides this, I am also
>>>>> familiar with programming languages, primarily C, C++ and Java.
>>>>>
>>>>> So I presume that I can contribute a lot towards extracting structured
>>>>> data from Wikipedia, which is one of the primary steps towards DBpedia's
>>>>> primary goal.
>>>>>
>>>>> So can anyone please help me figure out where to start so as to
>>>>> contribute towards this?
>>>>>
>>>>> Regards
>>>>> Abhishek Gupta
>>>>>
>>>>>

Re: [Dbpedia-gsoc] DBpedia Spotlight – Better Tools for Model Creation - Ideas

2015-03-08 Thread Thiago Galery
Hi Naveen, in the wiki you can find the papers from the Spotlight community
that might shed light on that. In general terms, Spotlight implements an
entity mention model for entity linking, and a very helpful paper for
understanding that is this one:
https://aclweb.org/anthology/P/P11/P11-1095.pdf .
All the best,
Thiago

On Sun, Mar 8, 2015 at 11:37 AM, Naveen Madhire 
wrote:

> Hi David,
>
>
> I cloned the latest code from the dbpedia-spotlight repository and did a
> build using IntelliJ. I ran a few of the examples listed here:
> https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Run-from-a-JAR
>
> However, I still didn't get the gist of how Spotlight works.
>
> I am going to play around with the Spotlight editor and see how it works.
>
> Do you have any more examples related to Spotlight?
>
>
>
> Thanks,
> Naveen
>
> On Fri, Mar 6, 2015 at 7:05 PM, David Przybilla 
> wrote:
>
>> Hi Naveen,
>>
>> In order to create a Spotlight model, we have to massage the Wikipedia
>> dump to get some statistics out of it.
>> Those statistics include the probability of seeing a surface form,
>> a context vector for each entity, etc.
>>
>> - It seems that the current script that generates those models, written in
>> Pig, is broken. Check the issues below.
>>
>> - It seems there are other projects that could benefit from this framework
>> if it is done properly.
>>
>> - Spark is a good alternative, given that it makes map/reduce problems
>> easier to model, but also because it is fast.
>>
>> A good start would be:
>>
>>  1. Play a bit with Spotlight ( http://dbpedia-spotlight.github.io/demo/ )
>>  2. Check the warm-up tasks:
>> https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Warm-up-tasks
>>  3. Compile and run Spotlight locally
>>  4. Understand what the stores are and what values are inside them
>>  5. Take a look at the script generating the current model:
>> https://github.com/dbpedia-spotlight/pignlproc
>>
>> Issues:
>> https://github.com/dbpedia-spotlight/dbpedia-spotlight/issues/329
>> https://github.com/dbpedia-spotlight/dbpedia-spotlight/issues/321
>>
>>
>> On Fri, Mar 6, 2015 at 8:59 PM, Naveen Madhire 
>> wrote:
>>
>>> Hi Team,
>>>
>>>
>>> I am currently pursuing a Master's in Data Science at Indiana University.
>>> I am very much interested in participating in this year's GSoC, and the
>>> idea listed on DBpedia's website caught my eye as I am comfortable with
>>> Apache Spark, entity linking and Java.
>>>
>>> DBpedia Spotlight – Better Tools for Model Creation
>>>
>>> I don't see any discussion happening in the archives.
>>>
>>> If possible, could anyone share references and details that would help me
>>> understand the current project?
>>>
>>> Please let me know.
>>>
>>>
>>> Thanks,
>>> Naveen M
>>>
>>>
>>>
>>>
>>
>
>
>
>
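As a rough illustration of the statistics David describes above (the probability of seeing a surface form, and a context vector for each entity), here is a minimal Python sketch. It is not the pignlproc or Spotlight code: the function names, the input format (pairs harvested from Wikipedia link anchors) and the toy data are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def surface_form_stats(anchors, text_counts):
    """Toy surface-form statistics (hypothetical helper, not Spotlight's API).

    anchors:     iterable of (surface_form, entity) pairs, e.g. taken from
                 Wikipedia link anchors.
    text_counts: dict mapping a surface form to the number of times the
                 string occurs in the text at all, linked or not.
    """
    linked = Counter()
    per_entity = defaultdict(Counter)
    for surface_form, entity in anchors:
        linked[surface_form] += 1
        per_entity[surface_form][entity] += 1

    stats = {}
    for surface_form, n_linked in linked.items():
        # Guard against text counts that are missing or too small.
        total = max(text_counts.get(surface_form, n_linked), n_linked)
        stats[surface_form] = {
            # How often the string is a link when it appears at all.
            "p_annotated": n_linked / total,
            # Which entity the string points to, given that it is a link.
            "p_entity": {e: c / n_linked
                         for e, c in per_entity[surface_form].items()},
        }
    return stats


def context_vectors(mentions):
    """Bag-of-words context vector per entity.

    mentions: iterable of (entity, context_tokens) pairs, where the tokens
              are the words surrounding a link to that entity.
    """
    vectors = defaultdict(Counter)
    for entity, tokens in mentions:
        vectors[entity].update(t.lower() for t in tokens)
    return vectors


if __name__ == "__main__":
    anchors = [("Berlin", "Berlin"), ("Berlin", "Berlin"),
               ("Berlin", "Berlin,_New_Hampshire"), ("Munich", "Munich")]
    print(surface_form_stats(anchors, {"Berlin": 6, "Munich": 1}))
```

On a full Wikipedia dump the same counting would have to be distributed, which is where a map/reduce framework such as Pig or Spark comes in; the counting logic itself stays this simple.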


Re: [Dbpedia-gsoc] GSOC Aspirant

2015-03-08 Thread Thiago Galery
Hi Juhi, here are some links that might be relevant to you :

https://github.com/dbpedia-spotlight/dbpedia-spotlight/issues/291

https://github.com/dbpedia-spotlight/dbpedia-spotlight/pull/298

They are more related to the surface form matching, hope it helps.




On Sat, Mar 7, 2015 at 8:30 AM, JUHI TANDON  wrote:

> Hello Thiago,
>
> Thanks for the reply.
> Could you please provide the links to the discussion? I found these pages
> https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Warm-up-tasks
> and https://github.com/dbpedia-spotlight/dbpedia-spotlight/issues/ but
> nothing in particular related to these two projects.
>
> Regards,
> Juhi
>
> On Thu, Mar 5, 2015 at 2:36 AM, Thiago Galery  wrote:
>
>> Hi Juhi, as a follow-up to David, I highly recommend taking a look
>> at the issues and PR pages of DBpedia Spotlight's GitHub repo. There's some
>> detailed discussion on these two issues.
>> All the best, Thiago.
>> On 4 Mar 2015 10:09, "David Przybilla"  wrote:
>>
>>> Hi Juhi,
>>>
>>> Any particular interest among those two?
>>> In the description we referenced a few papers which could be used as
>>> potential ideas.
>>>
>>> DBpedia Spotlight – Better Surface Form Matching:
>>>
>>> We are currently having some problems with linguistic variations. One of
>>> the ideas (but it is not fixed, so feel free to bring yours as well :] )
>>> could be to use the superstring matching from the Babelfy paper. But it
>>> would also be nice to mix it with some of the methods described in [1].
>>>
>>>
>>> DBpedia Spotlight – Better Context Vectors:
>>>
>>> The quality of the context vectors seems a bit weird for some entities.
>>> Values are not normalised, among other issues.
>>> The aim here is to improve context vectors/disambiguation.
>>>
>>> One idea could be to use GloVe/word2vec entity vectors.
>>> Another could be to bring in some ideas from discourse parsing, like the
>>> ones described in [2].
>>>
>>>
>>> A good start, if you are interested, would probably be to:
>>>
>>>  - take a quick look at the mentioned papers to get some ideas
>>>  - play a bit with Spotlight ( demo:
>>> http://dbpedia-spotlight.github.io/demo/ )
>>>  - try to set it up locally (compile it, run it)
>>>  - try to understand how the Spotlight stores work. The paper
>>> describing how Spotlight works should be a good overview [3]
>>>
>>> [1] https://aclweb.org/anthology/P/P11/P11-1095.pdf
>>> [2] http://www.aclweb.org/anthology/C14-1213
>>> [3] http://blog.semantic-web.at/wp-content/uploads/2011/09/p1_mendes.pdf
>>>
>>> On Wed, Mar 4, 2015 at 9:23 AM, Marco Fossati 
>>> wrote:
>>>
>>>> Hi Juhi,
>>>>
>>>> you should send all your inquiries to the mailing list (in CC).
>>>> Please check out the following page with our project ideas:
>>>> http://wiki.dbpedia.org/gsoc2015/ideas
>>>>
>>>> Cheers!
>>>>
>>>> On 3/3/15 2:58 PM, JUHI TANDON wrote:
>>>> > Hello Marco,
>>>> >
>>>> > I am Juhi Tandon, pursuing my major in Computational Linguistics at
>>>> > IIIT Hyderabad. I am an NLP enthusiast, and as such I found
>>>> > these projects particularly interesting:
>>>> >
>>>> >
>>>> >   DBpedia Spotlight – Better Context Vectors and
>>>> >
>>>> >
>>>> >   DBpedia Spotlight – Better Surface Form Matching
>>>> >
>>>> > I would like to contribute to one of these projects as a part of the
>>>> > GSoC 2015 program. It would be great if the mentors could provide some
>>>> > insights on where to begin.
>>>> >
>>>> > Thanks and Regards,
>>>> >
>>>> > Juhi
>>>>
>>>> --
>>>> Marco Fossati
>>>> http://about.me/marco.fossati
>>>> Twitter: @hjfocs
>>>> Skype: hell_j
>>>>
>>>>
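To make the "values are not normalised" point from the Better Context Vectors idea above more concrete, here is a minimal Python sketch of L2-normalising bag-of-words context vectors and picking a candidate entity by cosine similarity. This is an illustration under assumptions, not Spotlight's disambiguation code: the function names and the toy candidate vectors are hypothetical.

```python
import math
from collections import Counter


def l2_normalise(vec):
    """Scale a sparse vector (dict token -> weight) to unit length."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    if norm == 0:
        return {}
    return {token: v / norm for token, v in vec.items()}


def cosine(a, b):
    """Cosine similarity of two sparse vectors."""
    a, b = l2_normalise(a), l2_normalise(b)
    return sum(weight * b.get(token, 0.0) for token, weight in a.items())


def disambiguate(context_tokens, candidates):
    """Return the candidate entity whose context vector is closest to the
    words surrounding the mention.

    candidates: dict mapping an entity name to its raw context vector.
    """
    query = Counter(t.lower() for t in context_tokens)
    return max(candidates, key=lambda entity: cosine(query, candidates[entity]))


if __name__ == "__main__":
    # Toy vectors: the first entity has much larger raw counts.
    candidates = {
        "Jaguar_Cars": Counter({"car": 50, "engine": 30}),
        "Jaguar": Counter({"cat": 5, "jungle": 3}),
    }
    print(disambiguate(["the", "cat", "in", "the", "jungle"], candidates))
```

Without the normalisation step, an entity with many occurrences (and therefore large raw counts) can dominate a plain dot product regardless of how well its context actually matches; dividing by the vector length removes that bias. Dense GloVe/word2vec vectors could be plugged into the same cosine comparison.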

Re: [Dbpedia-gsoc] GSoC 2015 Introduction

2015-03-08 Thread Thiago Galery
Hi Robert, I would advise taking a look at Marco's response to another
prospective student. He points to these links for a summary of a similar
project in 2014:

- idea: http://wiki.dbpedia.org/gsoc2014/ideas#h359-11
- proposal:
https://docs.google.com/document/d/16lAqKLAsAGQW0cp9SA0Egb1vlb6mPCcHYezVN-zB870/edit?pli=1
- stuff done:
https://github.com/dbpedia/extraction-framework/wiki/GSoC-2014-Progress-Sergey-Skovorodkin


On Fri, Mar 6, 2015 at 12:04 PM,  wrote:

> Hello everybody,
>
> first off I'd like to introduce myself. I'm Robert, currently a Master's
> student at Mannheim University. I'm studying Business Informatics and
> pursuing the Data and Web Science Specialization Track. One of my major
> interests lies in Data Mining, and I constantly complement my studies with
> Data Mining related online courses (MOOCs) during my free time. Alongside
> my studies I'm also employed as a student researcher at the Data and Web
> Science research group [1] under the supervision of Prof. Bizer. You will
> find many of the group's professors mentioned in many of the papers you
> suggest as a starting point. A major part of the research is dedicated to
> Open Linked Data, so the teaching is closely tied to examples from
> research projects.
>
> Furthermore, during one of my previous internships I was involved in
> building an Active Learning system for Named Entity Recognition, which has
> also enhanced my experience within this field. The first time I got in
> touch with NLP and Machine Learning was during my Bachelor Thesis, which
> was concerned with the classification of Scientific Papers.
>
> Now coming to the GSoC project:
>
> My first priority would be to work on "5.7. Reverse Engineering and
> Aligning Freebase with DBpedia." I have a working knowledge of SPARQL and
> the Freebase MQL query language if needed. During my previous semester I
> used DBpedia and Freebase to perform web data integration in a closed
> domain. So I'm aware of schema integration and schema matching procedures,
> which I think qualifies me fairly well, along with my programming
> experience. After digging into the proposal of the project, some
> uncertainties arose. In the description you mention the introduction of
> new properties and classes if needed. Your first reference [2] is mainly
> concerned with the reduction/fusion of closely related or equivalent
> properties.
>
> - Can you give me an intuition of a situation where a need for a new class
> or property would arise?
>
> - Can you also please give an example of tools that are based on Freebase
> and that should be easily migrated to DBpedia?
>
> - Speaking of the current approaches to mapping classes and properties, is
> there any work currently going on that deals with hierarchies of subjects
> and objects?
>
> - Related to [2], do S1 and O1 represent actual subjects and objects or
> rdf:type classes of S1 and O1? I think one problem could (at least
> partially) solve the other, namely using a trustworthy class mapping could
> assist in working out equivalent property mappings and vice versa.
>
> I would be available full-time during the GSoC period, and I would
> naturally get up to speed on the latest research before it starts.
>
> - Can you please advise me what would be the next step?
>
> - The project mentioned above is only one of my interests given your
> proposals. Do I have to elaborate on my interest in my second and third
> priorities in a similar way?
>
> Best regards
>
> Robert
>
> [1] http://dws.informatik.uni-mannheim.de/en/home/
> [2] http://wiki.knoesis.org/index.php/Property_Alignment
>
>
>
>

Re: [Dbpedia-gsoc] GSOC Aspirant

2015-03-04 Thread Thiago Galery
Hi Juhi, as a follow-up to David, I highly recommend taking a look at the
issues and PR pages of DBpedia Spotlight's GitHub repo. There's some
detailed discussion on these two issues.
All the best, Thiago.
On 4 Mar 2015 10:09, "David Przybilla"  wrote:

> Hi Juhi,
>
> Any particular interest among those two?
> In the description we referenced a few papers which could be used as
> potential ideas.
>
> DBpedia Spotlight – Better Surface Form Matching:
>
> We are currently having some problems with linguistic variations. One of
> the ideas (but it is not fixed, so feel free to bring yours as well :] )
> could be to use the superstring matching from the Babelfy paper. But it
> would also be nice to mix it with some of the methods described in [1].
>
>
> DBpedia Spotlight – Better Context Vectors:
>
> The quality of the context vectors seems a bit weird for some entities.
> Values are not normalised, among other issues.
> The aim here is to improve context vectors/disambiguation.
>
> One idea could be to use GloVe/word2vec entity vectors.
> Another could be to bring in some ideas from discourse parsing, like the
> ones described in [2].
>
>
> A good start, if you are interested, would probably be to:
>
>  - take a quick look at the mentioned papers to get some ideas
>  - play a bit with Spotlight ( demo:
> http://dbpedia-spotlight.github.io/demo/ )
>  - try to set it up locally (compile it, run it)
>  - try to understand how the Spotlight stores work. The paper
> describing how Spotlight works should be a good overview [3]
>
> [1] https://aclweb.org/anthology/P/P11/P11-1095.pdf
> [2] http://www.aclweb.org/anthology/C14-1213
> [3] http://blog.semantic-web.at/wp-content/uploads/2011/09/p1_mendes.pdf
>
> On Wed, Mar 4, 2015 at 9:23 AM, Marco Fossati 
> wrote:
>
>> Hi Juhi,
>>
>> you should send all your inquiries to the mailing list (in CC).
>> Please check out the following page with our project ideas:
>> http://wiki.dbpedia.org/gsoc2015/ideas
>>
>> Cheers!
>>
>> On 3/3/15 2:58 PM, JUHI TANDON wrote:
>> > Hello Marco,
>> >
>> > I am Juhi Tandon, pursuing my major in Computational Linguistics at
>> > IIIT Hyderabad. I am an NLP enthusiast, and as such I found
>> > these projects particularly interesting:
>> >
>> >
>> >   DBpedia Spotlight – Better Context Vectors and
>> >
>> >
>> >   DBpedia Spotlight – Better Surface Form Matching
>> >
>> > I would like to contribute to one of these projects as a part of the
>> > GSoC 2015 program. It would be great if the mentors could provide some
>> > insights on where to begin.
>> >
>> > Thanks and Regards,
>> >
>> > Juhi
>>
>> --
>> Marco Fossati
>> http://about.me/marco.fossati
>> Twitter: @hjfocs
>> Skype: hell_j
>>
>>
>>
>
>
>
>
>