[Dbpedia-gsoc] Contribute to Dbpedia

2016-01-16 Thread Satrap Rathore
Hello,

I am Satrap Rathore, a Computer Science and Engineering undergraduate from NIT
Durgapur (India). I recently started working on a machine learning project in
which I analysed stoppage patterns in public bus transport GPS traces in
developing regions.
I find machine learning algorithms very interesting, so I am inclined to work
in this area. Besides this, I am familiar with several programming languages,
primarily C, C++, SQL and Python.

I would be really grateful for a chance to contribute to DBpedia and for some
help getting started.
Waiting for a reply.

Regards
Satrap
--
Dbpedia-gsoc mailing list
Dbpedia-gsoc@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dbpedia-gsoc


Re: [Dbpedia-gsoc] Contribute to DbPedia

2015-03-17 Thread Joachim Daiber
Hi Abhishek, Thiago,

please also note that

http://spotlight.dbpedia.org/rest/annotate


does not run the current statistical version of Spotlight but the old
Lucene version. You can check the current statistical version via the demo
[1] or the endpoint URL at the bottom of that page.

We should change that but I don't have access to that server, unfortunately.

Best,
Jo

[1] http://dbpedia-spotlight.github.io/demo/
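For reference, a request to such an annotate endpoint can be built as below. This is a minimal sketch: the `confidence` and `support` parameters are standard Spotlight REST options, the host is the one quoted above, and the default values are arbitrary.

```python
# Build a GET request for a Spotlight /rest/annotate endpoint.
# Host taken from this thread; confidence/support values are arbitrary.
from urllib.parse import urlencode

def build_annotate_url(base, text, confidence=0.5, support=20):
    """Return the annotate URL for the given text and thresholds."""
    params = {"text": text, "confidence": confidence, "support": support}
    return base + "?" + urlencode(params)

url = build_annotate_url(
    "http://spotlight.dbpedia.org/rest/annotate",
    "Berlin was the capital of the Kingdom of Prussia.",
)
# To get JSON rather than the HTML demo page, send the request with an
# "Accept: application/json" header, e.g. via urllib.request.Request.
```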



On Tue, Mar 17, 2015 at 1:10 PM, Abhishek Gupta  wrote:

> Hi Thiago,
>
> Sorry for the delay!
> I have set up the spotlight server and it is running perfectly fine but
> with minimal settings. After this setup I played with the Spotlight server
> during which I came across some discrepancies as follows:
>
> Example taken:
> http://spotlight.dbpedia.org/rest/annotate?text=First documented in the
> 13th century, Berlin was the capital of the Kingdom of Prussia (1701–1918),
> the German Empire (1871–1918), the Weimar Republic (1919–33) and the Third
> Reich (1933–45). Berlin in the 1920s was the third largest municipality in
> the world. In 1990 German reunification took place in whole Germany in
> which the city regained its status as the capital of Germany.
>
> 1) If we run this we annotate "13th Century" to
> http://dbpedia.org/page/19th_century. This might be happening because
> the context is very much from 19th century and moreover in "13th Century"
> and "19th Century" there is minimal syntactic difference (one letter).
> But I am not sure whether this is good or bad.
> In my opinion if we have an entity in our store (
> http://dbpedia.org/page/13th_century) which is perfectly matching with
> surface form in raw text ("13th Century") we should have annotated SF to
> the entity.
> And the same might be the case with "Germany", which is associated with
> "History of Germany", not "Germany".
>
> 2) We are spotting "place" and associating it with "Portland Place", maybe
> due to stemming the SF. And even "Location (geography)" is not the correct
> entity type for this. This is because we are not able to detect the sense of
> the word "place" itself, so we may have to use word senses, e.g. from
> WordNet.
>
> 3) We are detecting ". Berlin" as a surface form, but I could not work out
> where this SF comes from; I suspect it does not come from Wikipedia.
>
> 4) We spotted "capital of Germany", but I did not get any candidates when
> running "candidates" instead of "annotate".
>
> 5) We are able to spot "1920s" as a surface form but not "1920".
>
> Few more questions:
> 1) Are we trying to annotate every word, noun or entity (e.g. proper noun)
> in raw text? Because in the above link I found "documented" (a word, not a
> noun or entity) annotated to http://dbpedia.org/resource/Document.
>
> 2) Are we using surface forms to deal only with syntactic references (e.g.
> the surface form "municipality" referring to "Municipality",
> "Metropolitan_municipality" or "Municipalities_of_Mexico"), or with both
> syntactic and semantic references (e.g. aliases like "Third Reich"
> referring to "Nazi Germany")?
>
> I am working on generating extra possible surface forms from
> a canonical surface form or the entity itself to deal with unseen SF
> association problems.
> I have also started working on my proposal and will submit it soon.
>
> Thanks,
> Abhishek
>
> On Thu, Mar 12, 2015 at 8:20 PM, Thiago Galery  wrote:
>
>> Hi Abhishek, thanks for the contribution. Your suggestions are pretty
>> much aligned with what we were thinking in any event, and the initial plan
>> seems good.
>> On the assumption that there's some code that generates extra possible
>> surface forms from a canonical surface form, like your 'Michael Jordan' ->
>> 'M. Jordan', 'Jordan' and so on example, it would be worth looking in the
>> Machine Translation literature at how to establish a score for the
>> surface form. That is, if you spot 'M Jordan' in the text, what is the
>> probability of it being a translation of the canonical name 'Michael
>> Jordan'? If there's a simple way to implement this, we could try to get
>> the raw data with counts, generate some extra SFs in a principled manner and
>> use that to calculate probabilities. Still, for the moment, I'd focus on
>> setting the Spotlight server up and playing with the warm-up tasks.
>> Thanks for the good work,
>> Thiago
>>
>>
>
>

Re: [Dbpedia-gsoc] Contribute to DbPedia

2015-03-17 Thread Abhishek Gupta
Hi Thiago,

Sorry for the delay!
I have set up the spotlight server and it is running perfectly fine but
with minimal settings. After this setup I played with the Spotlight server
during which I came across some discrepancies as follows:

Example taken:
http://spotlight.dbpedia.org/rest/annotate?text=First documented in the
13th century, Berlin was the capital of the Kingdom of Prussia (1701–1918),
the German Empire (1871–1918), the Weimar Republic (1919–33) and the Third
Reich (1933–45). Berlin in the 1920s was the third largest municipality in
the world. In 1990 German reunification took place in whole Germany in which
 the city regained its status as the capital of Germany.

1) If we run this we annotate "13th Century" to
http://dbpedia.org/page/19th_century. This might be happening because the
context is very much from 19th century and moreover in "13th Century" and "19th
Century" there is minimal syntactic difference (one letter). But I am not
sure whether this is good or bad.
In my opinion if we have an entity in our store (
http://dbpedia.org/page/13th_century) which is perfectly matching with
surface form in raw text ("13th Century") we should have annotated SF to
the entity.
And the same might be the case with "Germany", which is associated with
"History of Germany", not "Germany".

2) We are spotting "place" and associating it with "Portland Place", maybe
due to stemming the SF. And even "Location (geography)" is not the correct
entity type for this. This is because we are not able to detect the sense of
the word "place" itself, so we may have to use word senses, e.g. from
WordNet.

3) We are detecting ". Berlin" as a surface form, but I could not work out
where this SF comes from; I suspect it does not come from Wikipedia.

4) We spotted "capital of Germany", but I did not get any candidates when
running "candidates" instead of "annotate".

5) We are able to spot "1920s" as a surface form but not "1920".

Few more questions:
1) Are we trying to annotate every word, noun or entity (e.g. proper noun)
in raw text? Because in the above link I found "documented" (a word, not a
noun or entity) annotated to http://dbpedia.org/resource/Document.

2) Are we using surface forms to deal only with syntactic references (e.g.
the surface form "municipality" referring to "Municipality",
"Metropolitan_municipality" or "Municipalities_of_Mexico"), or with both
syntactic and semantic references (e.g. aliases like "Third Reich" referring
to "Nazi Germany")?

I am working on generating extra possible surface forms from
a canonical surface form or the entity itself to deal with unseen SF
association problems.
I have also started working on my proposal and will submit it soon.

Thanks,
Abhishek

On Thu, Mar 12, 2015 at 8:20 PM, Thiago Galery  wrote:

> Hi Abhishek, thanks for the contribution. Your suggestions are pretty much
> aligned with what we were thinking in any event, and the initial plan
> seems good.
> On the assumption that there's some code that generates extra possible
> surface forms from a canonical surface form, like your 'Michael Jordan' ->
> 'M. Jordan', 'Jordan' and so on example, it would be worth looking in the
> Machine Translation literature at how to establish a score for the
> surface form. That is, if you spot 'M Jordan' in the text, what is the
> probability of it being a translation of the canonical name 'Michael
> Jordan'? If there's a simple way to implement this, we could try to get
> the raw data with counts, generate some extra SFs in a principled manner and
> use that to calculate probabilities. Still, for the moment, I'd focus on
> setting the Spotlight server up and playing with the warm-up tasks.
> Thanks for the good work,
> Thiago
>
>


Re: [Dbpedia-gsoc] Contribute to DbPedia

2015-03-12 Thread Thiago Galery
Hi Abhishek, thanks for the contribution. Your suggestions are pretty much
aligned with what we were thinking in any event, and the initial plan
seems good.
On the assumption that there's some code that generates extra possible
surface forms from a canonical surface form, like your 'Michael Jordan' ->
'M. Jordan', 'Jordan' and so on example, it would be worth looking in the
Machine Translation literature at how to establish a score for the
surface form. That is, if you spot 'M Jordan' in the text, what is the
probability of it being a translation of the canonical name 'Michael
Jordan'? If there's a simple way to implement this, we could try to get the
raw data with counts, generate some extra SFs in a principled manner and
use that to calculate probabilities. Still, for the moment, I'd focus on
setting the Spotlight server up and playing with the warm-up tasks.
Thanks for the good work,
Thiago
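A minimal sketch of the count-based scoring suggested here: a maximum-likelihood estimate of P(surface form | entity) from (surface form, entity) co-occurrence counts. All names and numbers below are invented for illustration.

```python
# Maximum-likelihood estimate of P(surface form | entity) from
# (surface form, entity) co-occurrence counts. Counts are invented.
from collections import Counter

sf_entity_counts = Counter({
    ("Michael Jordan", "Michael_Jordan"): 900,
    ("M. Jordan", "Michael_Jordan"): 80,
    ("Jordan", "Michael_Jordan"): 20,
})

def p_sf_given_entity(sf, entity):
    """P(sf | entity) = count(sf, entity) / count(entity)."""
    total = sum(c for (s, e), c in sf_entity_counts.items() if e == entity)
    return sf_entity_counts[(sf, entity)] / total if total else 0.0

print(p_sf_given_entity("M. Jordan", "Michael_Jordan"))  # 0.08
```

With real data, the counts would come from anchor-text statistics, and the same table directly yields the probabilities Thiago describes.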


Re: [Dbpedia-gsoc] Contribute to DbPedia

2015-03-12 Thread David Przybilla
Hi Abhishek,

Those sound like good ideas. Please check my answers below:

On Wed, Mar 11, 2015 at 2:34 PM, Thiago Galery  wrote:

> Thanks for your input Abhishek! Here are some thoughts:
>
> On Wed, Mar 11, 2015 at 7:40 AM, Abhishek Gupta  wrote:
>
>> Hi all,
>>
>> After looking at the PR
>> https://github.com/dbpedia-spotlight/dbpedia-spotlight/pull/298, which I
>> assume Thiago is referring to, some questions came to my mind to
>> clarify the context:
>>
>
> Yeah, this is indeed the PR I had in mind.
>
>
>> Am I right with these two observations?
>> 1) We are creating an SF store for fast extraction of probable candidate
>> entities
>>
>
> More or less, the SF store simply stores the association of a given
> surface form to a set of candidate entities. It's quite a common thing in
> entity linking systems, whether having it as a 'store' or some other data
> structure varies a lot though.
>
>
>> 2) We are thinking of stemming so as to solve the problem of finding
>> probable candidates when we have no earlier information about a surface
>> form attaching to a particular entity: e.g. 'The giants' is associated with
>> a candidate X, but when a new surface form 'giant' appears it will not be
>> recognized, because 'giants' was not used earlier in the context of X.
>>
>
> Yeah, simply put: 'giant' can be associated with a set of candidates that
> differs from that associated with 'giants'.
>
>
>>
>> For handling the above issue I would like to state a couple of approaches:
>>
>> *Approach 1: This approach has high Space Complexity*
>>
>> From my perspective, when we create a SF storage it should satisfy
>> following properties:
>> 1) Extracting at least the correct surface form in almost all cases,
>> regardless of how many SFs we are extracting, by keeping the Surface Form
>> selection loose.
>> 2) We should be able to identify even that SF which we have not seen in
>> our data e.g. HD -> Harley Davidson
>>
>> This situation might lead us to quite a large number of candidate entities
>> and hence difficult disambiguation but this is the cost we might have to
>> pay for detecting unseen instances. But disambiguation might be able to
>> handle this because the context of entities will be quite different in most
>> of the cases.
>>
>> Prospective solutions to satisfy the above properties:
>> 1) Keeping the SF selection, and hence the candidate selection algorithm, loose.
>> 2) For this I would like to introduce another design besides stemming,
>> using the following pipeline:
>> a) Convert every entity using a function and find
>> probableSurfaceForms_1 (pSF1) like below:
>>  (i) Entity Retained: "Michael Jeffery Jordan" -> "Michael
>> Jeffery Jordan"
>>  (ii) Acronym: "Michael Jeffery Jordan" -> "MJJ"
>>  (iii) Omission: "Michael Jeffery Jordan" -> "Michael Jordan"
>>  (iv) Combination of (i), (ii), and (iii): "M.J. Jordan", "M.
>> Jordan", "M." (like Mr. M.)
>> b) Convert pSF1 to pSF2:
>>  Step1: Remove Determiners, Prepositions, Stop-words, Punctuation
>>  Step2: Convert to lowercase
>> c) Perform stemming on pSF2 & convert pSF2 to stemmedSurfaceForm
>> (sFF)
>> d) Store sFFs with indexes of corresponding entities. (We are not
>> storing pSF1s or pSF2s)
>>
>
>
One case which commonly occurs is a combination of upper and lower case, i.e.
"Michael jordan", but yeah :) this is a good direction. We partially tried
to implement something in that direction.


> We tried to do something like this in the PR you mentioned, but in a much
> less systematic way, so your suggestions are welcome.
> One thing that you need to worry about, though, concerns step (b). Spotlight
> is language agnostic to a degree, so we would need to add information about
> determiners, prepositions and so on for a series of languages. This is not
> very complicated but worth keeping in mind.
>
>
>
>
>>
>> Now let's approach from raw text - After spotting a sequence of tokens
>> (SoT) in raw text, which might be an entity reference, we should pass this
>> sequence of tokens through the same function that I mentioned in step 2
>> (part (a) and (b)) and then match the output with stored sFFs. And then we
>> can find our concerning entities. We can then also calculate the relevance
>> between the sequence of tokens and the sFFs' corresponding entities using a function
>> like Levenshtein Distance.
>>
>> I am doing some additional steps so as to address the following rare
>> situation, which we don't have in our data:
>> "My *HD* is the world's most amazing motorcycle I have ever seen."
>> This situation is quite unlikely, as there might not be any reference from
>> HD to Harley Davidson, but by context we can infer that it might be Harley
>> Davidson using the above approach.
>>
>>
> I'm not sure I get this, but we would definitely need to review the way we
> score the association between a surface form and a given candidate. In your
> example you rely on the contextual score, but it's very important to keep
> in mind that in order for the loose matching 

Re: [Dbpedia-gsoc] Contribute to DbPedia

2015-03-11 Thread Abhishek Gupta
Hi Thiago,

I have addressed your concerns below:


>>> For handling the above issue I would like to state a couple of
>>> approaches:
>>>
>>> *Approach 1: This approach has high Space Complexity*
>>>
>>> From my perspective, when we create a SF storage it should satisfy
>>> following properties:
>>> 1) Extracting at least the correct surface form in almost all cases,
>>> regardless of how many SFs we are extracting, by keeping the Surface Form
>>> selection loose.
>>> 2) We should be able to identify even that SF which we have not seen in
>>> our data e.g. HD -> Harley Davidson
>>>
>>> This situation might lead us to quite a large number of candidate entities
>>> and hence difficult disambiguation but this is the cost we might have to
>>> pay for detecting unseen instances. But disambiguation might be able to
>>> handle this because the context of entities will be quite different in most
>>> of the cases.
>>>
>>> Prospective solutions to satisfy the above properties:
>>> 1) Keeping the SF selection, and hence the candidate selection algorithm, loose.
>>> 2) For this I would like to introduce another design besides stemming,
>>> using the following pipeline:
>>> a) Convert every entity using a function and find
>>> probableSurfaceForms_1 (pSF1) like below:
>>>  (i) Entity Retained: "Michael Jeffery Jordan" -> "Michael
>>> Jeffery Jordan"
>>>  (ii) Acronym: "Michael Jeffery Jordan" -> "MJJ"
>>>  (iii) Omission: "Michael Jeffery Jordan" -> "Michael Jordan"
>>>  (iv) Combination of (i), (ii), and (iii): "M.J. Jordan", "M.
>>> Jordan", "M." (like Mr. M.)
>>> b) Convert pSF1 to pSF2:
>>>  Step1: Remove Determiners, Prepositions, Stop-words, Punctuation
>>>  Step2: Convert to lowercase
>>> c) Perform stemming on pSF2 & convert pSF2 to stemmedSurfaceForm
>>> (sFF)
>>> d) Store sFFs with indexes of corresponding entities. (We are not
>>> storing pSF1s or pSF2s)
>>>
>>
>>
> One case which commonly occurs is a combination of upper and lower case,
> i.e. "Michael jordan", but yeah :) this is a good direction. We partially
> tried to implement something in that direction.
>
>
>> We tried to do something like this in the PR you mentioned, but in a much
>> less systematic way, so your suggestions are welcome.
>> One thing that you need to worry about, though, concerns step (b).
>> Spotlight is language agnostic to a degree, so we would need to add
>> information about determiners, prepositions and so on for a series of
>> languages. This is not very complicated but worth keeping in mind.
>>
>

I put step (b) in for the same purpose that David points out. In the data we
might have instances like "M. jordan", "MJ", "Mr. m.j. jordan", which might
be rare, but we have to take care of them.

As per Thiago's concern, there might be a problem in the case where two or
more entities are converted to the same sFF and those entities have similar
contexts. This is the worst case, and one we might not be able to handle.
Otherwise, no matter how many arbitrary candidate instances we get, our
disambiguator should take care of it using the context. Even in the case
where one entity in English and one entity in, say, French both result in
the same sFF as a result of the step 2 operations, we mark both as candidate
entities and our disambiguator will choose the correct one based on context.



>
>>> Now let's approach from raw text - After spotting a sequence of tokens
>>> (SoT) in raw text, which might be an entity reference, we should pass this
>>> sequence of tokens through the same function that I mentioned in step 2
>>> (part (a) and (b)) and then match the output with stored sFFs. And then we
>>> can find our concerning entities. We can then also calculate the relevance
>>> between the sequence of tokens and the sFFs' corresponding entities using a function
>>> like Levenshtein Distance.
>>>
>>> I am doing some additional steps so as to address the following rare
>>> situation, which we don't have in our data:
>>> "My *HD* is the world's most amazing motorcycle I have ever seen."
>>> This situation is quite unlikely, as there might not be any reference from
>>> HD to Harley Davidson, but by context we can infer that it might be Harley
>>> Davidson using the above approach.
>>>
>>>
>> I'm not sure I get this, but we would definitely need to review the way
>> we score the association between a surface form and a given candidate. In
>> your example you rely on the contextual score, but it's very important to
>> keep in mind that in order for the loose matching approach to work, we
>> would need to do some improvements on the context store as well. This is
>> why there's another gsoc Idea related to that.
>>
>
I wanted to explain how we will process raw text. Let's take the example
below. After we spot a sequence of tokens ("Mr. Michael J. Jordan") using
our spotter, we pass it through the operations in step 2 and then check
whether our result (Michael J Jordan) is present in our sFF list or
not.

"Mr. Michael J. Jord

Re: [Dbpedia-gsoc] Contribute to DbPedia

2015-03-11 Thread Thiago Galery
Thanks for your input Abhishek! Here are some thoughts:

On Wed, Mar 11, 2015 at 7:40 AM, Abhishek Gupta  wrote:

> Hi all,
>
> After looking at the PR
> https://github.com/dbpedia-spotlight/dbpedia-spotlight/pull/298, which I
> assume Thiago is referring to, some questions came to my mind to
> clarify the context:
>

Yeah, this is indeed the PR I had in mind.


> Am I right with these two observations?
> 1) We are creating an SF store for fast extraction of probable candidate entities
>

More or less, the SF store simply stores the association of a given surface
form to a set of candidate entities. It's quite a common thing in entity
linking systems, whether having it as a 'store' or some other data
structure varies a lot though.
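The surface-form store described here can be sketched as a plain mapping from surface form to candidate set; the entity names below are invented for illustration.

```python
# Minimal surface-form store: surface form -> set of candidate entities.
# Entity names are invented for illustration.
from collections import defaultdict

class SurfaceFormStore:
    def __init__(self):
        self._index = defaultdict(set)

    def add(self, surface_form, entity):
        self._index[surface_form.lower()].add(entity)

    def candidates(self, surface_form):
        return self._index.get(surface_form.lower(), set())

store = SurfaceFormStore()
store.add("giants", "New_York_Giants")
store.add("giants", "San_Francisco_Giants")
store.add("giant", "Giant_(mythology)")
# "giants" and "giant" end up with disjoint candidate sets -- exactly the
# mismatch discussed in this thread.
```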


> 2) We are thinking of stemming so as to solve the problem of finding
> probable candidates when we have no earlier information about a surface
> form attaching to a particular entity: e.g. 'The giants' is associated with
> a candidate X, but when a new surface form 'giant' appears it will not be
> recognized, because 'giants' was not used earlier in the context of X.
>

Yeah, simply put: 'giant' can be associated with a set of candidates that
differs from that associated with 'giants'.


>
> For handling the above issue I would like to state a couple of approaches:
>
> *Approach 1: This approach has high Space Complexity*
>
> From my perspective, when we create a SF storage it should satisfy
> following properties:
> 1) Extracting at least the correct surface form in almost all cases,
> regardless of how many SFs we are extracting, by keeping the Surface Form
> selection loose.
> 2) We should be able to identify even that SF which we have not seen in
> our data e.g. HD -> Harley Davidson
>
> This situation might lead us to quite a large number of candidate entities
> and hence difficult disambiguation but this is the cost we might have to
> pay for detecting unseen instances. But disambiguation might be able to
> handle this because the context of entities will be quite different in most
> of the cases.
>
> Prospective solutions to satisfy the above properties:
> 1) Keeping the SF selection, and hence the candidate selection algorithm, loose.
> 2) For this I would like to introduce another design besides stemming,
> using the following pipeline:
> a) Convert every entity using a function and find
> probableSurfaceForms_1 (pSF1) like below:
>  (i) Entity Retained: "Michael Jeffery Jordan" -> "Michael Jeffery
> Jordan"
>  (ii) Acronym: "Michael Jeffery Jordan" -> "MJJ"
>  (iii) Omission: "Michael Jeffery Jordan" -> "Michael Jordan"
>  (iv) Combination of (i), (ii), and (iii): "M.J. Jordan", "M.
> Jordan", "M." (like Mr. M.)
> b) Convert pSF1 to pSF2:
>  Step1: Remove Determiners, Prepositions, Stop-words, Punctuation
>  Step2: Convert to lowercase
> c) Perform stemming on pSF2 & convert pSF2 to stemmedSurfaceForm (sFF)
> d) Store sFFs with indexes of corresponding entities. (We are not
> storing pSF1s or pSF2s)
>

We tried to do something like this in the PR you mentioned, but in a much
less systematic way, so your suggestions are welcome.
One thing that you need to worry about, though, concerns step (b). Spotlight
is language agnostic to a degree, so we would need to add information about
determiners, prepositions and so on for a series of languages. This is not
very complicated but worth keeping in mind.




>
> Now let's approach from raw text - After spotting a sequence of tokens
> (SoT) in raw text, which might be an entity reference, we should pass this
> sequence of tokens through the same function that I mentioned in step 2
> (part (a) and (b)) and then match the output with stored sFFs. And then we
> can find our concerning entities. We can then also calculate the relevance
> between the sequence of tokens and the sFFs' corresponding entities using a function
> like Levenshtein Distance.
>
> I am doing some additional steps so as to address the following rare
> situation, which we don't have in our data:
> "My *HD* is the world's most amazing motorcycle I have ever seen."
> This situation is quite unlikely, as there might not be any reference from
> HD to Harley Davidson, but by context we can infer that it might be Harley
> Davidson using the above approach.
>
>
I'm not sure I get this, but we would definitely need to review the way we
score the association between a surface form and a given candidate. In your
example you rely on the contextual score, but it's very important to keep
in mind that in order for the loose matching approach to work, we would
need to do some improvements on the context store as well. This is why
there's another gsoc Idea related to that.


> *Approach 2: This approach might have high Time Complexity*
>
> Instead of finding candidate entities without using context we can use our
> context to some extent.
> 1) We locate our context in a connected entities graph using the context
> of sequence of tokens.
> 2) Find all entities linked to our contex

Re: [Dbpedia-gsoc] Contribute to DbPedia

2015-03-11 Thread Abhishek Gupta
Hi all,

After looking at the PR
https://github.com/dbpedia-spotlight/dbpedia-spotlight/pull/298, which I
assume Thiago is referring to, some questions came to my mind to clarify the
context:

Am I right with these two observations?
1) We are creating an SF store for fast extraction of probable candidate entities
2) We are thinking of stemming so as to solve the problem of finding
probable candidates when we have no earlier information about a surface form
attaching to a particular entity: e.g. 'The giants' is associated with a
candidate X, but when a new surface form 'giant' appears it will not be
recognized, because 'giants' was not used earlier in the context of X.
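The stemming idea can be illustrated with a toy suffix stripper that lets 'giants' and 'giant' share one store key. This is deliberately minimal and not a real stemmer; a production system would use something like Porter stemming.

```python
# Toy suffix stripper: lets 'giants' and 'giant' share one store key.
# Not a real stemmer -- a production system would use e.g. Porter stemming.
def stem(token):
    for suffix in ("ies", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

print(stem("giants"))  # giant
```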


For handling the above issue I would like to state a couple of approaches:

*Approach 1: This approach has high Space Complexity*

From my perspective, when we create an SF storage it should satisfy the
following properties:
1) Extracting at least the correct surface form in almost all cases,
regardless of how many SFs we are extracting, by keeping the Surface Form
selection loose.
2) We should be able to identify even that SF which we have not seen in our
data e.g. HD -> Harley Davidson

This situation might lead us to quite a large number of candidate entities and
hence difficult disambiguation but this is the cost we might have to pay
for detecting unseen instances. But disambiguation might be able to handle
this because the context of entities will be quite different in most of the
cases.

Prospective solutions to satisfy the above properties:
1) Keeping the SF selection, and hence the candidate selection algorithm, loose.
2) For this I would like to introduce another design besides stemming, using
the following pipeline:
a) Convert every entity using a function and find
probableSurfaceForms_1 (pSF1) like below:
 (i) Entity Retained: "Michael Jeffery Jordan" -> "Michael Jeffery
Jordan"
 (ii) Acronym: "Michael Jeffery Jordan" -> "MJJ"
 (iii) Omission: "Michael Jeffery Jordan" -> "Michael Jordan"
 (iv) Combination of (i), (ii), and (iii): "M.J. Jordan", "M.
Jordan", "M." (like Mr. M.)
b) Convert pSF1 to pSF2:
 Step1: Remove Determiners, Prepositions, Stop-words, Punctuation
 Step2: Convert to lowercase
c) Perform stemming on pSF2 & convert pSF2 to stemmedSurfaceForm (sFF)
d) Store sFFs with indexes of corresponding entities. (We are not
storing pSF1s or pSF2s)
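The pipeline above can be sketched roughly as follows. The stop-word list and the stemming step are crude stand-ins, and a real implementation would plug in per-language resources.

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "in", "mr"}  # illustrative only

def generate_psf1(entity_name):
    """Step (a): entity retained, acronym, omission, combinations."""
    tokens = entity_name.split()
    forms = {entity_name}                             # (i) retained
    forms.add("".join(t[0] for t in tokens).upper())  # (ii) acronym
    if len(tokens) >= 3:
        forms.add(f"{tokens[0]} {tokens[-1]}")        # (iii) omission
        forms.add(f"{tokens[0][0]}.{tokens[1][0]}. {tokens[-1]}")  # (iv)
    return forms

def to_psf2(form):
    """Step (b): strip punctuation/stop words, lowercase."""
    tokens = re.findall(r"[a-z0-9]+", form.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS)

def to_sff(psf2):
    """Step (c): crude plural stripping as a stemming stand-in."""
    return " ".join(t[:-1] if t.endswith("s") and len(t) > 3 else t
                    for t in psf2.split())

# Step (d) would store these keys together with entity indexes.
sffs = {to_sff(to_psf2(f)) for f in generate_psf1("Michael Jeffery Jordan")}
```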

Now let's approach from raw text - After spotting a sequence of tokens
(SoT) in raw text, which might be an entity reference, we should pass this
sequence of tokens through the same function that I mentioned in step 2
(part (a) and (b)) and then match the output with stored sFFs. And then we
can find our concerning entities. We can then also calculate the relevance
between the sequence of tokens and the sFFs' corresponding entities using a function
like Levenshtein Distance.
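For the scoring step, a standard dynamic-programming Levenshtein implementation suffices; for instance, it gives distance 1 between "13th century" and "19th century", the near-miss discussed earlier in the thread.

```python
# Dynamic-programming Levenshtein edit distance between two strings,
# usable as the relevance score between a spotted SoT and a stored sFF.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

print(levenshtein("13th century", "19th century"))  # 1
```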

I am doing some additional steps so as to address the following rare
situation, which we don't have in our data:
"My *HD* is the world's most amazing motorcycle I have ever seen."
This situation is quite unlikely, as there might not be any reference from HD
to Harley Davidson, but by context we can infer that it might be Harley
Davidson using the above approach.


*Approach 2: This approach might have high Time Complexity*

Instead of finding candidate entities without using context, we can use our
context to some extent:
1) We locate our context in a connected-entities graph using the context of
the sequence of tokens.
2) Find all entities linked to our context; these will be our candidate
entities, following the Babelfy approach.
3) Pass all the candidate entities through the function mentioned in step 2
of Approach 1.
4) Pass the SoT through the same function (parts (a) and (b)).
5) Score candidates using Levenshtein Distance.

Actually, in Approach 2 we are doing a bit of disambiguation in Step 1
itself, which will reduce our count of sFFs.
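A toy version of that idea: restrict candidates to neighbours of entities already grounded in the context, then rank the survivors by edit distance in step 5. The graph and entity names below are invented for illustration.

```python
# Toy connected-entities graph; steps 1-2 of Approach 2 restrict the
# candidate set to neighbours of the located context entities.
CONTEXT_GRAPH = {
    "Basketball": {"Michael_Jordan", "NBA", "Chicago_Bulls"},
    "Machine_learning": {"Michael_I._Jordan", "Statistics"},
}

def candidates_from_context(context_entities):
    """Union of the neighbours of every grounded context entity."""
    cands = set()
    for entity in context_entities:
        cands |= CONTEXT_GRAPH.get(entity, set())
    return cands

cands = candidates_from_context({"Basketball"})
# In a basketball context, Michael_I._Jordan is never even considered, so
# part of the disambiguation happens before string matching (step 5).
```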

Please review these ideas and provide your feedback.

Moreover, I am trying to set up the server on my PC itself, which is taking
some time due to a 10 GB file. I will report back as soon as I get some
results. Till then I might follow up with some other warm-up task related to
project ideas 5.15 and 5.16.

Regards,
Abhishek

On Sun, Mar 8, 2015 at 11:47 PM, Thiago Galery  wrote:

> Hi Abhishek, here are some thoughts about some of your questions:
>
> I would like to ask a few questions:
>> 1) Are we designing these vectors to use in the disambiguation step of
>> Entity Linking (matching raw text entity to KB entity) or Is there any
>> other task we have in mind where these vectors can be employed?
>>
>
>
> The main focus would be disambiguation, but one could reuse the contextual
> score of the entity to determine how relevant that entity is for the text.
>
>
>
>> 2) At present which model is used for disambiguation in dbpedia-spotlight?
>>
>
>
> Correct me if I am wrong, but I think that disambiguation is done by
> cosine similarity (on term frequency) between the context
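That disambiguation step, cosine similarity over term-frequency vectors of the mention context and an entity's stored context, can be sketched as below; the vectors are invented for illustration.

```python
# Cosine similarity between the term-frequency vector of a mention's
# context and an entity's stored context vector. Vectors are invented.
import math

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

mention_ctx = {"motorcycle": 2, "engine": 1}
entity_ctx = {"motorcycle": 5, "engine": 2, "harley": 3}
score = cosine(mention_ctx, entity_ctx)  # high: the contexts overlap strongly
```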

Re: [Dbpedia-gsoc] Contribute to DbPedia

2015-03-09 Thread David Przybilla
Hi Abhishek,

I guess you could try to implement the spotting/disambiguation in the same
step, as the Babelfy paper suggests.

Warm up tasks:
https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Warm-up-tasks


On Mon, Mar 9, 2015 at 8:04 AM, Axel Ngonga <
ngo...@informatik.uni-leipzig.de> wrote:

>  Hello Abhishek,
>
> Cool to have you here! For the keyword search topic, please check out
> * http://goo.gl/dPbP3F
> * http://dl.acm.org/citation.cfm?id=2488488
>
> Feel free to contact me for questions and/or a warm-up task.
>
> Best regards,
> Axel
>
> Hi all,
>
> Recently I checked out the ideas list of DBpedia for GSoC 2015 and I must
> admit that every idea is more interesting than the previous one. While
> looking for ideas that interest me, I found the following most fascinating;
> I wish I could work on all of them, but unfortunately I cannot:
>
> 1) 5.1 Fact Extraction from Wikipedia Text
>
> 2) 5.9 Keyword Search on DBpedia Data
>
> 3) 5.16 DBpedia Spotlight - Better Context Vectors
>
> 4) 5.17 DBpedia Spotlight - Better Surface form Matching
>
> 5) 5.19 DBpedia Spotlight - Confidence / Relevance Scores
>
> Among all of these I found a couple of ideas interlinked; in other words,
> one solution might lead to another. In 5.1, 5.16 and 5.17 our primary
> problems are Entity Linking (EL) and Word Sense Disambiguation (WSD) from
> raw text to DBpedia entities, so as to understand raw text and disambiguate
> senses or entities. If we can address these two tasks efficiently, then we
> can solve the problems associated with all three ideas.
>
> Following are some methods which were there in the research papers
> mentioned in references of these ideas.
>
> 1) FrameNet: Identify frames (indicating a particular type of situation
> along with its participants, i.e. task, doer and props), and then identify
> Logical Units, and their associated Frame Elements by using models trained
> primarily on crowd-sourced data. Primarily used for Automatic Semantic Role
> Labeling.
>
> 2) Babelfy: Using a wide semantic network that encodes structural and
> lexical information of both encyclopedic and lexicographic kinds (e.g.
> Wikipedia and WordNet, respectively), we can also accomplish our tasks (EL
> and WSD). Here a graph-based method, together with some heuristics, is used
> to extract the most relevant meaning from the text.
>
> 3) Word2vec / Glove - Methods for designing word vectors based on the
> context. These are primarily employed for WSD.
>
> Moreover, if those problems are solved, we can address keyword search
> (5.9) and confidence scoring (5.19) effectively, since both require
> associating entities with the raw text, which provides the relevant entity
> and its attributes for the search and for the confidence score.
>
> So I would like to work on 5.16 or 5.17, which encompass those two tasks
> (EL and WSD), and I would like to ask which method would be best for them.
> In my view, the Babelfy method is appropriate for both tasks.
>
> Thanks,
> Abhishek Gupta
> On Feb 23, 2015 5:46 PM, "Thiago Galery"  wrote:
>
>>  Hi Abishek, if you are interested in contributing to any DBpedia
>> project or participating in Gsoc this year it might be a good idea to take
>> a look at this page http://wiki.dbpedia.org/gsoc2015/ideas . This might
>> help you to specify how/where you can contribute. Hope this helps,
>>  Thiago
>>
>> On Sun, Feb 22, 2015 at 2:09 PM, Abhishek Gupta 
>> wrote:
>>
>>> Hi all,
>>>
>>>  I am Abhishek Gupta. I am a student of Electrical Engineering from IIT
>>> Delhi. Recently I have worked on the projects related to Machine Learning
>>> and Natural Language Processing (i.e. Information Extraction) in which I
>>> extracted Named Entities from raw text to populate knowledge base with new
>>> entities. Hence I am inclined to work in this area. Besides this I am also
>>> familiar with programming languages like C, C++ and Java primarily.
>>>
>>>  So I presume that I can contribute a lot towards extracting structured
>>> data from wikipedia which is one of the primary step towards Dbpedia's
>>> primary goal.
>>>
>>>  So can anyone please help me out where to start from so as to
>>> contribute towards this?
>>>
>>>  Regards
>>>  Abhishek Gupta
>>>
>>>
>>> --
>>> Download BIRT iHub F-Type - The Free Enterprise-Grade BIRT Server
>>> from Actuate! Instantly Supercharge Your Business Reports and Dashboards
>>> with Interactivity, Sharing, Native Excel Exports, App Integration & more
>>> Get technology previously reserved for billion-dollar corporations, FREE
>>>
>>> http://pubads.g.doubleclick.net/gampad/clk?id=190641631&iu=/4140/ostg.clktrk
>>> ___
>>> Dbpedia-gsoc mailing list
>>> Dbpedia-gsoc@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/dbpedia-gsoc
>>>
>>>
>>
>
> --

Re: [Dbpedia-gsoc] Contribute to DbPedia

2015-03-09 Thread Axel Ngonga

Hello Abhishek,

Cool to have you here! For the keyword search topic, please check out
* http://goo.gl/dPbP3F
* http://dl.acm.org/citation.cfm?id=2488488

Feel free to contact me for questions and/or a warm-up task.

Best regards,
Axel


Hi all,

Recently I checked out the ideas list of DBpedia for GSoC 2015 and I 
should admit one thing that every idea is more interesting than the 
previous one. While I was looking out for ideas that interests me I 
found following ideas most fascinating and I wish I could work on all 
of them but unfortunately I couldn't:


1) 5.1 Fact Extraction from Wikipedia Text

2) 5.9 Keyword Search on DBpedia Data

3) 5.16 DBpedia Spotlight - Better Context Vectors

4) 5.17 DBpedia Spotlight - Better Surface form Matching

5) 5.19 DBpedia Spotlight - Confidence / Relevance Scores

But in all these I found a couple of ideas interlinked, in other words 
one solution might leads to another. Like in 5.1, 5.16, 5.17 our 
primary problems are Entity Linking (EL) and Word Sense Disambiguation 
(WSD) from raw text to DBpedia entities so as to understand raw text 
and disambiguate senses or entities. So if we can address these two 
tasks efficiently then we can solve problems associated with these 
three ideas.


Following are some methods which were there in the research papers 
mentioned in references of these ideas.


1) FrameNet: Identify frames (indicating a particular type of 
situation along with its participants, i.e. task, doer and props), and 
then identify Logical Units, and their associated Frame Elements by 
using models trained primarily on crowd-sourced data. Primarily used 
for Automatic Semantic Role Labeling.


2) Babelfy: Using a wide semantic network, encoding structural and 
lexical information of both type encyclopedic and lexicographic like 
Wikipedia and WordNet resp., we can also accomplish our tasks (EL and 
WSD). In this a graphical method along with some heuristics is used to 
extract out the most relevant meaning from the text.


3) Word2vec / Glove - Methods for designing word vectors based on the 
context. These are primarily employed for WSD.


Moreover if those problems are solved then we can address keyword 
search (5.9) and Confidence Scoring (5.19) effectively as both require 
association of entities to the raw text which will provide concerned 
entity and its attributes to search with and the confidence score.


So I would like to work on 5.16 or 5.17 which will encompass those two 
tasks (EL and WSD) and for this I would like to ask which method will 
be the best for these two tasks? According to me it is the babelfy 
method which will be appropriate for both of these tasks.


Thanks,
Abhishek Gupta

On Feb 23, 2015 5:46 PM, "Thiago Galery" wrote:


Hi Abishek, if you are interested in contributing to any DBpedia
project or participating in Gsoc this year it might be a good idea
to take a look at this page http://wiki.dbpedia.org/gsoc2015/ideas
. This might help you to specify how/where you can contribute.
Hope this helps,
Thiago

On Sun, Feb 22, 2015 at 2:09 PM, Abhishek Gupta wrote:

Hi all,

I am Abhishek Gupta. I am a student of Electrical Engineering
from IIT Delhi. Recently I have worked on the projects related
to Machine Learning and Natural Language Processing (i.e.
Information Extraction) in which I extracted Named Entities
from raw text to populate knowledge base with new entities.
Hence I am inclined to work in this area. Besides this I am
also familiar with programming languages like C, C++ and Java
primarily.

So I presume that I can contribute a lot towards extracting
structured data from wikipedia which is one of the primary
step towards Dbpedia's primary goal.

So can anyone please help me out where to start from so as to
contribute towards this?

Regards
Abhishek Gupta







Re: [Dbpedia-gsoc] Contribute to DbPedia

2015-03-08 Thread Thiago Galery
Hi Abhishek, here are some thoughts about some of your questions:

I would like to ask a few questions:
> 1) Are we designing these vectors to use in the disambiguation step of
> Entity Linking (matching raw text entity to KB entity) or Is there any
> other task we have in mind where these vectors can be employed?
>


The main focus would be disambiguation, but one could reuse the contextual
score of the entity to determine how relevant that entity is for the text.



> 2) At present which model is used for disambiguation in dbpedia-spotlight?
>


Correct me if I am wrong, but I think disambiguation is done by cosine
similarity (on term frequency) between the context surrounding the
extracted surface form and the context stored for each candidate entity of
that surface form.


> 3) Are we trying to focus on modelling context vectors for infrequent
> words primarily as there might not have enough information hence difficult
> to model?
>

The problem is not related to frequent words per se, but more to how the
context for each entity is determined. The map-reduce job that generates
the stats used by Spotlight extracts the surrounding words (according to a
window and other constraints) of each link to an entity and counts them,
which means heavily linked entities get a larger context than less
frequently linked ones. This creates a heavy bias when disambiguating
certain entities, hence a case where smoothing might be a good call.
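A minimal sketch of that windowed context extraction (purely illustrative, not the actual map-reduce job; names are invented):

```python
from collections import Counter

def context_counts(tokens: list[str], link_positions: list[int],
                   window: int = 5) -> Counter:
    # Count the words within `window` tokens of each link mention.
    # Heavily linked entities accumulate far larger context counts,
    # which is the bias described above.
    counts = Counter()
    for pos in link_positions:
        lo, hi = max(0, pos - window), min(len(tokens), pos + window + 1)
        counts.update(t for i, t in enumerate(tokens[lo:hi], lo) if i != pos)
    return counts

# Toy example: one link at token index 3 ("fox"), window of 2.
tokens = "the quick brown fox jumps over the lazy dog".split()
counts = context_counts(tokens, [3], window=2)
```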



>
>
> Regarding Project 5.16 (DBpedia Spotlight - Better Surface form Matching):
>
> *How to deal with linguistic variation: lowercase/uppercase surface forms,
> determiners, accents, unicode, in a way such that the right generalizations
> can be made and some form of probabilistic structured can be determined in
> a principled way?*
> For dealing with linguistic variations we can calculate lexical
> translation probability from all probable name mentions to entities in KB
> as shown in Entity Name Model in [2].
>
> *Improve the memory footprint of the stores that hold the surface forms
> and their associated entities.*
> In what respect we are planning to improve footprints whether in terms of
> space or association or something else?
>
> For this project I have a couple of questions in mind:
> 1) Are we planning to improve the same model that we are using in
> dbpedia-spotlight for entity linking?
>

Yes


> 2) If not we can change the whole model itself to something else like:
> a) Generative Model [2]
> b) Discriminative Model [3]
> c) Graph Based [4] - Babelfy
> d) Probabilistic Graph Based
>

Incorporating something like (c) or (d) might be a good call, but might be
way bigger than one summer.



> 3) Why are we planning to store surface forms with associated entities
> instead of finding associated entities during disambiguation itself?
>

Not sure what you mean by that.


> Besides this, I would also like to know about the warm-up task I should do.
>

If you check the pull request page in Spotlight, @dav009 has a PR which he
claims to be a mere *idea* but which forces surface forms to be stemmed
before storing. Pulling from that branch, recompiling, running Spotlight
and looking at some of the results would be a good start. You can also nag
us on that issue with ideas you have once you understand the code.



>
> Thanks,
> Abhishek Gupta
>
> [1]
> https://drive.google.com/file/d/0B7XkCwpI5KDYRWRnd1RzWXQ2TWc/edit?usp=sharing
> [2] https://aclweb.org/anthology/P/P11/P11-1095.pdf
> [3]
> http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/38389.pdf
> [4] http://wwwusers.di.uniroma1.it/~moro/MoroRaganatoNavigli_TACL2014.pdf
> [5] http://www.aclweb.org/anthology/D11-1072
>
> On Tue, Mar 3, 2015 at 2:01 AM, David Przybilla 
> wrote:
>
>> Hi Abhisek,
>>
>> There is a lot of experimentation which can be done with both 5.16  and
>> 5.17.
>>
>> In my opinion the current problem is that the Surface Form(SF) matching
>> is a bit poor.
>> Mixing the Babelfy Superstring matching with other ideas to make SF
>> spotting better could be a great start.
>> You can also bring ideas from papers such as [1] in order to address more
>> linguistic variations.
>>
>> It's hard to debate which one is better, however you can mix ideas  i.e:
>> use superstring matching to greedy match more Surface forms with more
>> linguistic variations, while using word2vec in the disambiguation stage.
>>
>> Feel free to poke me if you would like to discuss in more detail :)
>>
>>
>> [1] https://aclweb.org/anthology/P/P11/P11-1095.pdf
>>
>>
>>
>>
>>
>> On Mon, Mar 2, 2015 at 7:21 PM, Abhishek Gupta  wrote:
>>
>>> Hi all,
>>>
>>> Recently I checked out the ideas list of DBpedia for GSoC 2015 and I
>>> should admit one thing that every idea is more interesting than the
>>> previous one. While I was looking out for ideas that interests me I found
>>> following ideas most fascinating and I wish I could work on all of them but
>>> unfortunately I cou

Re: [Dbpedia-gsoc] Contribute to DbPedia

2015-03-08 Thread Abhishek Gupta
Hi all,

I would like to present my views to your questions regarding project ideas
5.15 (DBpedia Spotlight - Better Context Vectors), 5.16 (DBpedia Spotlight
- Better Surface form Matching) and would like to raise some also.

Regarding Project 5.15 (DBpedia Spotlight - Better Context Vectors):

*Does smoothing/pruning offer a significant improvement on Spotlight's
performance?*
Smoothing is always recommended whenever we employ a method that counts
objects (words, bigrams, etc.) in any probabilistic modelling of context
vectors, and there are many variants to investigate, from simple add-one
(Laplace) smoothing to advanced ones like Good-Turing, Kneser-Ney and
back-off models, as covered in Stanford's NLP course on Coursera. Smoothing
usually improves test results, though the gain may not be very significant.
As for pruning word vectors, it may or may not improve results, depending
on the task at hand. More dimensions usually produce better results, but as
shown on page 14 of the word2vec NIPS slides [1], a CBOW model with 300
dimensions outperforms a Skip-gram model with 1000 dimensions. One thing to
keep in mind is that smaller dimensionality saves both time and space.

*What distributional methods can be used to represent context (e.g.
word2vec / Glove)? Do they offer a significant performance improvement?*
For representing context we have the following options:
1) Matrix factorization models - Latent Semantic Indexing, Latent Dirichlet
Allocation
2) Clustering-based models - Brown clustering (Brown et al. 1992), exchange
clustering (Martin et al. 1998, Clark 2003)
3) Distributed representations - word2vec (Continuous Skip-gram model,
Continuous Bag-of-Words model)
4) Log-bilinear model - GloVe
Among all of these I would prefer word2vec, because we have a Recursive
Neural Network based algorithm (Socher et al. 2014) for representing
phrases, i.e. the context itself, in word vector space.

*Is there any other metric to mediate the measured similarity between
entity candidates and the context around the mention?*
Following are some other metric options we could employ in matrix
factorization models:
1) Term-document - http://en.wikipedia.org/wiki/Latent_semantic_indexing
2) Term-term - HAL (Lund and Burgess 1996), the entropy-based COALS method
(Rohde et al. 2006), PPMI-based methods (Bullinaria and Levy), Hellinger
PCA (Lebret and Collobert 2014)
Almost all of these methods are discussed in the GloVe paper, with
comparisons across different datasets and tasks.
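As a concrete example of one of these metrics, here is a minimal PPMI sketch over a term-term count table (the co-occurrence counts are invented for illustration):

```python
import math
from collections import Counter

def ppmi(cooc: dict) -> dict:
    # Positive pointwise mutual information over a term-term count
    # table: PPMI(w, c) = max(0, log2(P(w, c) / (P(w) * P(c)))).
    total = sum(cooc.values())
    w_counts, c_counts = Counter(), Counter()
    for (w, c), n in cooc.items():
        w_counts[w] += n
        c_counts[c] += n
    scores = {}
    for (w, c), n in cooc.items():
        pmi = math.log2((n / total) /
                        ((w_counts[w] / total) * (c_counts[c] / total)))
        scores[(w, c)] = max(0.0, pmi)
    return scores

# Invented 2x2 co-occurrence table for illustration.
scores = ppmi({("ice", "cold"): 3, ("ice", "hot"): 1,
               ("steam", "cold"): 1, ("steam", "hot"): 3})
```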

I would like to ask a few questions:
1) Are we designing these vectors to use in the disambiguation step of
Entity Linking (matching raw text entity to KB entity) or Is there any
other task we have in mind where these vectors can be employed?
2) At present which model is used for disambiguation in dbpedia-spotlight?
3) Are we trying to focus on modelling context vectors primarily for
infrequent words, which might not have enough information and hence be
difficult to model?


Regarding Project 5.16 (DBpedia Spotlight - Better Surface form Matching):

*How to deal with linguistic variation: lowercase/uppercase surface forms,
determiners, accents, unicode, in a way such that the right generalizations
can be made and some form of probabilistic structure can be determined in a
principled way?*
For dealing with linguistic variations we can calculate lexical translation
probability from all probable name mentions to entities in KB as shown in
Entity Name Model in [2].

*Improve the memory footprint of the stores that hold the surface forms and
their associated entities.*
In what respect are we planning to improve the footprint: in terms of
space, association, or something else?

For this project I have a couple of questions in mind:
1) Are we planning to improve the same model that we are using in
dbpedia-spotlight for entity linking?
2) If not we can change the whole model itself to something else like:
a) Generative Model [2]
b) Discriminative Model [3]
c) Graph Based [4] - Babelfy
d) Probabilistic Graph Based
3) Why are we planning to store surface forms with associated entities
instead of finding associated entities during disambiguation itself?

Besides this, I would also like to know about the warm-up task I should do.

Thanks,
Abhishek Gupta

[1]
https://drive.google.com/file/d/0B7XkCwpI5KDYRWRnd1RzWXQ2TWc/edit?usp=sharing
[2] https://aclweb.org/anthology/P/P11/P11-1095.pdf
[3]
http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/38389.pdf
[4] http://wwwusers.di.uniroma1.it/~moro/MoroRaganatoNavigli_TACL2014.pdf
[5] http://www.aclweb.org/anthology/D11-1072

On Tue, Mar 3, 2015 at 2:01 AM, David Przybilla 
wrote:

> Hi Abhisek,
>
> There is a lot of experimentation which can be done with both 5.16  and
> 5.17.
>
> In my opinion the current problem is tha

Re: [Dbpedia-gsoc] Contribute to DbPedia

2015-03-02 Thread David Przybilla
Hi Abhisek,

There is a lot of experimentation which can be done with both 5.16  and
5.17.

In my opinion the current problem is that the Surface Form(SF) matching is
a bit poor.
Mixing the Babelfy Superstring matching with other ideas to make SF
spotting better could be a great start.
You can also bring ideas from papers such as [1] in order to address more
linguistic variations.

It's hard to debate which one is better; however, you can mix ideas, e.g.
use superstring matching to greedily match more surface forms with more
linguistic variations, while using word2vec in the disambiguation stage.

Feel free to poke me if you would like to discuss in more detail :)


[1] https://aclweb.org/anthology/P/P11/P11-1095.pdf





On Mon, Mar 2, 2015 at 7:21 PM, Abhishek Gupta  wrote:

> Hi all,
>
> Recently I checked out the ideas list of DBpedia for GSoC 2015 and I
> should admit one thing that every idea is more interesting than the
> previous one. While I was looking out for ideas that interests me I found
> following ideas most fascinating and I wish I could work on all of them but
> unfortunately I couldn't:
>
> 1) 5.1 Fact Extraction from Wikipedia Text
>
> 2) 5.9 Keyword Search on DBpedia Data
>
> 3) 5.16 DBpedia Spotlight - Better Context Vectors
>
> 4) 5.17 DBpedia Spotlight - Better Surface form Matching
>
> 5) 5.19 DBpedia Spotlight - Confidence / Relevance Scores
>
> But in all these I found a couple of ideas interlinked, in other words one
> solution might leads to another. Like in 5.1, 5.16, 5.17 our primary
> problems are Entity Linking (EL) and Word Sense Disambiguation (WSD) from
> raw text to DBpedia entities so as to understand raw text and disambiguate
> senses or entities. So if we can address these two tasks efficiently then
> we can solve problems associated with these three ideas.
>
> Following are some methods which were there in the research papers
> mentioned in references of these ideas.
>
> 1) FrameNet: Identify frames (indicating a particular type of situation
> along with its participants, i.e. task, doer and props), and then identify
> Logical Units, and their associated Frame Elements by using models trained
> primarily on crowd-sourced data. Primarily used for Automatic Semantic Role
> Labeling.
>
> 2) Babelfy: Using a wide semantic network, encoding structural and lexical
> information of both type encyclopedic and lexicographic like Wikipedia and
> WordNet resp., we can also accomplish our tasks (EL and WSD). In this a
> graphical method along with some heuristics is used to extract out the most
> relevant meaning from the text.
>
> 3) Word2vec / Glove - Methods for designing word vectors based on the
> context. These are primarily employed for WSD.
>
> Moreover if those problems are solved then we can address keyword search
> (5.9) and Confidence Scoring (5.19) effectively as both require association
> of entities to the raw text which will provide concerned entity and its
> attributes to search with and the confidence score.
>
> So I would like to work on 5.16 or 5.17 which will encompass those two
> tasks (EL and WSD) and for this I would like to ask which method will be
> the best for these two tasks? According to me it is the babelfy method
> which will be appropriate for both of these tasks.
>
> Thanks,
> Abhishek Gupta
> On Feb 23, 2015 5:46 PM, "Thiago Galery"  wrote:
>
>> Hi Abishek, if you are interested in contributing to any DBpedia project
>> or participating in Gsoc this year it might be a good idea to take a look
>> at this page http://wiki.dbpedia.org/gsoc2015/ideas . This might help
>> you to specify how/where you can contribute. Hope this helps,
>> Thiago
>>
>> On Sun, Feb 22, 2015 at 2:09 PM, Abhishek Gupta 
>> wrote:
>>
>>> Hi all,
>>>
>>> I am Abhishek Gupta. I am a student of Electrical Engineering from IIT
>>> Delhi. Recently I have worked on the projects related to Machine Learning
>>> and Natural Language Processing (i.e. Information Extraction) in which I
>>> extracted Named Entities from raw text to populate knowledge base with new
>>> entities. Hence I am inclined to work in this area. Besides this I am also
>>> familiar with programming languages like C, C++ and Java primarily.
>>>
>>> So I presume that I can contribute a lot towards extracting structured
>>> data from wikipedia which is one of the primary step towards Dbpedia's
>>> primary goal.
>>>
>>> So can anyone please help me out where to start from so as to contribute
>>> towards this?
>>>
>>> Regards
>>> Abhishek Gupta
>>>
>>>

Re: [Dbpedia-gsoc] Contribute to DbPedia

2015-03-02 Thread Abhishek Gupta
Hi all,

Recently I checked out the ideas list of DBpedia for GSoC 2015 and I must
admit that every idea is more interesting than the previous one. While
looking for ideas that interest me, I found the following most fascinating;
I wish I could work on all of them, but unfortunately I cannot:

1) 5.1 Fact Extraction from Wikipedia Text

2) 5.9 Keyword Search on DBpedia Data

3) 5.16 DBpedia Spotlight - Better Context Vectors

4) 5.17 DBpedia Spotlight - Better Surface form Matching

5) 5.19 DBpedia Spotlight - Confidence / Relevance Scores

Among all of these I found a couple of ideas interlinked; in other words,
one solution might lead to another. In 5.1, 5.16 and 5.17 our primary
problems are Entity Linking (EL) and Word Sense Disambiguation (WSD) from
raw text to DBpedia entities, so as to understand raw text and disambiguate
senses or entities. If we can address these two tasks efficiently, then we
can solve the problems associated with all three ideas.

Following are some methods which were there in the research papers
mentioned in references of these ideas.

1) FrameNet: Identify frames (indicating a particular type of situation
along with its participants, i.e. task, doer and props), and then identify
Logical Units, and their associated Frame Elements by using models trained
primarily on crowd-sourced data. Primarily used for Automatic Semantic Role
Labeling.

2) Babelfy: Using a wide semantic network that encodes structural and
lexical information of both encyclopedic and lexicographic kinds (e.g.
Wikipedia and WordNet, respectively), we can also accomplish our tasks (EL
and WSD). Here a graph-based method, together with some heuristics, is used
to extract the most relevant meaning from the text.

3) Word2vec / Glove - Methods for designing word vectors based on the
context. These are primarily employed for WSD.

Moreover, if those problems are solved, we can address keyword search
(5.9) and confidence scoring (5.19) effectively, since both require
associating entities with the raw text, which provides the relevant entity
and its attributes for the search and for the confidence score.

So I would like to work on 5.16 or 5.17, which encompass those two tasks
(EL and WSD), and I would like to ask which method would be best for them.
In my view, the Babelfy method is appropriate for both tasks.

Thanks,
Abhishek Gupta
On Feb 23, 2015 5:46 PM, "Thiago Galery"  wrote:

> Hi Abishek, if you are interested in contributing to any DBpedia project
> or participating in Gsoc this year it might be a good idea to take a look
> at this page http://wiki.dbpedia.org/gsoc2015/ideas . This might help you
> to specify how/where you can contribute. Hope this helps,
> Thiago
>
> On Sun, Feb 22, 2015 at 2:09 PM, Abhishek Gupta  wrote:
>
>> Hi all,
>>
>> I am Abhishek Gupta. I am a student of Electrical Engineering from IIT
>> Delhi. Recently I have worked on the projects related to Machine Learning
>> and Natural Language Processing (i.e. Information Extraction) in which I
>> extracted Named Entities from raw text to populate knowledge base with new
>> entities. Hence I am inclined to work in this area. Besides this I am also
>> familiar with programming languages like C, C++ and Java primarily.
>>
>> So I presume that I can contribute a lot towards extracting structured
>> data from wikipedia which is one of the primary step towards Dbpedia's
>> primary goal.
>>
>> So can anyone please help me out where to start from so as to contribute
>> towards this?
>>
>> Regards
>> Abhishek Gupta
>>
>>
>>
>>
>


[Dbpedia-gsoc] Contribute to DbPedia

2015-02-22 Thread Abhishek Gupta
Hi all,

I am Abhishek Gupta, a student of Electrical Engineering at IIT Delhi.
Recently I have worked on projects related to Machine Learning and Natural
Language Processing (i.e. Information Extraction), in which I extracted
Named Entities from raw text to populate a knowledge base with new
entities. Hence I am inclined to work in this area. Besides this, I am also
familiar with programming languages, primarily C, C++ and Java.

So I presume that I can contribute a lot towards extracting structured
data from Wikipedia, which is one of the primary steps towards DBpedia's
main goal.

So can anyone please help me figure out where to start, so that I can
contribute towards this?

Regards
Abhishek Gupta