Hi Abhishek,

Those sound like good ideas. Please check my answers below:

On Wed, Mar 11, 2015 at 2:34 PM, Thiago Galery <tgal...@gmail.com> wrote:

> Thanks for your input Abhishek! Here are some thoughts:
>
> On Wed, Mar 11, 2015 at 7:40 AM, Abhishek Gupta <a.gu...@gmail.com> wrote:
>
>> Hi all,
>>
>> After looking at the PR
>> https://github.com/dbpedia-spotlight/dbpedia-spotlight/pull/298, which I
>> assume Thiago is referring to, some questions came to mind to clarify the
>> context:
>>
>
> Yeah, this is indeed the PR I had in mind.
>
>
>> Am I right about these two observations?
>> 1) We are creating a SF store for fast extraction of probable candidate
>> entities.
>>
>
> More or less. The SF store simply stores the association from a given
> surface form to a set of candidate entities. It's quite a common component
> of entity linking systems, though whether it is kept as a 'store' or in
> some other data structure varies a lot.
>
>
>> 2) We are considering stemming to solve the problem of finding probable
>> candidates when we have no prior information linking a surface form to a
>> particular entity. For example, 'The giants' is associated with a candidate
>> X, but when a new surface form 'giant' appears it will not be recognized
>> because 'giants' was not previously used in the context of X.
>>
>
> Yeah, simply put: 'giant' can be associated with a set of candidates that
> differs from the set associated with 'giants'.
>
>
>>
>> To handle the above issue I would like to propose a couple of approaches:
>>
>> *Approach 1: This approach has high space complexity*
>>
>> From my perspective, when we create a SF store it should satisfy the
>> following properties:
>> 1) It should extract at least the correct surface form in almost all
>> cases, regardless of how many SFs we extract, by keeping surface form
>> selection loose.
>> 2) We should be able to identify even SFs we have not seen in our data,
>> e.g. HD -> Harley Davidson.
>>
>> This might lead to quite a large number of candidate entities, and hence
>> harder disambiguation, but that is the cost we may have to pay for
>> detecting unseen instances. Disambiguation might still be able to handle
>> this because the contexts of the entities will be quite different in most
>> cases.
>>
>> Prospective solutions to satisfy the above properties:
>> 1) Keep the SF selection, and hence the candidate selection algorithm,
>> loose.
>> 2) For this, besides stemming, I would like to introduce another design
>> that uses the following pipeline:
>>     a) Convert every entity using a function and find
>> probableSurfaceForms_1 (pSF1), like below:
>>          (i) Entity retained: "Michael Jeffery Jordan" -> "Michael
>> Jeffery Jordan"
>>          (ii) Acronym: "Michael Jeffery Jordan" -> "MJJ"
>>          (iii) Omission: "Michael Jeffery Jordan" -> "Michael Jordan"
>>          (iv) Combinations of (i), (ii), and (iii): "M.J. Jordan", "M.
>> Jordan", "M." (like Mr. M.)
>>     b) Convert pSF1 to pSF2:
>>          Step 1: Remove determiners, prepositions, stop words, punctuation
>>          Step 2: Convert to lowercase
>>     c) Perform stemming on pSF2 and convert pSF2 to stemmedSurfaceForm
>> (sFF)
>>     d) Store sFFs with the indexes of the corresponding entities. (We are
>> not storing pSF1s or pSF2s.)
>>
>
>
One case which commonly occurs is a mix of upper and lower case, e.g.
"Michael jordan", but yeah :) this is a good direction. We partially tried
to implement something in that direction; see the sketch below.
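To make the idea concrete, here is a rough Python sketch of what steps
(a)-(d) could look like. To be clear, this is not what Spotlight currently
does: the stop-word list, the crude suffix-stripping stemmer and the example
labels are just placeholders for illustration.

import re

STOP_WORDS = {"the", "a", "an", "of", "in", "on", "at", "for", "to"}

def crude_stem(token):
    # stand-in for a real stemmer (e.g. Snowball); strips plural-ish suffixes
    for suffix in ("es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)]
    return token

def normalize(surface_form):
    # steps (b) and (c): lowercase, drop punctuation and stop words, then stem
    tokens = re.findall(r"[a-z0-9]+", surface_form.lower())
    return " ".join(crude_stem(t) for t in tokens if t not in STOP_WORDS)

def variants(entity_label):
    # step (a): keep the label, add an acronym and a first/last-token omission
    tokens = entity_label.split()
    forms = {entity_label, "".join(t[0] for t in tokens)}
    if len(tokens) > 2:
        forms.add(tokens[0] + " " + tokens[-1])
    return forms

# step (d): store only the stemmed surface forms (sFFs) against entity indexes
sf_index = {}
for entity_id, label in enumerate(["Michael Jeffery Jordan", "Harley-Davidson"]):
    for variant in variants(label):
        sf_index.setdefault(normalize(variant), set()).add(entity_id)

print(sf_index)  # e.g. {'michael jeffery jordan': {0}, 'mjj': {0}, 'michael jordan': {0}, ...}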


> We tried to do something like this in the PR you mentioned, but in a much
> less systematic way, so your suggestions are welcome.
> One thing that you need to worry about, though, concerns step (b).
> Spotlight tries to stay language agnostic, so we would need to add
> information about determiners, prepositions and so on for a series of
> languages. This is not very complicated but worth keeping in mind.
>
>
>
>
>>
>> Now let's approach it from the raw text side. After spotting a sequence of
>> tokens (SoT) in raw text, which might be an entity reference, we pass this
>> sequence of tokens through the same function mentioned in step 2 (parts (a)
>> and (b)) and then match the output against the stored sFFs. That gives us
>> the entities of interest. We can also score the relevance between the
>> sequence of tokens and the entities behind the matching sFFs using a
>> function like Levenshtein distance.
>>
>> I am adding these extra steps to address the following rare situation,
>> which we don't have in our data:
>> "My *HD* is the most amazing motorcycle I have ever seen."
>> This situation is quite unlikely, as there might not be any reference from
>> HD to Harley Davidson, but from the context we can infer that it probably
>> means Harley Davidson using the above approach.
>>
>>
> I'm not sure I get this, but we would definitely need to review the way we
> score the association between a surface form and a given candidate. In your
> example you rely on the contextual score, but it's very important to keep
> in mind that in order for the loose matching approach to work, we would
> need to make some improvements to the context store as well. This is why
> there's another GSoC idea related to that.
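On the lookup side, just to illustrate the loose matching you describe above
(again only a sketch, reusing normalize() and sf_index from the snippet
earlier in this mail; this is not Spotlight's actual scoring):

def levenshtein(a, b):
    # classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, 1):
        curr = [i]
        for j, ch_b in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ch_a != ch_b)))
        prev = curr
    return prev[-1]

def rank_candidates(spotted_tokens, sf_index):
    # normalize the spotted sequence of tokens just like the stored sFFs were,
    # then rank every stored form (and its entities) by edit distance to it
    query = normalize(" ".join(spotted_tokens))
    return sorted((levenshtein(query, sff), sff, entities)
                  for sff, entities in sf_index.items())

# e.g. rank_candidates(["The", "giants"], sf_index) would rank sFFs whose
# stemmed form is close to "giant" at the top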
>
>
>> *Approach 2: This approach might have high time complexity*
>>
>> Instead of finding candidate entities without using context, we can use
>> the context to some extent:
>> 1) Locate our context in a connected-entity graph using the context of the
>> sequence of tokens.
>> 2) Find all entities linked to our context; these become our candidate
>> entities, following the Babelfy approach.
>> 3) Pass all the candidate entities through the function mentioned in step
>> 2 of Approach 1.
>> 4) Pass the SoT through the same function (parts (a) and (b)).
>> 5) Score the candidates using Levenshtein distance.
>>
>> Note that in Approach 2 we are already doing a bit of disambiguation in
>> step 1 itself, which will reduce the number of sFFs we have to consider.
>>
>> Please review these ideas and provide your feedback.
>>
>>
> I'm not sure whether I understand this entirely, but I'm very interested
> in other ways to conceptualise context. Spotlight just uses a simple
> distributional method, but you can definitely use the link structure within
> Wikipedia to find candidates that are more related to each other. In your
> example above, the pair Motorcycle - Harley Davidson would be much more
> related than Motorcycle - Hard Drive, for example. However, this would
> require coding from scratch, so bear in mind that it might be too much
> work.
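For what it's worth, one well-known way to get that kind of relatedness out
of the Wikipedia link graph is the Milne & Witten measure over shared
in-links. A toy sketch follows; the in-link sets below are made up, in
practice they would come from the Wikipedia dumps:

from math import log

def milne_witten(inlinks_a, inlinks_b, total_articles):
    # relatedness from shared Wikipedia in-links (Milne & Witten, 2008)
    shared = len(inlinks_a & inlinks_b)
    if shared == 0:
        return 0.0
    numerator = log(max(len(inlinks_a), len(inlinks_b))) - log(shared)
    denominator = log(total_articles) - log(min(len(inlinks_a), len(inlinks_b)))
    return max(0.0, 1.0 - numerator / denominator)

# toy in-link sets, purely illustrative
inlinks = {
    "Motorcycle":      {1, 2, 3, 4, 5},
    "Harley-Davidson": {2, 3, 4, 9},
    "Hard_drive":      {7, 8},
}
print(milne_witten(inlinks["Motorcycle"], inlinks["Harley-Davidson"], 100))  # ~0.84
print(milne_witten(inlinks["Motorcycle"], inlinks["Hard_drive"], 100))       # 0.0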
>
>
>> Moreover, I am trying to set up the server on my PC, which is taking some
>> time due to a 10 GB file. I will share results as soon as I get some. Until
>> then I might follow up with some other warm-up task related to project
>> ideas 5.15 and 5.16.
>>
>>

The English model is a bit big. Consider using a smaller model for
experimenting, e.g. Danish, Turkish, Dutch or Spanish.



> Regards,
>> Abhishek
>>
>
> Let us know if you need any help.
>
> All the best,
>
> Thiago Galery
>
>>
>> On Sun, Mar 8, 2015 at 11:47 PM, Thiago Galery <tgal...@gmail.com> wrote:
>>
>>> Hi Abhishek, here are some thoughts about some of your questions:
>>>
>>> I would like to ask a few questions:
>>>> 1) Are we designing these vectors for use in the disambiguation step of
>>>> Entity Linking (matching a raw text entity to a KB entity), or is there
>>>> any other task in mind where these vectors can be employed?
>>>>
>>>
>>>
>>> The main focus would be disambiguation, but one could reuse the
>>> contextual score of the entity to determine how relevant that entity is for
>>> the text.
>>>
>>>
>>>
>>>> 2) At present which model is used for disambiguation in
>>>> dbpedia-spotlight?
>>>>
>>>
>>>
>>> Correct me if I am wrong, but I think that disambiguation is done by
>>> cosine similarity (on term frequency) between the context surrounding the
>>> extracted surface form and the context associated with each candidate
>>> entity associated with that surface form.
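For reference, a bare-bones sketch of that kind of term-frequency cosine
disambiguation could look like the following (an illustration only, not the
actual store-backed implementation in Spotlight):

from collections import Counter
from math import sqrt

def cosine(tokens_a, tokens_b):
    # term-frequency cosine similarity between two bags of words
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def disambiguate(context_tokens, candidate_contexts):
    # candidate_contexts: entity -> tokens harvested around links to that entity
    return max(candidate_contexts,
               key=lambda entity: cosine(context_tokens, candidate_contexts[entity]))

print(disambiguate(["ride", "engine", "road"],
                   {"Harley-Davidson": ["motorcycle", "engine", "ride"],
                    "Hard_drive": ["disk", "storage", "computer"]}))
# -> Harley-Davidson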
>>>
>>>
>>>> 3) Are we primarily trying to focus on modelling context vectors for
>>>> infrequent words, as there might not be enough information for them and
>>>> hence they are difficult to model?
>>>>
>>>
>>> The problem is not related to frequent words per se, but more about how
>>> the context for each entity is determined. The map-reduce job that
>>> generates the stats used by Spotlight extracts the surrounding words
>>> (according to a window and other constraints) of each link to an entity
>>> and counts them, which means that heavily linked entities have a larger
>>> context than not so frequently linked ones. This creates a heavy bias when
>>> disambiguating certain entities, hence a case where smoothing might be a
>>> good call.
>>>
>>>
>>>
>>>>
>>>>
>>>> Regarding Project 5.16 (DBpedia Spotlight - Better Surface form
>>>> Matching):
>>>>
>>>> *How to deal with linguistic variation: lowercase/uppercase surface
>>>> forms, determiners, accents, unicode, in a way such that the right
>>>> generalizations can be made and some form of probabilistic structure can
>>>> be determined in a principled way?*
>>>> To deal with linguistic variations we can calculate the lexical
>>>> translation probability from all probable name mentions to entities in
>>>> the KB, as shown in the Entity Name Model in [2].
>>>>
>>>> *Improve the memory footprint of the stores that hold the surface forms
>>>> and their associated entities.*
>>>> In what respect are we planning to improve the footprint: in terms of
>>>> space, of the associations, or something else?
>>>>
>>>> For this project I have a couple of questions in mind:
>>>> 1) Are we planning to improve the same model that we are using in
>>>> dbpedia-spotlight for entity linking?
>>>>
>>>
>>> Yes
>>>
>>>
>>>> 2) If not we can change the whole model itself to something else like:
>>>>     a) Generative Model [2]
>>>>     b) Discriminative Model [3]
>>>>     c) Graph Based [4] - Babelfy
>>>>     d) Probabilistic Graph Based
>>>>
>>>
>>> Incorporating something like (c) or (d) might be a good call, but might
>>> be way bigger than one summer.
>>>
>>>
>>>
>>>> 3) Why are we planning to store surface forms with associated entities
>>>> instead of finding associated entities during disambiguation itself?
>>>>
>>>
>>> Not sure what you mean by that.
>>>
>>>
>>>> Besides this, I would also like to know about the warm-up task I should
>>>> do.
>>>>
>>>
>>> If you check the pull request page in Spotlight, @dav009 has a PR which
>>> he claims is a mere *idea*, but it forces surface forms to be stemmed
>>> before storing. Pulling from that branch, recompiling, running Spotlight
>>> and looking at some of the results would be a good start. You can also nag
>>> us on that issue about ideas you might have once you understand the code.
>>>
>>>
>>>
>>>>
>>>> Thanks,
>>>> Abhishek Gupta
>>>>
>>>> [1]
>>>> https://drive.google.com/file/d/0B7XkCwpI5KDYRWRnd1RzWXQ2TWc/edit?usp=sharing
>>>> [2] https://aclweb.org/anthology/P/P11/P11-1095.pdf
>>>> [3]
>>>> http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/38389.pdf
>>>> [4]
>>>> http://wwwusers.di.uniroma1.it/~moro/MoroRaganatoNavigli_TACL2014.pdf
>>>> [5] http://www.aclweb.org/anthology/D11-1072
>>>>
>>>> On Tue, Mar 3, 2015 at 2:01 AM, David Przybilla <
>>>> dav.alejan...@gmail.com> wrote:
>>>>
>>>>> Hi Abhishek,
>>>>>
>>>>> There is a lot of experimentation which can be done with both 5.16
>>>>> and 5.17.
>>>>>
>>>>> In my opinion the current problem is that the Surface Form (SF)
>>>>> matching is a bit poor.
>>>>> Mixing the Babelfy superstring matching with other ideas to make SF
>>>>> spotting better could be a great start.
>>>>> You can also bring in ideas from papers such as [1] in order to address
>>>>> more linguistic variations.
>>>>>
>>>>> It's hard to debate which one is better; however, you can mix ideas,
>>>>> e.g. use superstring matching to greedily match more surface forms with
>>>>> more linguistic variations, while using word2vec in the disambiguation
>>>>> stage.
>>>>>
>>>>> Feel free to poke me if you would like to discuss in more detail :)
>>>>>
>>>>>
>>>>> [1] https://aclweb.org/anthology/P/P11/P11-1095.pdf
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Mar 2, 2015 at 7:21 PM, Abhishek Gupta <a.gu...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> Recently I checked out the DBpedia ideas list for GSoC 2015, and I
>>>>>> must admit that every idea is more interesting than the previous one.
>>>>>> While looking for ideas that interest me, I found the following most
>>>>>> fascinating; I wish I could work on all of them, but unfortunately I
>>>>>> can't:
>>>>>>
>>>>>> 1) 5.1 Fact Extraction from Wikipedia Text
>>>>>>
>>>>>> 2) 5.9 Keyword Search on DBpedia Data
>>>>>>
>>>>>> 3) 5.16 DBpedia Spotlight - Better Context Vectors
>>>>>>
>>>>>> 4) 5.17 DBpedia Spotlight - Better Surface form Matching
>>>>>>
>>>>>> 5) 5.19 DBpedia Spotlight - Confidence / Relevance Scores
>>>>>>
>>>>>> Among these I found a couple of ideas interlinked; in other words,
>>>>>> one solution might lead to another. In 5.1, 5.16 and 5.17 the primary
>>>>>> problems are Entity Linking (EL) and Word Sense Disambiguation (WSD)
>>>>>> from raw text to DBpedia entities, so as to understand the raw text and
>>>>>> disambiguate senses or entities. So if we can address these two tasks
>>>>>> efficiently, we can solve the problems associated with these three
>>>>>> ideas.
>>>>>>
>>>>>> Following are some methods from the research papers mentioned in the
>>>>>> references of these ideas.
>>>>>>
>>>>>> 1) FrameNet: Identify frames (indicating a particular type of
>>>>>> situation along with its participants, i.e. task, doer and props), then
>>>>>> identify Lexical Units and their associated Frame Elements using models
>>>>>> trained primarily on crowd-sourced data. Primarily used for automatic
>>>>>> Semantic Role Labeling.
>>>>>>
>>>>>> 2) Babelfy: Using a wide semantic network that encodes both
>>>>>> encyclopedic and lexicographic structural and lexical information
>>>>>> (Wikipedia and WordNet respectively), we can also accomplish our tasks
>>>>>> (EL and WSD). Here a graph-based method, along with some heuristics, is
>>>>>> used to extract the most relevant meaning from the text.
>>>>>>
>>>>>> 3) Word2vec / GloVe: Methods for building word vectors based on
>>>>>> context. These are primarily employed for WSD.
>>>>>>
>>>>>> Moreover, if those problems are solved, then we can address keyword
>>>>>> search (5.9) and confidence scoring (5.19) effectively, as both require
>>>>>> associating entities with the raw text, which provides the relevant
>>>>>> entity and its attributes for search as well as for the confidence
>>>>>> score.
>>>>>>
>>>>>> So I would like to work on 5.16 or 5.17, which encompass those two
>>>>>> tasks (EL and WSD), and I would like to ask which method would be best
>>>>>> for them. In my view the Babelfy method would be appropriate for both.
>>>>>>
>>>>>> Thanks,
>>>>>> Abhishek Gupta
>>>>>> On Feb 23, 2015 5:46 PM, "Thiago Galery" <tgal...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Abhishek, if you are interested in contributing to any DBpedia
>>>>>>> project or participating in GSoC this year, it might be a good idea to
>>>>>>> take a look at this page: http://wiki.dbpedia.org/gsoc2015/ideas .
>>>>>>> This might help you figure out how and where you can contribute. Hope
>>>>>>> this helps,
>>>>>>> Thiago
>>>>>>>
>>>>>>> On Sun, Feb 22, 2015 at 2:09 PM, Abhishek Gupta <a.gu...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I am Abhishek Gupta, a student of Electrical Engineering at IIT
>>>>>>>> Delhi. Recently I have worked on projects related to Machine Learning
>>>>>>>> and Natural Language Processing (i.e. Information Extraction), in
>>>>>>>> which I extracted Named Entities from raw text to populate a knowledge
>>>>>>>> base with new entities, so I am inclined to work in this area. Besides
>>>>>>>> this, I am familiar with programming languages, primarily C, C++ and
>>>>>>>> Java.
>>>>>>>>
>>>>>>>> So I presume that I can contribute a lot towards extracting
>>>>>>>> structured data from Wikipedia, which is one of the primary steps
>>>>>>>> towards DBpedia's main goal.
>>>>>>>>
>>>>>>>> Can anyone please point me to where I should start in order to
>>>>>>>> contribute to this?
>>>>>>>>
>>>>>>>> Regards
>>>>>>>> Abhishek Gupta
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
>
>
>