Re: Named entity coref resolution based on dbpedia categories and rdf:type

Cristian Petroaca Sun, 09 Mar 2014 10:09:05 -0700

I started implementing the engine and I'm having trouble getting results
for noun phrases. I modified the "default" weighted chain to also include
the PosChunkerEngine and ran it on a sample text: "Angela Merkel visited
China. The German chancellor met with various people". I expected the
RDF/XML output to contain some information about the noun phrases, but I
cannot see any.
Could you point me to the correct way to generate the noun phrases?


Thanks,
Cristian


2014-02-09 14:15 GMT+02:00 Cristian Petroaca <cristian.petro...@gmail.com>:

> Opened https://issues.apache.org/jira/browse/STANBOL-1279
>
>
> 2014-02-07 10:53 GMT+02:00 Cristian Petroaca <cristian.petro...@gmail.com>
> :
>
> Hi Rupert,
>>
>> The "spatial" dimension is a good idea. I'll also take a look at Yago.
>>
>> I will create a Jira with what we talked about here. It will probably
>> have just a draft-like description for now and will be updated as I go
>> along.
>>
>> Thanks,
>> Cristian
>>
>>
>> 2014-02-06 15:39 GMT+02:00 Rupert Westenthaler <
>> rupert.westentha...@gmail.com>:
>>
>> Hi Cristian,
>>>
>>> definitely an interesting approach. You should have a look at Yago2
>>> [1]. As far as I can remember, the Yago taxonomy is much better
>>> structured than the one used by dbpedia. Mapping dbpedia suggestions
>>> to concepts in Yago2 is easy, as both dbpedia and Yago2 provide
>>> mappings [2] and [3].
>>>
>>> > 2014-02-05 15:39 GMT+02:00 Rafa Haro <rh...@apache.org>:
>>> >>
>>> >> "Microsoft posted its 2013 earnings. The Redmond's company made a
>>> >> huge profit".
>>>
>>> That's actually a very good example. Spatial contexts are very
>>> important as they are often used for referencing. So I would
>>> suggest treating the spatial context specially. For spatial entities
>>> (like a City) this is easy, but even for others (like a Person or
>>> Company) you could use relations to spatial entities to define their
>>> spatial context. This context could then be used to correctly link
>>> "The Redmond's company" to "Microsoft".
>>>
>>> In addition I would suggest using the "spatial" context of each
>>> entity (basically its relations to entities that are cities, regions,
>>> countries) as a separate dimension, because those are very often used
>>> for coreferences.
>>>
>>> [1] http://www.mpi-inf.mpg.de/yago-naga/yago/
>>> [2] http://downloads.dbpedia.org/3.9/links/yago_links.nt.bz2
>>> [3]
>>> http://www.mpi-inf.mpg.de/yago-naga/yago/download/yago/yagoDBpediaInstances.ttl.7z
>>>
>>>
>>> On Thu, Feb 6, 2014 at 10:33 AM, Cristian Petroaca
>>> <cristian.petro...@gmail.com> wrote:
>>> > There are several dbpedia categories for each entity, in this case for
>>> > Microsoft we have :
>>> >
>>> > category:Companies_in_the_NASDAQ-100_Index
>>> > category:Microsoft
>>> > category:Software_companies_of_the_United_States
>>> > category:Software_companies_based_in_Washington_(state)
>>> > category:Companies_established_in_1975
>>> > category:1975_establishments_in_the_United_States
>>> > category:Companies_based_in_Redmond,_Washington
>>> > category:Multinational_companies_headquartered_in_the_United_States
>>> > category:Cloud_computing_providers
>>> > category:Companies_in_the_Dow_Jones_Industrial_Average
>>> >
>>> > So we also have "Companies based in Redmond, Washington" which could be
>>> > matched.
>>> >
>>> >
>>> > There is still other contextual information from dbpedia which can be
>>> > used. For example, for an Organization we could also include:
>>> > dbpprop:industry = Software
>>> > dbpprop:service = Online Service Providers
>>> > and for a Person (that's for Barack Obama) :
>>> >
>>> > dbpedia-owl:profession:
>>> >                                dbpedia:Author
>>> >                                dbpedia:Constitutional_law
>>> >                                dbpedia:Lawyer
>>> >                                dbpedia:Community_organizing
>>> >
>>> > I'd like to continue investigating this, as I think it may have some
>>> > value in increasing the number of coreference resolutions. I'd like to
>>> > concentrate more on precision than on recall, since we already have a
>>> > set of coreferences detected by the Stanford NLP tool and this would be
>>> > an addition to that (at least this is how I would like to use it).
>>> >
>>> > Is it ok if I track this by opening a Jira? I could update it to show my
>>> > progress and my conclusions. If it turns out that it was a bad idea,
>>> > then at least I'll end up with more knowledge about Stanbol :).
>>> >
>>> >
>>> > 2014-02-05 15:39 GMT+02:00 Rafa Haro <rh...@apache.org>:
>>> >
>>> >> Hi Cristian,
>>> >>
>>> >> The approach sounds nice. I don't want to be the devil's advocate but I'm
>>> >> just not sure about the recall using the dbpedia categories feature. For
>>> >> example, your sentence could also be "Microsoft posted its 2013 earnings.
>>> >> The Redmond's company made a huge profit". So, maybe including more
>>> >> contextual information from dbpedia could increase the recall, but of
>>> >> course it would reduce the precision.
>>> >>
>>> >> Cheers,
>>> >> Rafa
>>> >>
>>> >> On 04/02/14 09:50, Cristian Petroaca wrote:
>>> >>
>>> >>> Back with a more detailed description of the steps for making this
>>> >>> kind of coreference work.
>>> >>>
>>> >>> I will refer to the following text in the steps below in order to
>>> >>> make things clearer: "Microsoft posted its 2013 earnings. The
>>> >>> software company made a huge profit."
>>> >>>
>>> >>> 1. Consider every noun phrase in the text which has:
>>> >>>      a. a determiner (by POS tag) which implies a reference to an
>>> >>> entity local to the text, such as "the, this, these", but not one
>>> >>> such as "another, every", etc. which implies a reference to an
>>> >>> entity outside of the text.
>>> >>>      b. at least one other noun, aside from the main required noun,
>>> >>> which further describes it. For example, I will not count "The
>>> >>> company" as a legitimate candidate since this could create a lot of
>>> >>> false positives because of the double meaning of some words, as in
>>> >>> "in the company of good people".
>>> >>> "The software company" is a good candidate since we also have
>>> >>> "software".
>>> >>>
>>> >>> 2. Match the nouns in the noun phrase to the contents of the dbpedia
>>> >>> categories of each named entity found prior to the location of the
>>> >>> noun phrase in the text.
>>> >>> The dbpedia categories are in the following format (for Microsoft,
>>> >>> for example): "Software companies of the United States".
>>> >>> So we try to match "software company" with that.
>>> >>> First, as you can see, the main noun in the dbpedia category has a
>>> >>> plural form, and it's the same for all the categories I saw. I don't
>>> >>> know if there's an easier way to do this, but I thought of applying a
>>> >>> lemmatizer to the category and the noun phrase in order for them to
>>> >>> have a common denominator. This also works if the noun phrase itself
>>> >>> has a plural form.
>>> >>>
>>> >>> Second, for the comparison I'll need to use only the words in the
>>> >>> category which are themselves nouns, and not prepositions or
>>> >>> determiners such as "of the". This means that I need to POS tag the
>>> >>> contents of the categories as well.
>>> >>> I was thinking of running the POS tagger and lemmatizer on the dbpedia
>>> >>> categories when building the dbpedia-backed entity hub and storing
>>> >>> the results for later use. I don't know how feasible this is at the
>>> >>> moment.
>>> >>>
>>> >>> After this I can compare each noun in the noun phrase with the
>>> >>> equivalent nouns in the categories, and based on the number of
>>> >>> matches I can create a confidence level.
>>> >>>
>>> >>> 3. Match the noun of the noun phrase with the rdf:type from dbpedia
>>> >>> of the named entity. If this matches, increase the confidence level.
>>> >>>
>>> >>> 4. If there are multiple named entities which can match a certain
>>> >>> noun phrase, then link the noun phrase with the closest named entity
>>> >>> prior to it in the text.
>>> >>>
>>> >>> What do you think?
>>> >>>
>>> >>> Cristian
>>> >>>
>>> >>> 2014-01-31 Cristian Petroaca <cristian.petro...@gmail.com>:
>>> >>>
>>> >>>  Hi Rafa,
>>> >>>>
>>> >>>> I don't yet have a concrete heuristic but I'm working on it. I'll
>>> >>>> provide it here so that you guys can give me feedback on it.
>>> >>>>
>>> >>>> What are "locality" features?
>>> >>>>
>>> >>>> I looked at BART and other coref tools such as ArkRef and
>>> >>>> CherryPicker, and they don't provide this kind of coreference.
>>> >>>>
>>> >>>> Cristian
>>> >>>>
>>> >>>>
>>> >>>> 2014-01-30 Rafa Haro <rh...@apache.org>:
>>> >>>>
>>> >>>> Hi Cristian,
>>> >>>>
>>> >>>>> Without having more details about your concrete heuristic, in my
>>> >>>>> honest opinion such an approach could produce a lot of false
>>> >>>>> positives. I don't know if you are planning to use some "locality"
>>> >>>>> features to detect such coreferences, but you need to take into
>>> >>>>> account that it is quite usual for coreferenced mentions to occur
>>> >>>>> even in different paragraphs. Although I'm not an expert in Natural
>>> >>>>> Language Understanding, I would say it is quite difficult to get
>>> >>>>> decent precision/recall rates for coreferencing using fixed rules.
>>> >>>>> Maybe you can give a try to other tools like BART (
>>> >>>>> http://www.bart-coref.org/).
>>> >>>>>
>>> >>>>> Cheers,
>>> >>>>> Rafa Haro
>>> >>>>>
>>> >>>>> On 30/01/14 10:33, Cristian Petroaca wrote:
>>> >>>>>
>>> >>>>>   Hi,
>>> >>>>>
>>> >>>>>> One of the necessary steps for implementing the Event extraction
>>> >>>>>> Engine feature (https://issues.apache.org/jira/browse/STANBOL-1121)
>>> >>>>>> is to have coreference resolution in the given text. This is
>>> >>>>>> currently provided via the stanford-nlp project, but as far as I
>>> >>>>>> saw this module performs mostly pronominal (He, She) or nominal
>>> >>>>>> (Barack Obama and Mr. Obama) coreference resolution.
>>> >>>>>>
>>> >>>>>> In order to get more coreferences from the text I thought of
>>> >>>>>> creating some logic that would detect this kind of coreference:
>>> >>>>>> "Apple reaches new profit heights. The software company just
>>> >>>>>> announced its 2013 earnings."
>>> >>>>>> Here "The software company" obviously refers to "Apple".
>>> >>>>>> So I'd like to detect coreferences of Named Entities where the
>>> >>>>>> noun phrase matches the rdf:type of the Named Entity, in this case
>>> >>>>>> "company", and also has attributes which can be found in the
>>> >>>>>> dbpedia categories of the named entity, in this case "software".
>>> >>>>>>
>>> >>>>>> The detection of candidate phrases such as "The software company"
>>> >>>>>> in the text would be done either by using the new Pos Tag Based
>>> >>>>>> Phrase extraction Engine (noun phrases) or by using a dependency
>>> >>>>>> tree of the sentence and picking up only subjects or objects.
>>> >>>>>>
>>> >>>>>> At this point I'd like to know if this kind of logic would be
>>> >>>>>> useful as a separate Enhancement Engine in Stanbol (in case the
>>> >>>>>> precision and recall are good enough)?
>>> >>>>>>
>>> >>>>>> Thanks,
>>> >>>>>> Cristian
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>
>>>
>>>
>>>
>>> --
>>> | Rupert Westenthaler             rupert.westentha...@gmail.com
>>> | Bodenlehenstraße 11                             ++43-699-11108907
>>> | A-5500 Bischofshofen
>>>
>>
>>
>
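
For illustration only, the four steps proposed in the 04/02/14 message
quoted above could be sketched roughly as follows. This is not Stanbol code
and not the engine's implementation: the stop list stands in for a real POS
tagger, lemma() is a crude singularizer standing in for a real lemmatizer,
and the entity dictionary layout, TYPE_BONUS and the threshold are all
hypothetical choices made for this sketch.

```python
import re

# Assumption: stand-ins for a real POS tagger and lemmatizer.
LOCAL_DETERMINERS = {"the", "this", "these"}
NON_NOUNS = LOCAL_DETERMINERS | {"a", "an", "and", "based", "in", "of"}
TYPE_BONUS = 0.5  # hypothetical weight for an rdf:type match (step 3)


def lemma(word):
    """Crude singularizer; a real lemmatizer would replace this."""
    w = word.lower()
    if w.endswith("ies"):
        return w[:-3] + "y"
    if w.endswith("s") and not w.endswith("ss"):
        return w[:-1]
    return w


def nouns(text):
    """Lemmatized words of `text`, minus the stop list above."""
    return {lemma(t) for t in re.findall(r"[A-Za-z]+", text)
            if t.lower() not in NON_NOUNS}


def is_candidate(noun_phrase):
    """Step 1: a local determiner followed by at least two nouns."""
    words = noun_phrase.lower().split()
    return (bool(words) and words[0] in LOCAL_DETERMINERS
            and len(nouns(noun_phrase)) >= 2)


def confidence(noun_phrase, entity):
    """Steps 2 and 3: best category overlap, plus a bonus on a type match."""
    np_nouns = nouns(noun_phrase)
    if not np_nouns:
        return 0.0
    score = max((len(np_nouns & nouns(c)) / len(np_nouns)
                 for c in entity["categories"]), default=0.0)
    if np_nouns & {lemma(t) for t in entity["types"]}:
        score += TYPE_BONUS
    return score


def resolve(noun_phrase, np_offset, entities, threshold=1.0):
    """Step 4: closest prior entity whose confidence reaches the threshold."""
    if not is_candidate(noun_phrase):
        return None
    matches = [e for e in entities
               if e["offset"] < np_offset
               and confidence(noun_phrase, e) >= threshold]
    return max(matches, key=lambda e: e["offset"], default=None)


# Hypothetical entity record; categories are taken from the list in the thread.
microsoft = {
    "name": "Microsoft",
    "offset": 0,
    "types": ["Company"],
    "categories": [
        "Software_companies_of_the_United_States",
        "Companies_based_in_Redmond,_Washington",
    ],
}
```

With this record, resolve("The software company", 36, [microsoft]) links the
phrase to the Microsoft entry, while "The company" is rejected in step 1
because it carries no second noun.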
