Re: Named entity coref resolution based on dbpedia categories and rdf:type

Rupert Westenthaler Tue, 11 Mar 2014 00:48:25 -0700

Hi Cristian,

NER Annotations are typically available as both
NlpAnnotations.NER_ANNOTATION and fise:TextAnnotation [1] in the
enhancement metadata. As you are already accessing the AnalysedText, I
would prefer using NlpAnnotations.NER_ANNOTATION.


best
Rupert

[1] http://stanbol.apache.org/docs/trunk/components/enhancer/enhancementstructure.html#fisetextannotation

On Mon, Mar 10, 2014 at 10:07 PM, Cristian Petroaca
<[email protected]> wrote:
> Thanks.
> I assume I should get the Named Entities in the same way, but with
> NlpAnnotations.NER_ANNOTATION?
>
>
>
> 2014-03-10 13:29 GMT+02:00 Rupert Westenthaler <[email protected]>:
>
>> Hallo Cristian,
>>
>> NounPhrases are not added to the RDF enhancement results. You need to
>> use the AnalysedText ContentPart [1].
>>
>> Here is some demo code you can use in the computeEnhancement method:
>>
>>         AnalysedText at = NlpEngineHelper.getAnalysedText(this, ci, true);
>>         Iterator<? extends Section> sections = at.getSentences();
>>         if(!sections.hasNext()){ //process as single sentence
>>             sections = Collections.singleton(at).iterator();
>>         }
>>
>>         while(sections.hasNext()){
>>             Section section = sections.next();
>>             Iterator<Span> chunks =
>>                 section.getEnclosed(EnumSet.of(SpanTypeEnum.Chunk));
>>             while(chunks.hasNext()){
>>                 Span chunk = chunks.next();
>>                 Value<PhraseTag> phrase =
>>                     chunk.getAnnotation(NlpAnnotations.PHRASE_ANNOTATION);
>>                 //skip chunks that carry no phrase annotation
>>                 if(phrase != null &&
>>                         phrase.value().getCategory() == LexicalCategory.Noun){
>>                     log.info(" - NounPhrase [{},{}] {}", new Object[]{
>>                         chunk.getStart(),chunk.getEnd(),chunk.getSpan()});
>>                 }
>>             }
>>         }
>>
>> hope this helps
>>
>> best
>> Rupert
>>
>> [1] http://stanbol.apache.org/docs/trunk/components/enhancer/nlp/analyzedtext
>>
>> On Sun, Mar 9, 2014 at 6:07 PM, Cristian Petroaca
>> <[email protected]> wrote:
>> > I started to implement the engine and I'm having problems getting
>> > results for noun phrases. I modified the "default" weighted chain to
>> > also include the PosChunkerEngine and ran a sample text: "Angela Merkel
>> > visited China. The German chancellor met with various people". I
>> > expected that the RDF/XML output would contain some info about the
>> > noun phrases, but I cannot see any.
>> > Could you point me to the correct way to generate the noun phrases?
>> >
>> > Thanks,
>> > Cristian
>> >
>> >
>> > 2014-02-09 14:15 GMT+02:00 Cristian Petroaca <[email protected]>:
>> >
>> >> Opened https://issues.apache.org/jira/browse/STANBOL-1279
>> >>
>> >>
>> >> 2014-02-07 10:53 GMT+02:00 Cristian Petroaca <[email protected]>:
>> >>
>> >> Hi Rupert,
>> >>>
>> >>> The "spatial" dimension is a good idea. I'll also take a look at Yago.
>> >>>
>> >>> I will create a Jira with what we talked about here. It will probably
>> >>> have just a draft-like description for now and will be updated as I go
>> >>> along.
>> >>>
>> >>> Thanks,
>> >>> Cristian
>> >>>
>> >>>
>> >>> 2014-02-06 15:39 GMT+02:00 Rupert Westenthaler <[email protected]>:
>> >>>
>> >>> Hi Cristian,
>> >>>>
>> >>>> definitely an interesting approach. You should have a look at Yago2
>> >>>> [1]. As far as I can remember, the Yago taxonomy is much better
>> >>>> structured than the one used by dbpedia. Mapping suggestions of dbpedia
>> >>>> to concepts in Yago2 is easy, as both dbpedia and Yago2 provide
>> >>>> mappings [2] and [3].
>> >>>>
>> >>>> > 2014-02-05 15:39 GMT+02:00 Rafa Haro <[email protected]>:
>> >>>> >>
>> >>>> >> "Microsoft posted its 2013 earnings. The Redmond's company made a
>> >>>> >> huge profit".
>> >>>>
>> >>>> That's actually a very good example. Spatial contexts are very
>> >>>> important as they are often used for referencing. So I would
>> >>>> suggest treating the spatial context specially. For spatial Entities
>> >>>> (like a City) this is easy, but even for others (like a Person or
>> >>>> Company) you could use relations to spatial entities to define their
>> >>>> spatial context. This context could then be used to correctly link
>> >>>> "The Redmond's company" to "Microsoft".
>> >>>>
>> >>>> In addition, I would suggest using the "spatial" context of each
>> >>>> entity (basically its relations to entities that are cities, regions,
>> >>>> countries) as a separate dimension, because those are very often used
>> >>>> for coreferences.
>> >>>>
>> >>>> [1] http://www.mpi-inf.mpg.de/yago-naga/yago/
>> >>>> [2] http://downloads.dbpedia.org/3.9/links/yago_links.nt.bz2
>> >>>> [3] http://www.mpi-inf.mpg.de/yago-naga/yago/download/yago/yagoDBpediaInstances.ttl.7z
>> >>>>
>> >>>>
>> >>>> On Thu, Feb 6, 2014 at 10:33 AM, Cristian Petroaca
>> >>>> <[email protected]> wrote:
>> >>>> > There are several dbpedia categories for each entity. In this case,
>> >>>> > for Microsoft, we have:
>> >>>> >
>> >>>> > category:Companies_in_the_NASDAQ-100_Index
>> >>>> > category:Microsoft
>> >>>> > category:Software_companies_of_the_United_States
>> >>>> > category:Software_companies_based_in_Washington_(state)
>> >>>> > category:Companies_established_in_1975
>> >>>> > category:1975_establishments_in_the_United_States
>> >>>> > category:Companies_based_in_Redmond,_Washington
>> >>>> > category:Multinational_companies_headquartered_in_the_United_States
>> >>>> > category:Cloud_computing_providers
>> >>>> > category:Companies_in_the_Dow_Jones_Industrial_Average
>> >>>> >
>> >>>> > So we also have "Companies based in Redmond, Washington", which
>> >>>> > could be matched.
>> >>>> >
>> >>>> >
>> >>>> > There is still other contextual information from dbpedia which can
>> >>>> > be used. For example, for an Organization we could also include:
>> >>>> > dbpprop:industry = Software
>> >>>> > dbpprop:service = Online Service Providers
>> >>>> >
>> >>>> > and for a Person (that's for Barack Obama) :
>> >>>> >
>> >>>> > dbpedia-owl:profession:
>> >>>> >                                dbpedia:Author
>> >>>> >                                dbpedia:Constitutional_law
>> >>>> >                                dbpedia:Lawyer
>> >>>> >                                dbpedia:Community_organizing
>> >>>> >
>> >>>> > I'd like to continue investigating this, as I think it may have
>> >>>> > some value in increasing the number of coreference resolutions. I'd
>> >>>> > like to concentrate more on precision than on recall, since we
>> >>>> > already have a set of coreferences detected by the stanford nlp tool
>> >>>> > and this would be an addition to that (at least this is how I would
>> >>>> > like to use it).
>> >>>> >
>> >>>> > Is it ok if I track this by opening a jira? I could update it to
>> >>>> > show my progress and also my conclusions. If it turns out that it
>> >>>> > was a bad idea, then at least I'll end up with more knowledge about
>> >>>> > Stanbol in the end :).
>> >>>> >
>> >>>> >
>> >>>> > 2014-02-05 15:39 GMT+02:00 Rafa Haro <[email protected]>:
>> >>>> >
>> >>>> >> Hi Cristian,
>> >>>> >>
>> >>>> >> The approach sounds nice. I don't want to be the devil's advocate,
>> >>>> >> but I'm just not sure about the recall using the dbpedia categories
>> >>>> >> feature. For example, your sentence could also be "Microsoft posted
>> >>>> >> its 2013 earnings. The Redmond's company made a huge profit". So
>> >>>> >> maybe including more contextual information from dbpedia could
>> >>>> >> increase the recall, but of course it will reduce the precision.
>> >>>> >>
>> >>>> >> Cheers,
>> >>>> >> Rafa
>> >>>> >>
>> >>>> >> El 04/02/14 09:50, Cristian Petroaca escribió:
>> >>>> >>
>> >>>> >>> Back with a more detailed description of the steps for making this
>> >>>> >>> kind of coreference work.
>> >>>> >>>
>> >>>> >>> I will refer to the following text in the steps below in order to
>> >>>> >>> make things clearer: "Microsoft posted its 2013 earnings. The
>> >>>> >>> software company made a huge profit."
>> >>>> >>>
>> >>>> >>> 1. For every noun phrase in the text which has:
>> >>>> >>>      a. a determiner pos tag which implies a reference to an entity
>> >>>> >>> local to the text (such as "the", "this", "these"), but not one
>> >>>> >>> which implies a reference to an entity outside of the text (such
>> >>>> >>> as "another", "every", etc.).
>> >>>> >>>      b. at least one other noun besides the main required noun
>> >>>> >>> which further describes it. For example, I will not count "The
>> >>>> >>> company" as a legitimate candidate, since this could create a lot
>> >>>> >>> of false positives because of the double meaning of some words,
>> >>>> >>> as in "in the company of good people".
>> >>>> >>> "The software company" is a good candidate since we also have
>> >>>> >>> "software".
>> >>>> >>>
>> >>>> >>> 2. Match the nouns in the noun phrase to the contents of the
>> >>>> >>> dbpedia categories of each named entity found prior to the location
>> >>>> >>> of the noun phrase in the text.
>> >>>> >>> The dbpedia categories are in the following format (for Microsoft,
>> >>>> >>> for example): "Software companies of the United States".
>> >>>> >>> So we try to match "software company" with that.
>> >>>> >>> First, as you can see, the main noun in the dbpedia category has a
>> >>>> >>> plural form, and it's the same for all the categories I saw. I
>> >>>> >>> don't know if there's an easier way to do this, but I thought of
>> >>>> >>> applying a lemmatizer to the category and the noun phrase in order
>> >>>> >>> for them to have a common denominator. This also works if the noun
>> >>>> >>> phrase itself has a plural form.
>> >>>> >>>
>> >>>> >>> Second, for the comparison I'll need to use only the words in the
>> >>>> >>> category which are themselves nouns, and not prepositions or
>> >>>> >>> determiners such as "of the". This means that I need to pos tag
>> >>>> >>> the category contents as well.
>> >>>> >>> I was thinking of running the pos tagger and lemmatizer on the
>> >>>> >>> dbpedia categories when building the dbpedia backed entity hub and
>> >>>> >>> storing the results for later use. I don't know how feasible this
>> >>>> >>> is at the moment.
>> >>>> >>>
>> >>>> >>> After this I can compare each noun in the noun phrase with the
>> >>>> >>> equivalent nouns in the categories, and based on the number of
>> >>>> >>> matches I can create a confidence level.
>> >>>> >>>
>> >>>> >>> 3. Match the noun of the noun phrase with the rdf:type from dbpedia
>> >>>> >>> of the named entity. If this matches, increase the confidence level.
>> >>>> >>>
>> >>>> >>> 4. If there are multiple named entities which can match a certain
>> >>>> >>> noun phrase, then link the noun phrase with the closest named entity
>> >>>> >>> prior to it in the text.
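
Steps 1 to 4 above could be sketched roughly as follows in plain Java. This is only a minimal illustration under stated assumptions, not the engine's implementation: the class CategoryCorefSketch, its Entity holder, the stop word list and the integer score are all hypothetical; every word after the determiner is assumed to be a noun (a real engine would use the POS tags from the AnalysedText); and lemma() is a crude plural stripper standing in for a real lemmatizer.

```java
import java.util.*;

/** Sketch of the category-based coreference heuristic (all names hypothetical). */
class CategoryCorefSketch {

    // Step 1a: determiners that point at an entity already mentioned in the text.
    static final Set<String> LOCAL_DETERMINERS = Set.of("the", "this", "these");
    // Assumption: a small stop word list replaces POS tagging of category labels.
    static final Set<String> STOP_WORDS = Set.of("of", "the", "in", "based", "and");

    /** A named entity mention: start offset, rdf:type label and category labels. */
    static final class Entity {
        final String name; final int start; final String type; final List<String> categories;
        Entity(String name, int start, String type, List<String> categories) {
            this.name = name; this.start = start; this.type = type; this.categories = categories;
        }
    }

    /** Crude plural stripping; a placeholder for a real lemmatizer. */
    static String lemma(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        if (w.endsWith("ies") && w.length() > 4) return w.substring(0, w.length() - 3) + "y";
        if (w.endsWith("s") && !w.endsWith("ss") && w.length() > 3) return w.substring(0, w.length() - 1);
        return w;
    }

    /** Lower-cased word tokens; single characters (e.g. a possessive "s") are dropped. */
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.split("[^\\p{L}\\p{Nd}]+")) {
            if (t.length() > 1) out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    /** Steps 1-3: 0 if the phrase is no candidate, else category hits plus a type bonus. */
    static int score(String nounPhrase, Entity e) {
        List<String> words = tokens(nounPhrase);
        if (words.isEmpty() || !LOCAL_DETERMINERS.contains(words.get(0))) return 0; // step 1a
        List<String> nouns = new ArrayList<>();
        for (String w : words.subList(1, words.size())) nouns.add(lemma(w));
        if (nouns.size() < 2) return 0; // step 1b: head noun plus at least one more noun
        int best = 0; // step 2: best overlap with any single category
        for (String category : e.categories) {
            Set<String> catNouns = new HashSet<>();
            for (String w : tokens(category)) {
                if (!STOP_WORDS.contains(w)) catNouns.add(lemma(w));
            }
            int hits = 0;
            for (String n : nouns) if (catNouns.contains(n)) hits++;
            best = Math.max(best, hits);
        }
        String head = nouns.get(nouns.size() - 1); // assumption: last word is the head noun
        return best + (head.equals(lemma(e.type)) ? 1 : 0); // step 3: rdf:type bonus
    }

    /** Step 4: closest preceding entity whose score reaches minScore. */
    static Optional<Entity> resolve(String nounPhrase, int phraseStart,
                                    List<Entity> entities, int minScore) {
        Entity closest = null;
        for (Entity e : entities) {
            if (e.start >= phraseStart || score(nounPhrase, e) < minScore) continue;
            if (closest == null || e.start > closest.start) closest = e;
        }
        return Optional.ofNullable(closest);
    }
}
```

With an Entity for Microsoft carrying the categories "Software_companies_of_the_United_States" and "Companies_based_in_Redmond,_Washington" and the type "Company", both "The software company" and "The Redmond's company" get two category hits plus the type bonus, while "The company" and "Another software company" are rejected in step 1. The minScore threshold is where precision could be traded against recall.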
>> >>>> >>>
>> >>>> >>> What do you think?
>> >>>> >>>
>> >>>> >>> Cristian
>> >>>> >>>
>> >>>> >>> 2014-01-31 Cristian Petroaca <[email protected]>:
>> >>>> >>>
>> >>>> >>>  Hi Rafa,
>> >>>> >>>>
>> >>>> >>>> I don't yet have a concrete heuristic, but I'm working on it. I'll
>> >>>> >>>> provide it here so that you guys can give me feedback on it.
>> >>>> >>>>
>> >>>> >>>> What are "locality" features?
>> >>>> >>>>
>> >>>> >>>> I looked at Bart and other coref tools such as ArkRef and
>> >>>> >>>> CherryPicker, and they don't provide such a coreference.
>> >>>> >>>>
>> >>>> >>>> Cristian
>> >>>> >>>>
>> >>>> >>>>
>> >>>> >>>> 2014-01-30 Rafa Haro <[email protected]>:
>> >>>> >>>>
>> >>>> >>>> Hi Cristian,
>> >>>> >>>>
>> >>>> >>>>> Without having more details about your concrete heuristic, in my
>> >>>> >>>>> honest opinion, such an approach could produce a lot of false
>> >>>> >>>>> positives. I don't know if you are planning to use some "locality"
>> >>>> >>>>> features to detect such coreferences, but you need to take into
>> >>>> >>>>> account that it is quite usual for coreferenced mentions to occur
>> >>>> >>>>> even in different paragraphs. Although I'm not an expert in
>> >>>> >>>>> Natural Language Understanding, I would say it is quite difficult
>> >>>> >>>>> to get decent precision/recall rates for coreferencing using fixed
>> >>>> >>>>> rules. Maybe you can give other tools like BART
>> >>>> >>>>> (http://www.bart-coref.org/) a try.
>> >>>> >>>>>
>> >>>> >>>>> Cheers,
>> >>>> >>>>> Rafa Haro
>> >>>> >>>>>
>> >>>> >>>>> El 30/01/14 10:33, Cristian Petroaca escribió:
>> >>>> >>>>>
>> >>>> >>>>>   Hi,
>> >>>> >>>>>
>> >>>> >>>>>> One of the necessary steps for implementing the Event extraction
>> >>>> >>>>>> Engine feature (https://issues.apache.org/jira/browse/STANBOL-1121)
>> >>>> >>>>>> is to have coreference resolution in the given text. This is
>> >>>> >>>>>> provided now via the stanford-nlp project, but as far as I saw this
>> >>>> >>>>>> module performs mostly pronominal (He, She) or nominal (Barack
>> >>>> >>>>>> Obama and Mr. Obama) coreference resolution.
>> >>>> >>>>>>
>> >>>> >>>>>> In order to get more coreferences from the text I thought of
>> >>>> >>>>>> creating some logic that would detect this kind of coreference:
>> >>>> >>>>>> "Apple reaches new profit heights. The software company just
>> >>>> >>>>>> announced its 2013 earnings."
>> >>>> >>>>>> Here "The software company" obviously refers to "Apple".
>> >>>> >>>>>> So I'd like to detect coreferences of Named Entities which are of
>> >>>> >>>>>> the rdf:type of the Named Entity, in this case "company", and also
>> >>>> >>>>>> have attributes which can be found in the dbpedia categories of
>> >>>> >>>>>> the named entity, in this case "software".
>> >>>> >>>>>>
>> >>>> >>>>>> The detection of coreferences such as "The software company" in
>> >>>> >>>>>> the text would be done either by using the new Pos Tag Based
>> >>>> >>>>>> Phrase extraction Engine (noun phrases) or by using a dependency
>> >>>> >>>>>> tree of the sentence and picking up only subjects or objects.
>> >>>> >>>>>>
>> >>>> >>>>>> At this point I'd like to know if this kind of logic would be
>> >>>> >>>>>> useful as a separate Enhancement Engine in Stanbol (in case the
>> >>>> >>>>>> precision and recall are good enough)?
>> >>>> >>>>>>
>> >>>> >>>>>> Thanks,
>> >>>> >>>>>> Cristian
>> >>>> >>>>>>
>> >>>> >>>>>>
>> >>>> >>>>>>
>> >>>> >>
>> >>>>
>> >>>>
>> >>>>
>> >>>> --
>> >>>> | Rupert Westenthaler             [email protected]
>> >>>> | Bodenlehenstraße 11                             ++43-699-11108907
>> >>>> | A-5500 Bischofshofen
>> >>>>
>> >>>
>> >>>
>> >>
>>
>>
>>
>> --
>> | Rupert Westenthaler             [email protected]
>> | Bodenlehenstraße 11                             ++43-699-11108907
>> | A-5500 Bischofshofen
>>



-- 
| Rupert Westenthaler             [email protected]
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen
