Re: Named entity coref resolution based on dbpedia categories and rdf:type

Cristian Petroaca Mon, 10 Mar 2014 14:08:27 -0700

Thanks.
I assume I should get the named entities the same way, but with
NlpAnnotations.NER_ANNOTATION?




2014-03-10 13:29 GMT+02:00 Rupert Westenthaler <[email protected]>:

> Hallo Cristian,
>
> NounPhrases are not added to the RDF enhancement results. You need to
> use the AnalyzedText ContentPart [1]
>
> here is some demo code you can use in the computeEnhancement method
>
>         AnalysedText at = NlpEngineHelper.getAnalysedText(this, ci, true);
>         Iterator<? extends Section> sections = at.getSentences();
>         if(!sections.hasNext()){ //process as single sentence
>             sections = Collections.singleton(at).iterator();
>         }
>
>         while(sections.hasNext()){
>             Section section = sections.next();
>             Iterator<Span> chunks = section.getEnclosed(EnumSet.of(SpanTypeEnum.Chunk));
>             while(chunks.hasNext()){
>                 Span chunk = chunks.next();
>                 Value<PhraseTag> phrase = chunk.getAnnotation(NlpAnnotations.PHRASE_ANNOTATION);
>                 //guard against chunks that carry no phrase annotation
>                 if(phrase != null && phrase.value().getCategory() == LexicalCategory.Noun){
>                     log.info(" - NounPhrase [{},{}] {}", new Object[]{
>                         chunk.getStart(),chunk.getEnd(),chunk.getSpan()});
>                 }
>             }
>         }
>
> hope this helps
>
> best
> Rupert
>
> [1]
> http://stanbol.apache.org/docs/trunk/components/enhancer/nlp/analyzedtext
>
> On Sun, Mar 9, 2014 at 6:07 PM, Cristian Petroaca
> <[email protected]> wrote:
> > I started to implement the engine and I'm having problems getting
> > results for noun phrases. I modified the "default" weighted chain to also
> > include the PosChunkerEngine and ran a sample text: "Angela Merkel
> > visited China. The German chancellor met with various people". I expected
> > that the RDF XML output would contain some info about the noun phrases
> > but I cannot see any.
> > Could you point me to the correct way to generate the noun phrases?
> >
> > Thanks,
> > Cristian
> >
> >
> > 2014-02-09 14:15 GMT+02:00 Cristian Petroaca <[email protected]>:
> >
> >> Opened https://issues.apache.org/jira/browse/STANBOL-1279
> >>
> >>
> >> 2014-02-07 10:53 GMT+02:00 Cristian Petroaca <
> [email protected]>
> >> :
> >>
> >> Hi Rupert,
> >>>
> >>> The "spatial" dimension is a good idea. I'll also take a look at Yago.
> >>>
> >>> I will create a Jira with what we talked about here. It will probably
> >>> have just a draft-like description for now and will be updated as I go
> >>> along.
> >>>
> >>> Thanks,
> >>> Cristian
> >>>
> >>>
> >>> 2014-02-06 15:39 GMT+02:00 Rupert Westenthaler <[email protected]>:
> >>>
> >>> Hi Cristian,
> >>>>
> >>>> definitely an interesting approach. You should have a look at Yago2
> >>>> [1]. As far as I can remember the Yago taxonomy is much better
> >>>> structured than the one used by dbpedia. Mapping suggestions of dbpedia
> >>>> to concepts in Yago2 is easy as both dbpedia and yago2 provide
> >>>> mappings [2] and [3].
> >>>>
> >>>> > 2014-02-05 15:39 GMT+02:00 Rafa Haro <[email protected]>:
> >>>> >>
> >>>> >> "Microsoft posted its 2013 earnings. The Redmond's company made a
> >>>> >> huge profit".
> >>>>
> >>>> That's actually a very good example. Spatial contexts are very
> >>>> important as they are often used for referencing. So I would
> >>>> suggest treating the spatial context specially. For spatial Entities
> >>>> (like a City) this is easy, but even for others (like a Person or
> >>>> Company) you could use relations to spatial entities to define their
> >>>> spatial context. This context could then be used to correctly link
> >>>> "The Redmond's company" to "Microsoft".
> >>>>
> >>>> In addition I would suggest using the "spatial" context of each
> >>>> entity (basically relations to entities that are cities, regions,
> >>>> countries) as a separate dimension, because those are very often used
> >>>> for coreferences.
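
A minimal sketch of the "spatial context as a separate dimension" idea, under stated assumptions: the map from entity to related places is assumed to have been built beforehand from the entity's relations to cities, regions and countries, and the class and method names (SpatialContextSketch, matchBySpatialContext) are hypothetical, not Stanbol or DBpedia APIs. It only checks whether a place from an entity's spatial context is mentioned in the noun phrase.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: none of these names are Stanbol or DBpedia APIs.
class SpatialContextSketch {

    /** Lower-cases the text and pads it with spaces so places match on word boundaries. */
    static String normalize(String text) {
        String cleaned = text.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}\\p{N}]+", " ").trim();
        return " " + cleaned + " ";
    }

    /**
     * Returns the entities whose spatial context (places they are related to)
     * is mentioned in the noun phrase, in the iteration order of the map.
     */
    static List<String> matchBySpatialContext(String nounPhrase,
                                              Map<String, Set<String>> spatialContext) {
        String normalized = normalize(nounPhrase);
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : spatialContext.entrySet()) {
            for (String place : entry.getValue()) {
                if (normalized.contains(normalize(place))) {
                    matches.add(entry.getKey());
                    break; // one matching place is enough for this entity
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Illustrative data only; a real engine would read this from the entity hub.
        Map<String, Set<String>> spatialContext = new LinkedHashMap<>();
        spatialContext.put("Microsoft",
                new TreeSet<>(Arrays.asList("Redmond", "Washington", "United States")));
        spatialContext.put("Apple",
                new TreeSet<>(Arrays.asList("Cupertino", "California")));
        System.out.println(matchBySpatialContext("The Redmond's company", spatialContext));
    }
}
```

Keeping the places in their own map, apart from category or type matches, is what lets a caller weight a spatial hit differently from the other evidence.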
> >>>>
> >>>> [1] http://www.mpi-inf.mpg.de/yago-naga/yago/
> >>>> [2] http://downloads.dbpedia.org/3.9/links/yago_links.nt.bz2
> >>>> [3] http://www.mpi-inf.mpg.de/yago-naga/yago/download/yago/yagoDBpediaInstances.ttl.7z
> >>>>
> >>>>
> >>>> On Thu, Feb 6, 2014 at 10:33 AM, Cristian Petroaca
> >>>> <[email protected]> wrote:
> >>>> > There are several dbpedia categories for each entity, in this case
> >>>> > for Microsoft we have :
> >>>> >
> >>>> > category:Companies_in_the_NASDAQ-100_Index
> >>>> > category:Microsoft
> >>>> > category:Software_companies_of_the_United_States
> >>>> > category:Software_companies_based_in_Washington_(state)
> >>>> > category:Companies_established_in_1975
> >>>> > category:1975_establishments_in_the_United_States
> >>>> > category:Companies_based_in_Redmond,_Washington
> >>>> > category:Multinational_companies_headquartered_in_the_United_States
> >>>> > category:Cloud_computing_providers
> >>>> > category:Companies_in_the_Dow_Jones_Industrial_Average
> >>>> >
> >>>> > So we also have "Companies based in Redmond, Washington" which could
> >>>> > be matched.
> >>>> >
> >>>> >
> >>>> > There is still other contextual information from dbpedia which can
> >>>> > be used.
> >>>> > For example for an Organization we could also include :
> >>>> > dbpprop:industry = Software
> >>>> > dbpprop:service = Online Service Providers
> >>>> >
> >>>> > and for a Person (that's for Barack Obama) :
> >>>> >
> >>>> > dbpedia-owl:profession:
> >>>> >                                dbpedia:Author
> >>>> >                                dbpedia:Constitutional_law
> >>>> >                                dbpedia:Lawyer
> >>>> >                                dbpedia:Community_organizing
> >>>> >
> >>>> > I'd like to continue investigating this as I think that it may have
> >>>> > some value in increasing the number of coreference resolutions. I'd
> >>>> > like to concentrate more on precision than on recall, since we
> >>>> > already have a set of coreferences detected by the stanford nlp tool
> >>>> > and this would be an addition to that (at least this is how I would
> >>>> > like to use it).
> >>>> >
> >>>> > Is it ok if I track this by opening a jira? I could update it to
> >>>> > show my progress and also my conclusions, and if it turns out that
> >>>> > it was a bad idea then at least I'll end up with more knowledge
> >>>> > about Stanbol in the end :).
> >>>> >
> >>>> >
> >>>> > 2014-02-05 15:39 GMT+02:00 Rafa Haro <[email protected]>:
> >>>> >
> >>>> >> Hi Cristian,
> >>>> >>
> >>>> >> The approach sounds nice. I don't want to be the devil's advocate
> >>>> >> but I'm just not sure about the recall using the dbpedia categories
> >>>> >> feature. For example, your sentence could also be "Microsoft posted
> >>>> >> its 2013 earnings. The Redmond's company made a huge profit". So,
> >>>> >> maybe including more contextual information from dbpedia could
> >>>> >> increase the recall but of course will reduce the precision.
> >>>> >>
> >>>> >> Cheers,
> >>>> >> Rafa
> >>>> >>
> >>>> >> On 04/02/14 09:50, Cristian Petroaca wrote:
> >>>> >>
> >>>> >>> Back with a more detailed description of the steps for making this
> >>>> >>> kind of coreference work.
> >>>> >>>
> >>>> >>> I will be using references to the following text in the steps
> >>>> >>> below in order to make things clearer : "Microsoft posted its 2013
> >>>> >>> earnings. The software company made a huge profit."
> >>>> >>>
> >>>> >>> 1. For every noun phrase in the text which has :
> >>>> >>>      a. a determiner pos which implies reference to an entity local
> >>>> >>> to the text (such as "the, this, these") but not "another, every",
> >>>> >>> etc. which imply a reference to an entity outside of the text.
> >>>> >>>      b. at least another noun aside from the main required noun
> >>>> >>> which further describes it. For example I will not count "The
> >>>> >>> company" as being a legitimate candidate since this could create a
> >>>> >>> lot of false positives by considering the double meaning of some
> >>>> >>> words such as "in the company of good people".
> >>>> >>> "The software company" is a good candidate since we also have
> >>>> >>> "software".
> >>>> >>>
> >>>> >>> 2. match the nouns in the noun phrase to the contents of the
> >>>> >>> dbpedia categories of each named entity found prior to the location
> >>>> >>> of the noun phrase in the text.
> >>>> >>> The dbpedia categories are in the following format (for Microsoft
> >>>> >>> for example) : "Software companies of the United States".
> >>>> >>> So we try to match "software company" with that.
> >>>> >>> First, as you can see, the main noun in the dbpedia category has a
> >>>> >>> plural form and it's the same for all categories which I saw. I
> >>>> >>> don't know if there's an easier way to do this but I thought of
> >>>> >>> applying a lemmatizer on the category and the noun phrase in order
> >>>> >>> for them to have a common denominator. This also works if the noun
> >>>> >>> phrase itself has a plural form.
> >>>> >>>
> >>>> >>> Second, I'll need to use for comparison only the words in the
> >>>> >>> category which are themselves nouns and not prepositions or
> >>>> >>> determiners such as "of the". This means that I need to pos tag the
> >>>> >>> contents of the categories as well.
> >>>> >>> I was thinking of running the pos and lemma on the dbpedia
> >>>> >>> categories when building the dbpedia backed entity hub and storing
> >>>> >>> them for later use - I don't know how feasible this is at the
> >>>> >>> moment.
> >>>> >>>
> >>>> >>> After this I can compare each noun in the noun phrase with the
> >>>> >>> equivalent nouns in the categories and based on the number of
> >>>> >>> matches I can create a confidence level.
> >>>> >>>
> >>>> >>> 3. match the noun of the noun phrase with the rdf:type from dbpedia
> >>>> >>> of the named entity. If this matches increase the confidence level.
> >>>> >>>
> >>>> >>> 4. If there are multiple named entities which can match a certain
> >>>> >>> noun phrase then link the noun phrase with the closest named entity
> >>>> >>> prior to it in the text.
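
The four steps above could be sketched roughly as follows. This is only an illustration under several assumptions, not the engine's implementation: the noun phrase is assumed to arrive already split into a determiner and its nouns, the lemma() helper is a crude plural stripper standing in for a real lemmatizer, the NON_NOUNS stop list stands in for POS tagging the category labels, and the 0.75/0.25 weights and the threshold are arbitrary. All class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch of steps 1-4; none of these names are Stanbol APIs.
class CategoryCorefSketch {

    /** A named entity mention with its DBpedia categories and rdf:type label. */
    static final class Entity {
        final String name; final int offset; final List<String> categories; final String typeLabel;
        Entity(String name, int offset, List<String> categories, String typeLabel) {
            this.name = name; this.offset = offset;
            this.categories = categories; this.typeLabel = typeLabel;
        }
    }

    // Step 1a: determiners taken to refer to an entity already mentioned in the text.
    static final Set<String> LOCAL_DETERMINERS = new HashSet<>(Arrays.asList("the", "this", "these"));
    // Step 2: a small stop list standing in for real POS tagging of the category labels.
    static final Set<String> NON_NOUNS =
            new HashSet<>(Arrays.asList("of", "the", "in", "based", "established"));

    /** Crude plural stripping that stands in for a real lemmatizer. */
    static String lemma(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        if (w.length() > 4 && w.endsWith("ies")) return w.substring(0, w.length() - 3) + "y";
        if (w.length() > 3 && w.endsWith("s") && !w.endsWith("ss")) return w.substring(0, w.length() - 1);
        return w;
    }

    /** Step 1: a local determiner plus at least one noun besides the head noun. */
    static boolean isCandidate(String determiner, List<String> nouns) {
        return LOCAL_DETERMINERS.contains(determiner.toLowerCase(Locale.ROOT)) && nouns.size() >= 2;
    }

    /** Steps 2 and 3: best category overlap (weight 0.75) plus an rdf:type bonus (0.25). */
    static double confidence(List<String> nouns, Entity entity) {
        Set<String> phrase = new HashSet<>();
        for (String noun : nouns) phrase.add(lemma(noun));
        int best = 0;
        for (String category : entity.categories) {
            Set<String> overlap = new HashSet<>();
            for (String token : category.split("[\\s_,()]+")) {
                if (!NON_NOUNS.contains(token.toLowerCase(Locale.ROOT))) overlap.add(lemma(token));
            }
            overlap.retainAll(phrase); // keep only lemmas shared with the noun phrase
            best = Math.max(best, overlap.size());
        }
        String head = lemma(nouns.get(nouns.size() - 1)); // last noun is taken as the head
        double typeBonus = head.equals(lemma(entity.typeLabel)) ? 0.25 : 0.0;
        return 0.75 * best / phrase.size() + typeBonus;
    }

    /** Step 4: among prior entities reaching the threshold, pick the closest one. */
    static Optional<Entity> resolve(String determiner, List<String> nouns, int phraseOffset,
                                    List<Entity> entities, double threshold) {
        if (!isCandidate(determiner, nouns)) return Optional.empty();
        Entity closest = null;
        for (Entity e : entities) {
            if (e.offset >= phraseOffset || confidence(nouns, e) < threshold) continue;
            if (closest == null || e.offset > closest.offset) closest = e;
        }
        return Optional.ofNullable(closest);
    }

    public static void main(String[] args) {
        // "Microsoft posted its 2013 earnings. The software company made a huge profit."
        Entity microsoft = new Entity("Microsoft", 0, Arrays.asList(
                "Software_companies_of_the_United_States",
                "Companies_based_in_Redmond,_Washington"), "Company");
        Optional<Entity> match = resolve("The", Arrays.asList("software", "company"), 36,
                Arrays.asList(microsoft), 0.5);
        System.out.println(match.map(e -> e.name).orElse("no match"));
    }
}
```

Taking the best single category rather than the union of all categories keeps unrelated categories from adding up to a match, which fits the stated preference for precision.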
> >>>> >>>
> >>>> >>> What do you think?
> >>>> >>>
> >>>> >>> Cristian
> >>>> >>>
> >>>> >>> 2014-01-31 Cristian Petroaca <[email protected]>:
> >>>> >>>
> >>>> >>>  Hi Rafa,
> >>>> >>>>
> >>>> >>>> I don't yet have a concrete heuristic but I'm working on it. I'll
> >>>> >>>> provide it here so that you guys can give me feedback on it.
> >>>> >>>>
> >>>> >>>> What are "locality" features?
> >>>> >>>>
> >>>> >>>> I looked at Bart and other coref tools such as ArkRef and
> >>>> >>>> CherryPicker and they don't provide such a coreference.
> >>>> >>>>
> >>>> >>>> Cristian
> >>>> >>>>
> >>>> >>>>
> >>>> >>>> 2014-01-30 Rafa Haro <[email protected]>:
> >>>> >>>>
> >>>> >>>> Hi Cristian,
> >>>> >>>>
> >>>> >>>>> Without having more details about your concrete heuristic, in my
> >>>> >>>>> honest opinion, such an approach could produce a lot of false
> >>>> >>>>> positives. I don't know if you are planning to use some "locality"
> >>>> >>>>> features to detect such coreferences but you need to take into
> >>>> >>>>> account that it is quite usual for coreferenced mentions to occur
> >>>> >>>>> even in different paragraphs. Although I'm not an expert in Natural
> >>>> >>>>> Language Understanding, I would say it is quite difficult to get
> >>>> >>>>> decent precision/recall rates for coreferencing using fixed rules.
> >>>> >>>>> Maybe you can give a try to other tools like BART
> >>>> >>>>> (http://www.bart-coref.org/).
> >>>> >>>>>
> >>>> >>>>> Cheers,
> >>>> >>>>> Rafa Haro
> >>>> >>>>>
> >>>> >>>>> On 30/01/14 10:33, Cristian Petroaca wrote:
> >>>> >>>>>
> >>>> >>>>>   Hi,
> >>>> >>>>>
> >>>> >>>>>> One of the necessary steps for implementing the Event extraction
> >>>> >>>>>> Engine feature : https://issues.apache.org/jira/browse/STANBOL-1121
> >>>> >>>>>> is to have coreference resolution in the given text. This is
> >>>> >>>>>> provided now via the stanford-nlp project but as far as I saw this
> >>>> >>>>>> module is performing mostly pronominal (He, She) or nominal (Barack
> >>>> >>>>>> Obama and Mr. Obama) coreference resolution.
> >>>> >>>>>>
> >>>> >>>>>> In order to get more coreferences from the text I thought of
> >>>> >>>>>> creating some logic that would detect this kind of coreference :
> >>>> >>>>>> "Apple reaches new profit heights. The software company just
> >>>> >>>>>> announced its 2013 earnings."
> >>>> >>>>>> Here "The software company" obviously refers to "Apple".
> >>>> >>>>>> So I'd like to detect coreferences of Named Entities which are of
> >>>> >>>>>> the rdf:type of the Named Entity, in this case "company", and also
> >>>> >>>>>> have attributes which can be found in the dbpedia categories of the
> >>>> >>>>>> named entity, in this case "software".
> >>>> >>>>>>
> >>>> >>>>>> The detection of coreferences such as "The software company" in
> >>>> >>>>>> the text would also be done by either using the new Pos Tag Based
> >>>> >>>>>> Phrase extraction Engine (noun phrases) or by using a dependency
> >>>> >>>>>> tree of the sentence and picking up only subjects or objects.
> >>>> >>>>>>
> >>>> >>>>>> At this point I'd like to know if this kind of logic would be
> >>>> >>>>>> useful as a separate Enhancement Engine (in case the precision and
> >>>> >>>>>> recall are good enough) in Stanbol?
> >>>> >>>>>>
> >>>> >>>>>> Thanks,
> >>>> >>>>>> Cristian
> >>>> >>>>>>
> >>>> >>>>>>
> >>>> >>>>>>
> >>>> >>
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> | Rupert Westenthaler             [email protected]
> >>>> | Bodenlehenstraße 11                             ++43-699-11108907
> >>>> | A-5500 Bischofshofen
> >>>>
> >>>
> >>>
> >>
>
>
>
> --
> | Rupert Westenthaler             [email protected]
> | Bodenlehenstraße 11                             ++43-699-11108907
> | A-5500 Bischofshofen
>
