Thanks. I assume I can get the named entities the same way, but with NlpAnnotations.NER_ANNOTATION instead of NlpAnnotations.PHRASE_ANNOTATION?
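
For reference, the matching heuristic from the quoted thread (steps 1 and 2 of the 04/02 mail) could be sketched roughly as below. This is only an illustration under loud assumptions: it does not use any Stanbol API, the class name CategoryMatchSketch and its methods are hypothetical, a small stop-word list stands in for real POS tagging, and the plural stripping is a naive stand-in for a real lemmatizer.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper (not part of Stanbol): scores a noun phrase against DBpedia category labels. */
final class CategoryMatchSketch {

    // Step 1a: determiners that suggest a reference to an entity already mentioned in the text.
    private static final Set<String> DETERMINERS =
            new HashSet<>(Arrays.asList("the", "this", "these"));

    // Assumption: this tiny stop-word list replaces real POS tagging of phrases and categories.
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "the", "this", "these", "a", "an", "of", "in", "and", "based"));

    /** Naive stand-in for a lemmatizer: lower-cases and strips common English plural endings. */
    static String lemma(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        if (w.endsWith("ies") && w.length() > 4) {
            return w.substring(0, w.length() - 3) + "y"; // companies -> company
        }
        if (w.endsWith("s") && !w.endsWith("ss") && w.length() > 3) {
            return w.substring(0, w.length() - 1);       // providers -> provider
        }
        return w;
    }

    /** Splits text into tokens and returns the lemmas of everything that is not a stop word. */
    static Set<String> contentLemmas(String text) {
        Set<String> lemmas = new HashSet<>();
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            // Skip empty tokens and one-letter debris such as the "s" left over from "Redmond's".
            if (token.length() < 2 || STOP_WORDS.contains(token.toLowerCase(Locale.ROOT))) {
                continue;
            }
            lemmas.add(lemma(token));
        }
        return lemmas;
    }

    /** Step 1: the phrase starts with a local determiner and has at least two content words. */
    static boolean isCandidate(String nounPhrase) {
        String first = nounPhrase.trim().split("\\s+")[0].toLowerCase(Locale.ROOT);
        return DETERMINERS.contains(first) && contentLemmas(nounPhrase).size() >= 2;
    }

    /** Step 2: fraction of the phrase's content words found in the best matching category. */
    static double confidence(String nounPhrase, List<String> categories) {
        Set<String> phrase = contentLemmas(nounPhrase);
        if (phrase.isEmpty()) {
            return 0.0;
        }
        int best = 0;
        for (String category : categories) {
            Set<String> common = contentLemmas(category);
            common.retainAll(phrase);
            best = Math.max(best, common.size());
        }
        return (double) best / phrase.size();
    }
}
```

With this sketch, "The software company" is accepted as a candidate while "The company" is rejected, and matching "The software company" against the category "Software companies of the United States" yields a confidence of 1.0 because both "software" and "company" are found. Steps 3 and 4 (the rdf:type check and picking the closest preceding named entity) are left out on purpose.
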

2014-03-10 13:29 GMT+02:00 Rupert Westenthaler <rupert.westentha...@gmail.com>:

> Hello Cristian,
>
> NounPhrases are not added to the RDF enhancement results. You need to
> use the AnalyzedText ContentPart [1]
>
> here is some demo code you can use in the computeEnhancement method
>
>     AnalysedText at = NlpEngineHelper.getAnalysedText(this, ci, true);
>     Iterator<? extends Section> sections = at.getSentences();
>     if(!sections.hasNext()){ //process as single sentence
>         sections = Collections.singleton(at).iterator();
>     }
>
>     while(sections.hasNext()){
>         Section section = sections.next();
>         Iterator<Span> chunks =
>             section.getEnclosed(EnumSet.of(SpanTypeEnum.Chunk));
>         while(chunks.hasNext()){
>             Span chunk = chunks.next();
>             Value<PhraseTag> phrase =
>                 chunk.getAnnotation(NlpAnnotations.PHRASE_ANNOTATION);
>             //a chunk may have no phrase annotation, so guard against null
>             if(phrase != null
>                     && phrase.value().getCategory() == LexicalCategory.Noun){
>                 log.info(" - NounPhrase [{},{}] {}", new Object[]{
>                     chunk.getStart(),chunk.getEnd(),chunk.getSpan()});
>             }
>         }
>     }
>
> hope this helps
>
> best
> Rupert
>
> [1] http://stanbol.apache.org/docs/trunk/components/enhancer/nlp/analyzedtext
>
> On Sun, Mar 9, 2014 at 6:07 PM, Cristian Petroaca
> <cristian.petro...@gmail.com> wrote:
> > I started to implement the engine and I'm having problems with getting
> > results for noun phrases. I modified the "default" weighted chain to also
> > include the PosChunkerEngine and ran a sample text: "Angela Merkel visted
> > China. The german chancellor met with various people". I expected that the
> > RDF XML output would contain some info about the noun phrases but I cannot
> > see any.
> > Could you point me to the correct way to generate the noun phrases?
> >
> > Thanks,
> > Cristian
> >
> >
> > 2014-02-09 14:15 GMT+02:00 Cristian Petroaca <cristian.petro...@gmail.com>:
> >
> >> Opened https://issues.apache.org/jira/browse/STANBOL-1279
> >>
> >>
> >> 2014-02-07 10:53 GMT+02:00 Cristian Petroaca <cristian.petro...@gmail.com>:
> >>
> >>> Hi Rupert,
> >>>
> >>> The "spatial" dimension is a good idea. I'll also take a look at Yago.
> >>>
> >>> I will create a Jira with what we talked about here. It will probably
> >>> have just a draft-like description for now and will be updated as I go
> >>> along.
> >>>
> >>> Thanks,
> >>> Cristian
> >>>
> >>>
> >>> 2014-02-06 15:39 GMT+02:00 Rupert Westenthaler <rupert.westentha...@gmail.com>:
> >>>
> >>>> Hi Cristian,
> >>>>
> >>>> definitely an interesting approach. You should have a look at Yago2
> >>>> [1]. As far as I can remember the Yago taxonomy is much better
> >>>> structured than the one used by dbpedia. Mapping suggestions of dbpedia
> >>>> to concepts in Yago2 is easy as both dbpedia and yago2 do provide
> >>>> mappings [2] and [3]
> >>>>
> >>>> > 2014-02-05 15:39 GMT+02:00 Rafa Haro <rh...@apache.org>:
> >>>> >>
> >>>> >> "Microsoft posted its 2013 earnings. The Redmond's company made a
> >>>> >> huge profit".
> >>>>
> >>>> That's actually a very good example. Spatial contexts are very
> >>>> important as they tend to be often used for referencing. So I would
> >>>> suggest to specially treat the spatial context. For spatial Entities
> >>>> (like a City) this is easy, but even for others (like a Person,
> >>>> Company) you could use relations to spatial entities to define their
> >>>> spatial context. This context could then be used to correctly link
> >>>> "The Redmond's company" to "Microsoft".
> >>>>
> >>>> In addition I would suggest to use the "spatial" context of each
> >>>> entity (basically relation to entities that are cities, regions,
> >>>> countries) as a separate dimension, because those are very often used
> >>>> for coreferences.
> >>>>
> >>>> [1] http://www.mpi-inf.mpg.de/yago-naga/yago/
> >>>> [2] http://downloads.dbpedia.org/3.9/links/yago_links.nt.bz2
> >>>> [3] http://www.mpi-inf.mpg.de/yago-naga/yago/download/yago/yagoDBpediaInstances.ttl.7z
> >>>>
> >>>>
> >>>> On Thu, Feb 6, 2014 at 10:33 AM, Cristian Petroaca
> >>>> <cristian.petro...@gmail.com> wrote:
> >>>> > There are several dbpedia categories for each entity, in this case for
> >>>> > Microsoft we have:
> >>>> >
> >>>> > category:Companies_in_the_NASDAQ-100_Index
> >>>> > category:Microsoft
> >>>> > category:Software_companies_of_the_United_States
> >>>> > category:Software_companies_based_in_Washington_(state)
> >>>> > category:Companies_established_in_1975
> >>>> > category:1975_establishments_in_the_United_States
> >>>> > category:Companies_based_in_Redmond,_Washington
> >>>> > category:Multinational_companies_headquartered_in_the_United_States
> >>>> > category:Cloud_computing_providers
> >>>> > category:Companies_in_the_Dow_Jones_Industrial_Average
> >>>> >
> >>>> > So we also have "Companies based in Redmond, Washington" which could be
> >>>> > matched.
> >>>> >
> >>>> >
> >>>> > There is still other contextual information from dbpedia which can be
> >>>> > used.
> >>>> > For example for an Organization we could also include:
> >>>> > dbpprop:industry = Software
> >>>> > dbpprop:service = Online Service Providers
> >>>> >
> >>>> > and for a Person (that's for Barack Obama):
> >>>> >
> >>>> > dbpedia-owl:profession:
> >>>> >     dbpedia:Author
> >>>> >     dbpedia:Constitutional_law
> >>>> >     dbpedia:Lawyer
> >>>> >     dbpedia:Community_organizing
> >>>> >
> >>>> > I'd like to continue investigating this as I think that it may have some
> >>>> > value in increasing the number of coreference resolutions. I'd like to
> >>>> > concentrate more on precision rather than recall since we already have a
> >>>> > set of coreferences detected by the stanford nlp tool and this would be
> >>>> > an addition to that (at least this is how I would like to use it).
> >>>> >
> >>>> > Is it ok if I track this by opening a jira? I could update it to show my
> >>>> > progress and also my conclusions, and if it turns out that it was a bad
> >>>> > idea then so be it; at least I'll end up with more knowledge about
> >>>> > Stanbol in the end :).
> >>>> >
> >>>> >
> >>>> > 2014-02-05 15:39 GMT+02:00 Rafa Haro <rh...@apache.org>:
> >>>> >
> >>>> >> Hi Cristian,
> >>>> >>
> >>>> >> The approach sounds nice. I don't want to be the devil's advocate but I'm
> >>>> >> just not sure about the recall using the dbpedia categories feature. For
> >>>> >> example, your sentence could be also "Microsoft posted its 2013 earnings.
> >>>> >> The Redmond's company made a huge profit". So, maybe including more
> >>>> >> contextual information from dbpedia could increase the recall but of
> >>>> >> course will reduce the precision.
> >>>> >>
> >>>> >> Cheers,
> >>>> >> Rafa
> >>>> >>
> >>>> >> On 04/02/14 09:50, Cristian Petroaca wrote:
> >>>> >>
> >>>> >>> Back with a more detailed description of the steps for making this kind
> >>>> >>> of coreference work.
> >>>> >>>
> >>>> >>> I will be using references to the following text in the steps below in
> >>>> >>> order to make things clearer: "Microsoft posted its 2013 earnings. The
> >>>> >>> software company made a huge profit."
> >>>> >>>
> >>>> >>> 1. For every noun phrase in the text which has:
> >>>> >>>    a. a determiner POS which implies a reference to an entity local to
> >>>> >>>       the text (such as "the, this, these"), but not one such as
> >>>> >>>       "another, every", etc., which implies a reference to an entity
> >>>> >>>       outside of the text.
> >>>> >>>    b. at least one other noun aside from the main required noun which
> >>>> >>>       further describes it. For example I will not count "The company"
> >>>> >>>       as being a legitimate candidate since this could create a lot of
> >>>> >>>       false positives by considering the double meaning of some words
> >>>> >>>       such as "in the company of good people".
> >>>> >>>       "The software company" is a good candidate since we also have
> >>>> >>>       "software".
> >>>> >>>
> >>>> >>> 2. match the nouns in the noun phrase to the contents of the dbpedia
> >>>> >>> categories of each named entity found prior to the location of the noun
> >>>> >>> phrase in the text.
> >>>> >>> The dbpedia categories are in the following format (for Microsoft for
> >>>> >>> example): "Software companies of the United States".
> >>>> >>> So we try to match "software company" with that.
> >>>> >>> First, as you can see, the main noun in the dbpedia category has a
> >>>> >>> plural form and it's the same for all categories which I saw. I don't
> >>>> >>> know if there's an easier way to do this but I thought of applying a
> >>>> >>> lemmatizer on the category and the noun phrase in order for them to have
> >>>> >>> a common denominator. This also works if the noun phrase itself has a
> >>>> >>> plural form.
> >>>> >>>
> >>>> >>> Second, I'll need to use for comparison only the words in the category
> >>>> >>> which are themselves nouns and not prepositions or determiners such as
> >>>> >>> "of the". This means that I need to pos tag the categories' contents as
> >>>> >>> well.
> >>>> >>> I was thinking of running the pos and lemma on the dbpedia categories
> >>>> >>> when building the dbpedia backed entity hub and storing them for later
> >>>> >>> use - I don't know how feasible this is at the moment.
> >>>> >>>
> >>>> >>> After this I can compare each noun in the noun phrase with the
> >>>> >>> equivalent nouns in the categories and based on the number of matches I
> >>>> >>> can create a confidence level.
> >>>> >>>
> >>>> >>> 3. match the noun of the noun phrase with the rdf:type from dbpedia of
> >>>> >>> the named entity. If this matches increase the confidence level.
> >>>> >>>
> >>>> >>> 4. If there are multiple named entities which can match a certain noun
> >>>> >>> phrase then link the noun phrase with the closest named entity prior to
> >>>> >>> it in the text.
> >>>> >>>
> >>>> >>> What do you think?
> >>>> >>>
> >>>> >>> Cristian
> >>>> >>>
> >>>> >>> 2014-01-31 Cristian Petroaca <cristian.petro...@gmail.com>:
> >>>> >>>
> >>>> >>>> Hi Rafa,
> >>>> >>>>
> >>>> >>>> I don't yet have a concrete heuristic but I'm working on it. I'll
> >>>> >>>> provide it here so that you guys can give me feedback on it.
> >>>> >>>>
> >>>> >>>> What are "locality" features?
> >>>> >>>>
> >>>> >>>> I looked at Bart and other coref tools such as ArkRef and CherryPicker
> >>>> >>>> and they don't provide such a coreference.
> >>>> >>>>
> >>>> >>>> Cristian
> >>>> >>>>
> >>>> >>>>
> >>>> >>>> 2014-01-30 Rafa Haro <rh...@apache.org>:
> >>>> >>>>
> >>>> >>>>> Hi Cristian,
> >>>> >>>>>
> >>>> >>>>> Without having more details about your concrete heuristic, in my
> >>>> >>>>> honest opinion, such an approach could produce a lot of false
> >>>> >>>>> positives. I don't know if you are planning to use some "locality"
> >>>> >>>>> features to detect such coreferences but you need to take into
> >>>> >>>>> account that it is quite usual that coreferenced mentions can occur
> >>>> >>>>> even in different paragraphs. Although I'm not an expert in Natural
> >>>> >>>>> Language Understanding, I would say it is quite difficult to get
> >>>> >>>>> decent precision/recall rates for coreferencing using fixed rules.
> >>>> >>>>> Maybe you can give a try to other tools like BART
> >>>> >>>>> (http://www.bart-coref.org/).
> >>>> >>>>>
> >>>> >>>>> Cheers,
> >>>> >>>>> Rafa Haro
> >>>> >>>>>
> >>>> >>>>> On 30/01/14 10:33, Cristian Petroaca wrote:
> >>>> >>>>>
> >>>> >>>>>> Hi,
> >>>> >>>>>>
> >>>> >>>>>> One of the necessary steps for implementing the Event extraction
> >>>> >>>>>> Engine feature: https://issues.apache.org/jira/browse/STANBOL-1121
> >>>> >>>>>> is to have coreference resolution in the given text. This is
> >>>> >>>>>> provided now via the stanford-nlp project but as far as I saw this
> >>>> >>>>>> module is performing mostly pronominal (He, She) or nominal (Barack
> >>>> >>>>>> Obama and Mr. Obama) coreference resolution.
> >>>> >>>>>>
> >>>> >>>>>> In order to get more coreferences from the text I thought of
> >>>> >>>>>> creating some logic that would detect this kind of coreference:
> >>>> >>>>>> "Apple reaches new profit heights. The software company just
> >>>> >>>>>> announced its 2013 earnings."
> >>>> >>>>>> Here "The software company" obviously refers to "Apple".
> >>>> >>>>>> So I'd like to detect coreferences of Named Entities which are of
> >>>> >>>>>> the rdf:type of the Named Entity, in this case "company", and also
> >>>> >>>>>> have attributes which can be found in the dbpedia categories of the
> >>>> >>>>>> named entity, in this case "software".
> >>>> >>>>>>
> >>>> >>>>>> The detection of coreferences such as "The software company" in the
> >>>> >>>>>> text would also be done by either using the new Pos Tag Based Phrase
> >>>> >>>>>> extraction Engine (noun phrases) or by using a dependency tree of
> >>>> >>>>>> the sentence and picking up only subjects or objects.
> >>>> >>>>>>
> >>>> >>>>>> At this point I'd like to know if this kind of logic would be useful
> >>>> >>>>>> as a separate Enhancement Engine (in case the precision and recall
> >>>> >>>>>> are good enough) in Stanbol?
> >>>> >>>>>>
> >>>> >>>>>> Thanks,
> >>>> >>>>>> Cristian
> >>>> >>>>>>
> >>>>
> >>>> --
> >>>> | Rupert Westenthaler             rupert.westentha...@gmail.com
> >>>> | Bodenlehenstraße 11             ++43-699-11108907
> >>>> | A-5500 Bischofshofen
> >>>>
>
> --
> | Rupert Westenthaler             rupert.westentha...@gmail.com
> | Bodenlehenstraße 11             ++43-699-11108907
> | A-5500 Bischofshofen
>