I started to implement the engine and I'm having problems getting results for noun phrases. I modified the "default" weighted chain to also include the PosChunkerEngine and ran it on a sample text: "Angela Merkel visited China. The German chancellor met with various people". I expected the RDF/XML output to contain some information about the noun phrases, but I cannot see any. Could you point me to the correct way to generate the noun phrases?
Thanks,
Cristian

2014-02-09 14:15 GMT+02:00 Cristian Petroaca <cristian.petro...@gmail.com>:

> Opened https://issues.apache.org/jira/browse/STANBOL-1279
>
> 2014-02-07 10:53 GMT+02:00 Cristian Petroaca <cristian.petro...@gmail.com>:
>
>> Hi Rupert,
>>
>> The "spatial" dimension is a good idea. I'll also take a look at Yago.
>>
>> I will create a Jira with what we talked about here. It will probably have just a draft-like description for now and will be updated as I go along.
>>
>> Thanks,
>> Cristian
>>
>> 2014-02-06 15:39 GMT+02:00 Rupert Westenthaler <rupert.westentha...@gmail.com>:
>>
>>> Hi Cristian,
>>>
>>> definitely an interesting approach. You should have a look at Yago2 [1]. As far as I can remember the Yago taxonomy is much better structured than the one used by dbpedia. Mapping suggestions of dbpedia to concepts in Yago2 is easy as both dbpedia and yago2 do provide mappings [2] and [3].
>>>
>>> > 2014-02-05 15:39 GMT+02:00 Rafa Haro <rh...@apache.org>:
>>> >>
>>> >> "Microsoft posted its 2013 earnings. The Redmond's company made a huge profit".
>>>
>>> That's actually a very good example. Spatial contexts are very important as they tend to be used often for referencing. So I would suggest treating the spatial context specially. For spatial Entities (like a City) this is easy, but even for others (like a Person, Company) you could use relations to spatial entities to define their spatial context. This context could then be used to correctly link "The Redmond's company" to "Microsoft".
>>>
>>> In addition I would suggest using the "spatial" context of each entity (basically relations to entities that are cities, regions, countries) as a separate dimension, because those are very often used for coreferences.
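To make the "spatial context as a separate dimension" suggestion concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not Stanbol code: the property name `locationCity`, the hand-written relation list and the set of spatial labels are all hypothetical stand-ins for whatever the entity hub would actually provide, and only single-token place names are handled.

```python
def spatial_context(relations, spatial_labels):
    """Collect the spatial entities an entity is related to, lower-cased.

    relations: (property, target_label) pairs for one entity (hypothetical data).
    spatial_labels: labels known to denote cities, regions or countries.
    """
    return {target.lower() for _prop, target in relations
            if target in spatial_labels}


def strip_possessive(token):
    """Turn "Redmond's" into "Redmond"; leave other tokens unchanged."""
    return token[:-2] if token.endswith("'s") else token


def spatial_overlap(phrase_tokens, context):
    """Count phrase tokens that name a spatial entity in the context."""
    tokens = {strip_possessive(t).lower() for t in phrase_tokens}
    return len(tokens & context)


# Hypothetical input: in practice this would come from dbpedia/Yago relations.
spatial_labels = {"Redmond", "Washington"}
microsoft_relations = [("locationCity", "Redmond"), ("industry", "Software")]

context = spatial_context(microsoft_relations, spatial_labels)
print(spatial_overlap(["The", "Redmond's", "company"], context))
```

A non-zero overlap would then count as evidence, on its own dimension, that "The Redmond's company" refers to the entity whose spatial context was used.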
>>>
>>> [1] http://www.mpi-inf.mpg.de/yago-naga/yago/
>>> [2] http://downloads.dbpedia.org/3.9/links/yago_links.nt.bz2
>>> [3] http://www.mpi-inf.mpg.de/yago-naga/yago/download/yago/yagoDBpediaInstances.ttl.7z
>>>
>>> On Thu, Feb 6, 2014 at 10:33 AM, Cristian Petroaca <cristian.petro...@gmail.com> wrote:
>>> > There are several dbpedia categories for each entity. In this case, for Microsoft we have:
>>> >
>>> > category:Companies_in_the_NASDAQ-100_Index
>>> > category:Microsoft
>>> > category:Software_companies_of_the_United_States
>>> > category:Software_companies_based_in_Washington_(state)
>>> > category:Companies_established_in_1975
>>> > category:1975_establishments_in_the_United_States
>>> > category:Companies_based_in_Redmond,_Washington
>>> > category:Multinational_companies_headquartered_in_the_United_States
>>> > category:Cloud_computing_providers
>>> > category:Companies_in_the_Dow_Jones_Industrial_Average
>>> >
>>> > So we also have "Companies based in Redmond, Washington" which could be matched.
>>> >
>>> > There is still other contextual information from dbpedia which can be used. For example for an Organization we could also include:
>>> > dbpprop:industry = Software
>>> > dbpprop:service = Online Service Providers
>>> >
>>> > and for a Person (that's for Barack Obama):
>>> >
>>> > dbpedia-owl:profession:
>>> > dbpedia:Author
>>> > dbpedia:Constitutional_law
>>> > dbpedia:Lawyer
>>> > dbpedia:Community_organizing
>>> >
>>> > I'd like to continue investigating this as I think that it may have some value in increasing the number of coreference resolutions. I'd like to concentrate more on precision than on recall, since we already have a set of coreferences detected by the stanford nlp tool and this would be an addition to that (at least this is how I would like to use it).
>>> >
>>> > Is it ok if I track this by opening a jira? I could update it to show my progress and also my conclusions, and if it turns out that it was a bad idea then at least I'll end up with more knowledge about Stanbol in the end :).
>>> >
>>> > 2014-02-05 15:39 GMT+02:00 Rafa Haro <rh...@apache.org>:
>>> >
>>> >> Hi Cristian,
>>> >>
>>> >> The approach sounds nice. I don't want to be the devil's advocate but I'm just not sure about the recall using the dbpedia categories feature. For example, your sentence could also be "Microsoft posted its 2013 earnings. The Redmond's company made a huge profit". So, maybe including more contextual information from dbpedia could increase the recall but of course will reduce the precision.
>>> >>
>>> >> Cheers,
>>> >> Rafa
>>> >>
>>> >> On 04/02/14 09:50, Cristian Petroaca wrote:
>>> >>
>>> >>> Back with a more detailed description of the steps for making this kind of coreference work.
>>> >>>
>>> >>> I will be using references to the following text in the steps below in order to make things clearer: "Microsoft posted its 2013 earnings. The software company made a huge profit."
>>> >>>
>>> >>> 1. For every noun phrase in the text which has:
>>> >>> a. a determiner POS tag which implies reference to an entity local to the text (such as "the", "this", "these"), but not "another", "every", etc., which imply a reference to an entity outside of the text.
>>> >>> b. at least one other noun, aside from the main required noun, which further describes it. For example I will not count "The company" as being a legitimate candidate since this could create a lot of false positives by considering the double meaning of some words such as "in the company of good people". "The software company" is a good candidate since we also have "software".
>>> >>>
>>> >>> 2. Match the nouns in the noun phrase to the contents of the dbpedia categories of each named entity found prior to the location of the noun phrase in the text.
>>> >>> The dbpedia categories are in the following format (for Microsoft for example): "Software companies of the United States".
>>> >>> So we try to match "software company" with that.
>>> >>> First, as you can see, the main noun in the dbpedia category has a plural form and it's the same for all categories which I saw. I don't know if there's an easier way to do this but I thought of applying a lemmatizer on the category and the noun phrase in order for them to have a common denominator. This also works if the noun phrase itself has a plural form.
>>> >>>
>>> >>> Second, I'll need to use for comparison only the words in the category which are themselves nouns and not prepositions or determiners such as "of the". This means that I need to POS-tag the categories' contents as well.
>>> >>> I was thinking of running the POS tagger and lemmatizer on the dbpedia categories when building the dbpedia backed entity hub and storing the results for later use - I don't know how feasible this is at the moment.
>>> >>>
>>> >>> After this I can compare each noun in the noun phrase with the equivalent nouns in the categories and based on the number of matches I can create a confidence level.
>>> >>>
>>> >>> 3. Match the noun of the noun phrase with the rdf:type from dbpedia of the named entity. If this matches, increase the confidence level.
>>> >>>
>>> >>> 4. If there are multiple named entities which can match a certain noun phrase then link the noun phrase with the closest named entity prior to it in the text.
>>> >>>
>>> >>> What do you think?
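Steps 1 to 4 above can be sketched roughly as follows. This is a minimal illustration in Python under several assumptions that are not in the thread: the noun phrase arrives already POS-tagged with Penn-style tags, `lemma` is a toy plural stripper standing in for a real lemmatizer, the category nouns have been extracted beforehand, the 0.5 bonus for an rdf:type match is arbitrary, and step 4 is read as "highest confidence first, proximity as tie-breaker". None of the names below are Stanbol APIs.

```python
from dataclasses import dataclass

# Determiners that, per step 1a, point at an entity already in the text.
LOCAL_DETERMINERS = {"the", "this", "these"}


def lemma(word: str) -> str:
    """Toy stand-in for a real lemmatizer: lower-case, strip regular plurals."""
    w = word.lower()
    if w.endswith("ies") and len(w) > 4:
        return w[:-3] + "y"
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        return w[:-1]
    return w


@dataclass
class NamedEntity:
    name: str
    position: int        # token offset of the mention in the text
    rdf_types: set       # lemmatized rdf:type labels, e.g. {"company"}
    category_nouns: set  # lemmatized nouns from the dbpedia category labels


def nouns_of(tagged_phrase):
    """Lemmas of the noun tokens in a list of (token, pos_tag) pairs."""
    return [lemma(tok) for tok, pos in tagged_phrase if pos.startswith("NN")]


def is_candidate(tagged_phrase):
    """Step 1: a local determiner followed by at least two nouns."""
    if not tagged_phrase:
        return False
    token, pos = tagged_phrase[0]
    return (pos == "DT" and token.lower() in LOCAL_DETERMINERS
            and len(nouns_of(tagged_phrase)) >= 2)


def confidence(tagged_phrase, entity):
    """Steps 2 and 3: category overlap, plus a bonus for an rdf:type match."""
    nouns = nouns_of(tagged_phrase)
    if not nouns:
        return 0.0
    overlap = sum(1 for n in nouns if n in entity.category_nouns)
    score = overlap / len(nouns)
    if nouns[-1] in entity.rdf_types:  # head noun assumed to be the last noun
        score += 0.5                   # arbitrary bonus, not from the thread
    return score


def resolve(tagged_phrase, phrase_position, entities):
    """Step 4: best-scoring earlier entity, ties broken by proximity."""
    if not is_candidate(tagged_phrase):
        return None
    scored = [(confidence(tagged_phrase, e), e.position, e)
              for e in entities if e.position < phrase_position]
    scored = [s for s in scored if s[0] > 0]
    if not scored:
        return None
    return max(scored, key=lambda s: (s[0], s[1]))[2]


# Nouns taken by hand from "Software companies of the United States".
microsoft = NamedEntity("Microsoft", 0, {"company"},
                        {lemma(w) for w in ["Software", "companies", "States"]})
phrase = [("The", "DT"), ("software", "NN"), ("company", "NN")]
match = resolve(phrase, 6, [microsoft])
print(match.name if match else None)
```

With this input, "The software company" is linked to Microsoft, while a bare "The company" is rejected in step 1 because it has only one noun.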
>>> >>>
>>> >>> Cristian
>>> >>>
>>> >>> 2014-01-31 Cristian Petroaca <cristian.petro...@gmail.com>:
>>> >>>
>>> >>>> Hi Rafa,
>>> >>>>
>>> >>>> I don't yet have a concrete heuristic but I'm working on it. I'll provide it here so that you guys can give me feedback on it.
>>> >>>>
>>> >>>> What are "locality" features?
>>> >>>>
>>> >>>> I looked at Bart and other coref tools such as ArkRef and CherryPicker and they don't provide such a coreference.
>>> >>>>
>>> >>>> Cristian
>>> >>>>
>>> >>>> 2014-01-30 Rafa Haro <rh...@apache.org>:
>>> >>>>
>>> >>>>> Hi Cristian,
>>> >>>>>
>>> >>>>> Without having more details about your concrete heuristic, in my honest opinion, such an approach could produce a lot of false positives. I don't know if you are planning to use some "locality" features to detect such coreferences but you need to take into account that it is quite usual that coreferenced mentions occur even in different paragraphs. Although I'm not an expert in Natural Language Understanding, I would say it is quite difficult to get decent precision/recall rates for coreferencing using fixed rules. Maybe you can give a try to other tools like BART (http://www.bart-coref.org/).
>>> >>>>>
>>> >>>>> Cheers,
>>> >>>>> Rafa Haro
>>> >>>>>
>>> >>>>> On 30/01/14 10:33, Cristian Petroaca wrote:
>>> >>>>>
>>> >>>>>> Hi,
>>> >>>>>>
>>> >>>>>> One of the necessary steps for implementing the Event extraction Engine feature (https://issues.apache.org/jira/browse/STANBOL-1121) is to have coreference resolution in the given text. This is provided now via the stanford-nlp project but as far as I saw this module is performing mostly pronominal (He, She) or nominal (Barack Obama and Mr. Obama) coreference resolution.
>>> >>>>>>
>>> >>>>>> In order to get more coreferences from the text I thought of creating some logic that would detect this kind of coreference:
>>> >>>>>> "Apple reaches new profit heights. The software company just announced its 2013 earnings."
>>> >>>>>> Here "The software company" obviously refers to "Apple".
>>> >>>>>> So I'd like to detect coreferences of Named Entities which are of the rdf:type of the Named Entity, in this case "company", and also have attributes which can be found in the dbpedia categories of the named entity, in this case "software".
>>> >>>>>>
>>> >>>>>> The detection of coreferences such as "The software company" in the text would also be done by either using the new Pos Tag Based Phrase extraction Engine (noun phrases) or by using a dependency tree of the sentence and picking up only subjects or objects.
>>> >>>>>>
>>> >>>>>> At this point I'd like to know if this kind of logic would be useful as a separate Enhancement Engine (in case the precision and recall are good enough) in Stanbol?
>>> >>>>>>
>>> >>>>>> Thanks,
>>> >>>>>> Cristian
>>>
>>> --
>>> | Rupert Westenthaler rupert.westentha...@gmail.com
>>> | Bodenlehenstraße 11 ++43-699-11108907
>>> | A-5500 Bischofshofen