Jörn,

Great, that helps quite a lot! It is similar to the main method of that class, but your explanations and the trick for integrating the NERs are the real silver bullet of your post. Thanks a bunch.
I will play with that tonight. What I want is to define a few IE algorithms based on it. Are there any references on Information Extraction using OpenNLP that you would recommend as well?

Thanks,

Carlos.

On Thu, Jun 14, 2012 at 9:14 AM, Jörn Kottmann <[email protected]> wrote:

> Hello,
>
> the input for the coreference component needs to be preprocessed
> with the sentence detector, tokenizer, parser and name finders.
>
> You can do this via the API, and our documentation provides sample code for
> each of these steps.
>
> The only tricky part is getting the named entities into the parse tree.
> Here is a sample:
> Parse parse; // returned from parser
> Span personEntites[]; // returned from person name model
> ....
> Parse.addNames("person", personEntites, parse.getTagNodes());
>
> After this the person names are inserted into the parse tree; you need
> to repeat this step for every entity type you would like to reference. The
> "person"
> tags are currently hard coded. You can find a list in
> TreebankNameFinder.NAME_TYPES
> (I believe that's a trunk-only class).
>
> Before you start with the rest you should download all the coreferencer
> models for 1.4
> into one directory, similar to the structure on the server.
>
> Now we are coming to the coreference resolution code:
> Linker treebankLinker = new TreebankLinker("/home/joern/corefmodel/",
> LinkerMode.TEST);
>
> This will create the linker for you.
>
> First, all the mentions need to be recognized, and afterward they are linked
> together.
> For every sentence you do this:
> Parse p = ...; // contains a parse of a sentence with names
> Mention[] extents = treebankLinker.getMentionFinder().getMentions(new DefaultParse(p, sentenceNumber));
> for (int ei = 0, en = extents.length; ei < en; ei++) {
>   if (extents[ei].getParse() == null) {
>     Parse snp = new Parse(p.getText(), extents[ei].getSpan(), "NML", 1.0, 0);
>     p.insert(snp);
>     extents[ei].setParse(new DefaultParse(snp, sentenceNumber));
>   }
> }
> sentenceNumber++;
>
> The result is the mentions per sentence. All these mention objects should
> be copied into a single list,
> e.g. via document.addAll(Arrays.asList(extents)) (document is of type List<Mention>).
>
> Now the mentions of one document can be linked together:
> DiscourseEntity[] entities = treebankLinker.getEntities(document.toArray(new Mention[document.size()]));
>
> The entities array now contains the various detected and linked entities;
> usually you want to filter out entities
> which have just a single mention. The DiscourseEntity groups mentions
> together; a mention does not have to be an
> entity, other noun phrases are valid mentions as well.
>
> Hope that helps,
> Jörn
>
> On 06/13/2012 07:41 PM, Carlos Scheidecker wrote:
>
>> Jörn,
>>
>> I just want to know how it works for now. I've been following the one from
>> StanfordNLP as well.
>>
>> Basically, I want to first know if I just pass raw text to it or if I have
>> to tag that first. Looks like I need to do POS tagging first.
>>
>> I want to be able to pass a text and get the references as object lists
>> from the API.
>>
>> So I can fetch the relations.
>>
>> I still need to take some time here and read more of the source code unless
>> you have some pointers.
>>
>> Thanks,
>>
>> Carlos.
>>
>> On Wed, Jun 13, 2012 at 11:23 AM, Jörn Kottmann<[email protected]>
>> wrote:
>>
>> On 06/13/2012 07:07 PM, Carlos Scheidecker wrote:
>>>
>>> Thanks. So for now we can only use the models from 1.4. I saw that a
>>>> training class was added recently.
>>>> How do you use that?
>>>>
>>>> That's still work in progress. On which data do you want to train?
>>>
>>> You need to produce data in a certain format; there should be a sample in
>>> the test folder.
>>> It's basically Penn Treebank style plus some nodes to label the mentions
>>> in the tree.
>>>
>>> The parse trees of a document are grouped and sent document-wise
>>> to the trainer via a stream. After this is done, a new model will be
>>> trained.
>>>
>>> The OpenNLP coreferencer currently works only on noun phrases; other
>>> mentions
>>> like verbs will not be resolved (in case you want to train on OntoNotes).
>>>
>>> Jörn
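
As a rough illustration of the last two steps described in the quoted message (copying the per-sentence mentions into one document-level list, then dropping entities that have only a single mention), here is a minimal sketch in plain Java. It deliberately does not use OpenNLP: `CorefSketch`, `Entity`, `collectMentions` and `dropSingletons` are hypothetical names, with plain strings standing in for `Mention` objects and `Entity` standing in for `DiscourseEntity`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Stand-alone sketch of the bookkeeping around the linker; no OpenNLP classes are used. */
final class CorefSketch {

    /** Hypothetical stand-in for a linked entity: it only holds the text of its mentions. */
    static final class Entity {
        final List<String> mentions;

        Entity(String... mentions) {
            this.mentions = Arrays.asList(mentions);
        }
    }

    /** Copies the per-sentence mention arrays into one list, keeping sentence order. */
    static List<String> collectMentions(List<String[]> mentionsPerSentence) {
        List<String> document = new ArrayList<>();
        for (String[] extents : mentionsPerSentence) {
            // An array cannot be passed to addAll directly, so wrap it in a list view first.
            document.addAll(Arrays.asList(extents));
        }
        return document;
    }

    /** Keeps only the entities that group at least two mentions. */
    static List<Entity> dropSingletons(List<Entity> entities) {
        List<Entity> linked = new ArrayList<>();
        for (Entity entity : entities) {
            if (entity.mentions.size() > 1) {
                linked.add(entity);
            }
        }
        return linked;
    }
}
```

With the real API, the same filter would presumably check the mention count of each `DiscourseEntity` returned by `getEntities` instead of the size of a string list.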
