[ https://issues.apache.org/jira/browse/OPENNLP-859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Damiano Porta updated OPENNLP-859:
----------------------------------
    Description:
Hello,
I have created the following training data.

{code:title=train.txt|borderStyle=solid}
Ciao mi chiamo <START:person> Damiano <END> ed abito a Roma .
il mio indirizzo è via del <START:person> Corso <END> nella provincia di Roma .
il mio cap è lo 00144 nella capitale e e il mio nome è <START:person> john <END> .
Abito a Roma in via tar dei tali 10 , <START:person> Mario <END> è il mio amico .
Oggi ho incontrato <START:person> giovanni <END> e siamo andati a giocare a calcio .
{code}

And then this code:

{code:title=test.java|borderStyle=solid}
Charset charset = Charset.forName("UTF-8");
ObjectStream<String> lineStream =
        new PlainTextByLineStream(new FileInputStream("/home/damiano/person.train"), charset);
ObjectStream<NameSample> sampleStream = new NameSampleDataStream(lineStream);

TokenNameFinderModel model;

// Dictionary of known person names
Dictionary dictionary = new Dictionary();
dictionary.put(new StringList(new String[]{"giovanni"}));
dictionary.put(new StringList(new String[]{"maria"}));
dictionary.put(new StringList(new String[]{"luca"}));

BufferedOutputStream aa = null;

// Custom feature generators, including the dictionary lookup
AdaptiveFeatureGenerator featureGenerator = new CachedFeatureGenerator(
        new AdaptiveFeatureGenerator[]{
            new WindowFeatureGenerator(new TokenFeatureGenerator(), 2, 2),
            new WindowFeatureGenerator(new TokenClassFeatureGenerator(true), 2, 2),
            new OutcomePriorFeatureGenerator(),
            new PreviousMapFeatureGenerator(),
            new BigramNameFeatureGenerator(),
            new SentenceFeatureGenerator(true, false),
            new DictionaryFeatureGenerator("person", dictionary)
        });

try {
    model = NameFinderME.train("it", "person", sampleStream, TrainingParameters.defaultParams(),
            featureGenerator, Collections.<String, Object>emptyMap());
}
finally {
    sampleStream.close();
}

// Save trained model
try (BufferedOutputStream modelOut = new BufferedOutputStream(
        new FileOutputStream("/home/damiano/it-person-custom.bin"))) {
    model.serialize(modelOut);
}

// Read the trained model
try (InputStream modelIn = new FileInputStream("/home/damiano/it-person-custom.bin")) {
    TokenNameFinderModel nerModel = new TokenNameFinderModel(modelIn);
    NameFinderME nameFinder = new NameFinderME(nerModel, featureGenerator, NameFinderME.DEFAULT_BEAM_SIZE);

    String sentence[] = new String[]{
        "Ciao", "mi", "chiamo", "Damiano", "e", "sono", "di", "Roma", "."
    };

    Span nameSpans[] = nameFinder.find(sentence);

    System.out.println(Arrays.toString(Span.spansToStrings(nameSpans, sentence)));
}
{code}

When I try `"Ciao", "mi", "chiamo", "Damiano", "e", "sono", "di", "Roma", "."` it correctly detects "Damiano" as PERSON, but if I change the sentence to:
"Ciao", "mi", "chiamo", "maria", "e", "sono", "di", "Roma", "."
it does not detect "maria" as PERSON, even though I added "maria" to the dictionary, so it should find it. Why not?
Thanks!

> Cannot get entities from trained model using DictionaryFeatureGenerator
> ------------------------------------------------------------------------
>
>                 Key: OPENNLP-859
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-859
>             Project: OpenNLP
>          Issue Type: Question
>          Components: Name Finder
>    Affects Versions: 1.6.0
>         Environment: ubuntu 16.04, java 8
>            Reporter: Damiano Porta

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
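
One possible factor, offered as an assumption and not as a confirmed diagnosis of this issue: a dictionary feature is only one input to the statistical model, and training applies a feature cutoff. As far as I know, `TrainingParameters.defaultParams()` uses a cutoff of 5, so a feature seen fewer than five times in the training events is discarded. In the five training sentences above, only "giovanni" is both in the dictionary and tagged as a person, so the dictionary feature would fire once. The plain-Java sketch below illustrates the cutoff idea only. The class `CutoffSketch`, the method `survivingFeatures`, and the feature strings `def` and `dict=person` are hypothetical names for illustration, not OpenNLP's internals.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class CutoffSketch {

    // Count every feature across all events and keep only those
    // that occur at least `cutoff` times.
    static Set<String> survivingFeatures(List<List<String>> events, int cutoff) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> event : events) {
            for (String feature : event) {
                counts.merge(feature, 1, Integer::sum);
            }
        }
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() >= cutoff) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical events: one feature list per training token.
        List<List<String>> events = new ArrayList<>();
        for (int i = 0; i < 5; i++) {
            events.add(Arrays.asList("def"));            // a feature that fires on every token
        }
        events.add(Arrays.asList("def", "dict=person")); // the single dictionary hit

        System.out.println(survivingFeatures(events, 5)); // prints [def]
        System.out.println(survivingFeatures(events, 1)); // prints [def, dict=person]
    }
}
```

If this is what happens here, the model never learns a weight for the dictionary feature, so a dictionary hit on "maria" cannot influence the prediction. Adding more training sentences in which dictionary names are tagged, or lowering the cutoff in the training parameters, would be the things to try; whether either resolves this particular case is untested.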