I now tested a version which doesn't have that change and that one is also broken, must have been caused by another commit.
Jörn On Mon, May 15, 2017 at 11:24 AM, Richard Eckart de Castilho <r...@apache.org > wrote: > Hi Jörn, > > unfortunately, I don't have a comprehensive unit test for the parser. > But from my perspective, it looks like there is something more serious > going on than just a bit of parse tree reordering when suddenly a > determiner (DT, "a") is tagged as a punctuation mark (, ","): > > 1.7.2: (ROOT (S (NP (PRP We)) (VP (VBP need) *!*(NP (NP (DT a)*!* (ADJP > (RB very) (VBN complicated)) (NN example) (NN sentence)) > (, ,) (SBAR (WHNP (WDT which)) (S (VP (VBZ contains) (PP > (IN as) (NP (NP (JJ many) (NNS constituents) (CC and) > (NNS dependencies)) (PP (IN as) (ADJP (JJ possible)))))))))) (. .))) > > 1.8.0: (ROOT (S (NP (PRP We)) (VP (VBP need) *!*(, a) (NP (NP*!* (ADJP > (RB very) (VBN complicated)) (NN example) (NN sentence)) > (, ,) (SBAR (WHNP (WDT which)) (S (VP (VBZ contains) (PP > (IN as) (NP (NP (JJ many) (NNS constituents) (CC and) > (NNS dependencies)) (PP (IN as) (ADJP (JJ possible)))))))))) (. .))) > > IMHO punctuation marks and closed word classes like determiners should be > pretty stable in their labels. There is usually no need for the parser to > invent a tag and the labels seen in the training data should be sufficient. > > Could there maybe be a problem with duplicates being dropped silently > by the move from the ListHeap to the TreeSet? If duplicate removal > is not important, then maybe sorting the heap after it has been filled > would be a better option than using a permanently sorted and de-duplicating > data structure. > > Cheers, > > -- Richard > > > On 15.05.2017, at 10:39, Joern Kottmann <kottm...@gmail.com> wrote: > > > > Hello Richard, > > > > thanks for reporting this. For 1.8.0 we replaced a Heap with a SortedSet > > [1]. In this commit there is one loop [2] which iterates through the > parses > > which will be advanced. The order of the Parsers in the Heap was not so > > well defined, therefore we decided to sort them by probability. > > We also noticed that this change is changing the output of the parser > with > > the existing models in our SourceForge model eval test [3]. > > > > After running the evaluation on the OntoNotes4 data set I only got very > > small change and decided it is ok to do this. I am not aware of how big > the > > change is but is was less than the delta in test case [4] of 0.001. > > > > What do you think? Should this be rolled back? > > > > Anyway, that said, about the parser, I still need to understand what > > happened with the lemmatizer. > > > > Jörn > > > > [1] > > https://github.com/apache/opennlp/commit/3df659b9bfb02084e78 > 2f1e8b6ec716f56e0611c > > [2] > > https://github.com/apache/opennlp/blob/3df659b9bfb02084e782f > 1e8b6ec716f56e0611c/opennlp-tools/src/main/java/opennlp/tools/parser/ > AbstractBottomUpParser.java#L285 > > [3] > > https://github.com/apache/opennlp/commit/3df659b9bfb02084e78 > 2f1e8b6ec716f56e0611c#diff-a5834f32b8a41b76a336126e4b13d4f7L349 > > [4] > > https://github.com/apache/opennlp/blob/3df659b9bfb02084e782f > 1e8b6ec716f56e0611c/opennlp-tools/src/test/java/opennlp/ > tools/eval/OntoNotes4ParserEval.java#L70 > >