I now tested a version which doesn't have that change and that one is also
broken, must have been caused by another commit.

Jörn



On Mon, May 15, 2017 at 11:24 AM, Richard Eckart de Castilho <r...@apache.org
> wrote:

> Hi Jörn,
>
> unfortunately, I don't have a comprehensive unit test for the parser.
> But from my perspective, it looks like there is something more serious
> going on than just a bit of parse tree reordering when suddenly a
> determiner (DT, "a") is tagged as a punctuation mark (, ","):
>
> 1.7.2: (ROOT (S (NP (PRP We)) (VP (VBP need) *!*(NP (NP (DT a)*!* (ADJP
>        (RB very) (VBN complicated)) (NN example) (NN sentence))
>        (, ,) (SBAR (WHNP (WDT which)) (S (VP (VBZ contains) (PP
>        (IN as) (NP (NP (JJ many) (NNS constituents) (CC and)
>        (NNS dependencies)) (PP (IN as) (ADJP (JJ possible)))))))))) (. .)))
>
> 1.8.0: (ROOT (S (NP (PRP We)) (VP (VBP need) *!*(, a) (NP (NP*!* (ADJP
>        (RB very) (VBN complicated)) (NN example) (NN sentence))
>        (, ,) (SBAR (WHNP (WDT which)) (S (VP (VBZ contains) (PP
>        (IN as) (NP (NP (JJ many) (NNS constituents) (CC and)
>        (NNS dependencies)) (PP (IN as) (ADJP (JJ possible)))))))))) (. .)))
>
> IMHO punctuation marks and closed word classes like determiners should be
> pretty stable in their labels. There is usually no need for the parser to
> invent a tag and the labels seen in the training data should be sufficient.
>
> Could there maybe be a problem with duplicates being dropped silently
> by the move from the ListHeap to the TreeSet? If duplicate removal
> is not important, then maybe sorting the heap after it has been filled
> would be a better option than using a permanently sorted and de-duplicating
> data structure.
>
> Cheers,
>
> -- Richard
>
> > On 15.05.2017, at 10:39, Joern Kottmann <kottm...@gmail.com> wrote:
> >
> > Hello Richard,
> >
> > thanks for reporting this. For 1.8.0 we replaced a Heap with a SortedSet
> > [1]. In this commit there is one loop [2] which iterates through the
> parses
> > which will be advanced. The order of the Parsers in the Heap was not so
> > well defined, therefore we decided to sort them by probability.
> > We also noticed that this change is changing the output of the parser
> with
> > the existing models in our SourceForge model eval test [3].
> >
> > After running the evaluation on the OntoNotes4 data set I only got  very
> > small change and decided it is ok to do this. I am not aware of how big
> the
> > change is but is was less than the delta in test case [4] of 0.001.
> >
> > What do you think? Should this be rolled back?
> >
> > Anyway, that said, about the parser, I still need to understand what
> > happened with the lemmatizer.
> >
> > Jörn
> >
> > [1]
> > https://github.com/apache/opennlp/commit/3df659b9bfb02084e78
> 2f1e8b6ec716f56e0611c
> > [2]
> > https://github.com/apache/opennlp/blob/3df659b9bfb02084e782f
> 1e8b6ec716f56e0611c/opennlp-tools/src/main/java/opennlp/tools/parser/
> AbstractBottomUpParser.java#L285
> > [3]
> > https://github.com/apache/opennlp/commit/3df659b9bfb02084e78
> 2f1e8b6ec716f56e0611c#diff-a5834f32b8a41b76a336126e4b13d4f7L349
> > [4]
> > https://github.com/apache/opennlp/blob/3df659b9bfb02084e782f
> 1e8b6ec716f56e0611c/opennlp-tools/src/test/java/opennlp/
> tools/eval/OntoNotes4ParserEval.java#L70
>
>

Reply via email to