On Fri, Sep 6, 2013 at 9:28 PM, Robert Muir <rcm...@gmail.com> wrote:
> It's the latter. The way it's designed to work is, I think, best
> illustrated by the Kuromoji analyzer, which heuristically decompounds nouns:
>
> If it decompounds ABCD into AB + CD, then the tokens are AB and CD.
> These both have posinc=1.
> However (to compensate for the precision issue you mentioned on the other
> thread), it keeps the full compound as a synonym too (there are some
> papers benchmarking this approach for decompounding; just think of IDF
> etc. sorting things out).
> So that ABCD synonym has position increment 0, and it "sits" at the
> same position as the first token (AB). But it has positionLength=2,
> which basically keeps the information in the chain that this "synonym"
> spans both AB and CD.
>
> So the output is like this: AB(posinc=1,posLength=1),
> ABCD(posinc=0,posLength=2), CD(posinc=1,posLength=1)

I suppose this works best if you actually know the offsets of the
pieces. When decompounding German words, this is not always straightforward.
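
To check my reading of the quoted scheme, here is a minimal sketch in
plain Python. It does not use the Lucene API; the Token tuple and the
helper names to_arcs and paths are made up for illustration. It only
models positions (posinc and posLength) and leaves character offsets
out entirely, which is exactly the part I find hard for German.

```python
from collections import namedtuple

# Hypothetical stand-in for a token carrying a position increment
# and a position length; not a Lucene class.
Token = namedtuple("Token", "term pos_inc pos_len")

# The output described in the quoted message for the compound ABCD.
tokens = [
    Token("AB", 1, 1),
    Token("ABCD", 0, 2),
    Token("CD", 1, 1),
]

def to_arcs(tokens):
    """Turn (posinc, posLength) pairs into (start, end, term) arcs."""
    arcs = []
    pos = -1  # the first increment of 1 puts the first token at position 0
    for tok in tokens:
        pos += tok.pos_inc
        arcs.append((pos, pos + tok.pos_len, tok.term))
    return arcs

def paths(arcs, start, end):
    """Enumerate every term sequence that leads from start to end."""
    if start == end:
        return [[]]
    result = []
    for arc_start, arc_end, term in arcs:
        if arc_start == start:
            for rest in paths(arcs, arc_end, end):
                result.append([term] + rest)
    return result

arcs = to_arcs(tokens)
final = max(arc_end for _, arc_end, _ in arcs)
all_paths = paths(arcs, 0, final)
print(arcs)
print(all_paths)
```

Under these assumptions ABCD becomes an arc from position 0 to position
2, so the graph has two paths from start to end: AB followed by CD, and
ABCD on its own. That is how I understand "spans both AB and CD".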
