On Fri, Sep 6, 2013 at 9:28 PM, Robert Muir <rcm...@gmail.com> wrote:
> its the latter. the way its designed to work i think is illustrated
> best in kuromoji analyzer where it heuristically decompounds nouns:
>
> if it decompounds ABCD into AB + CD, then the tokens are AB and CD.
> these both have posinc=1.
> however (to compensate for precision issue you mentioned on the other
> thread), it keeps the full compound as a synonym too (there are some
> papers benchmarking this approach for decompounding, just think of IDF
> etc sorting things out).
> so that ABCD synonym has position increment 0, and it "sits" at the
> same position as the first token (AB). but it has positionLength=2,
> which basically keeps the information in the chain that this "synonym"
> spans across both AB and CD.
>
> so the output is like this: AB(posinc=1,posLength=1),
> ABCD(posinc=0,posLength=2), CD(posinc=1, posLength=1)
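
To make the quoted layout concrete, here is a minimal sketch in plain Python. It is not the Lucene API: Token and spans are hypothetical names made up for illustration. It assumes the position counter starts at -1 before the first token, so that a first posinc of 1 lands on position 0, and it derives each token's start and end position from posinc and posLength alone.

```python
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    """Hypothetical stand-in for a token with position attributes."""
    term: str
    pos_inc: int     # position increment relative to the previous token
    pos_length: int  # number of positions this token spans


def spans(tokens: List[Token]) -> List[Tuple[str, int, int]]:
    """Return (term, start_position, end_position) for each token."""
    result = []
    position = -1  # assumption: the first token with pos_inc=1 lands on 0
    for token in tokens:
        position += token.pos_inc
        result.append((token.term, position, position + token.pos_length))
    return result


# The token stream from the quoted message: ABCD decompounded into AB + CD,
# with the full compound kept as a synonym at the position of AB.
tokens = [
    Token("AB", pos_inc=1, pos_length=1),
    Token("ABCD", pos_inc=0, pos_length=2),
    Token("CD", pos_inc=1, pos_length=1),
]

for term, start, end in spans(tokens):
    print(f"{term}: positions {start} to {end}")
```

In this model ABCD starts where AB starts and ends where CD ends, which is the information positionLength keeps in the chain.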
I suppose this works best if you actually know the offsets of the pieces. In disassembling German, this is not always straightforward.