Ahoy, ahoy!
I was playing around with something similar for indexing multilingual
documents, Shay. The code is up on GitHub at
https://github.com/whateverdood/cross-lingual-search and needs attention,
but you're welcome to see if anything in there helps. The basic idea is
this:
1. A custom
Hi Hummel,
There was an effort to bring OpenNLP capabilities to Lucene:
https://issues.apache.org/jira/browse/LUCENE-2899
Lance was working on keeping it up to date. But it looks like it is not
always best to accomplish everything inside Lucene.
I personally would do the sentence
If you want tokenization to depend on sentences, and you insist on
staying inside Lucene, you have to write a Tokenizer. Your tokenizer can
set an attribute on the token that ends a sentence. Then, downstream,
filters can read ahead and buffer tokens as needed to reconstruct the
full sentence.
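To make the buffering idea concrete, here is a minimal, hypothetical sketch in plain Python (not Lucene's actual API — a real Lucene implementation would use a TokenStream with a custom Attribute, and all names below are invented for illustration): a tokenizer flags the last token of each sentence, and a downstream filter buffers tokens until it sees that flag, then releases the whole sentence at once.

```python
import re
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class Token:
    text: str
    ends_sentence: bool = False  # stands in for a custom end-of-sentence Attribute


def sentence_tokenizer(text: str) -> Iterator[Token]:
    """Emit word tokens, flagging the final token of each sentence.

    Sentence splitting here is a naive regex on ., !, ? -- real code would
    delegate to something like OpenNLP's sentence detector.
    """
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        words = re.findall(r"\w+", sentence)
        for i, word in enumerate(words):
            yield Token(word, ends_sentence=(i == len(words) - 1))


def sentence_buffering_filter(tokens: Iterable[Token]) -> Iterator[List[Token]]:
    """Downstream filter: buffer tokens until the end-of-sentence flag, then
    emit the buffered sentence as one unit."""
    buf: List[Token] = []
    for tok in tokens:
        buf.append(tok)
        if tok.ends_sentence:
            yield buf
            buf = []
    if buf:  # trailing tokens that never saw a sentence-final flag
        yield buf


for sent in sentence_buffering_filter(sentence_tokenizer("Hello world. How are you?")):
    print([t.text for t in sent])
```

The filter never needs to know how sentences were detected; it only reacts to the flag the tokenizer set upstream, which is exactly the decoupling the attribute approach buys you.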