Re: Text dependent analyzer

2015-04-17 Thread Rich Cariens
Ahoy, ahoy! I was playing around with something similar for indexing multilingual documents, Shay. The code is up on GitHub at https://github.com/whateverdood/cross-lingual-search and needs attention, but you're welcome to see if anything in there helps. The basic idea is this: 1. A custom

Re: Text dependent analyzer

2015-04-17 Thread Ahmet Arslan
Hi Hummel, There was an effort to bring OpenNLP capabilities to Lucene: https://issues.apache.org/jira/browse/LUCENE-2899 Lance was working to keep it up to date. But it looks like it is not always best to do everything inside Lucene. I personally would do the sentence

Re: Text dependent analyzer

2015-04-17 Thread Benson Margulies
If you want tokenization to depend on sentences, and you insist on being inside Lucene, you have to be a Tokenizer. Your tokenizer can set an attribute on the token that ends a sentence. Then, downstream, filters can read ahead through the tokens to get the full sentence, buffering tokens as needed. On
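
A rough sketch of the pattern Benson describes, written in plain Java so it runs without Lucene on the classpath. Everything here is a hypothetical stand-in, not Lucene's API: Token plays the role of a term plus a custom "ends sentence" attribute, tokenize() plays the role of the Tokenizer, and SentenceFilter plays the role of a downstream filter that reads ahead and buffers until it has a whole sentence. The sentence detection (a trailing ., ! or ?) is deliberately naive.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical stand-ins for illustration only; none of these are Lucene classes.
final class SentenceBufferSketch {

    // A term plus a flag, standing in for a token carrying a custom attribute.
    static final class Token {
        final String term;
        final boolean endsSentence;

        Token(String term, boolean endsSentence) {
            this.term = term;
            this.endsSentence = endsSentence;
        }
    }

    // "Tokenizer": splits on whitespace and flags tokens that end in . ! or ?
    // (an assumed, naive sentence rule; a real detector would be smarter).
    static List<Token> tokenize(String text) {
        List<Token> out = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue;
            }
            boolean end = raw.matches(".*[.!?]$");
            String term = raw.replaceAll("[.!?]+$", "");
            if (!term.isEmpty()) {
                out.add(new Token(term, end));
            }
        }
        return out;
    }

    // "Filter": reads ahead on the upstream tokens, buffering them until it
    // sees the sentence-end flag, then hands back the whole sentence at once.
    static final class SentenceFilter implements Iterator<List<Token>> {
        private final Iterator<Token> input;

        SentenceFilter(Iterator<Token> input) {
            this.input = input;
        }

        @Override
        public boolean hasNext() {
            return input.hasNext();
        }

        @Override
        public List<Token> next() {
            if (!input.hasNext()) {
                throw new NoSuchElementException();
            }
            List<Token> buffer = new ArrayList<>();
            while (input.hasNext()) {
                Token t = input.next();
                buffer.add(t);
                if (t.endsSentence) {
                    break; // sentence complete; stop reading ahead
                }
            }
            return buffer;
        }
    }

    // Convenience wrapper: the terms of each buffered sentence, in order.
    static List<List<String>> sentences(String text) {
        List<List<String>> result = new ArrayList<>();
        SentenceFilter filter = new SentenceFilter(tokenize(text).iterator());
        while (filter.hasNext()) {
            List<String> terms = new ArrayList<>();
            for (Token t : filter.next()) {
                terms.add(t.term);
            }
            result.add(terms);
        }
        return result;
    }
}
```

Calling SentenceBufferSketch.sentences("Hello world. How are you? Fine") groups the tokens into three buffers; trailing text with no terminator is flushed as a final, unflagged sentence. In real Lucene code the flag would presumably live in a custom Attribute and the buffering in a TokenFilter, but that wiring is not shown here.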