Hi Rich,

Thank you very much. I understand your solution and will try to do something in that spirit.
Shay

On Fri, Apr 17, 2015 at 8:35 PM Rich Cariens <richcari...@gmail.com> wrote:

> Ahoy, ahoy!
>
> I was playing around with something similar for indexing multi-lingual
> documents, Shay. The code is up on github
> <https://github.com/whateverdood/cross-lingual-search> and needs
> attention, but you're welcome to see if anything in there helps. The basic
> idea is this:
>
>    1. A custom CharFilter uses the ICU4J sentence BreakIterator to get
>       sentences out of the char stream.
>       1. Each sentence is lang-id'd using the cybozu Detector.
>       2. A ThreadLocal (ugh) is updated with the languages and their
>          offsets (where a run of a particular language ends).
>    2. A custom Filter then marks each token with its language (relying
>       on that ThreadLocal) if possible, so the next custom Filter
>    3. ...checks the token's language and recruits the appropriate stemmer.
>    4. Other Filters like ICUFoldingFilter kick in to do their thing.
>
> Does this help at all?
>
> On Fri, Apr 17, 2015 at 1:06 PM, Benson Margulies <ben...@basistech.com>
> wrote:
>
>> If you want tokenization to depend on sentences, and you insist on
>> being inside Lucene, you have to be a Tokenizer. Your tokenizer can
>> set an attribute on the token that ends a sentence. Then, downstream,
>> filters can read ahead to get the full sentence and buffer
>> tokens as needed.
>>
>> On Fri, Apr 17, 2015 at 1:00 PM, Ahmet Arslan <iori...@yahoo.com.invalid>
>> wrote:
>> > Hi Hummel,
>> >
>> > There was an effort to bring open-nlp capabilities to Lucene:
>> > https://issues.apache.org/jira/browse/LUCENE-2899
>> >
>> > Lance was working on it to keep it up-to-date. But it looks like it is
>> > not always best to accomplish all things inside Lucene.
>> > I personally would do the sentence detection outside of Lucene.
>> >
>> > By the way, I remember there was a way to consume the entire upstream
>> > token stream.
>> >
>> > I think it was consuming all input and injecting one huge concatenated
>> > term/token.
>> >
>> > KeywordTokenizer has similar behaviour. It injects a single token.
>> > http://lucene.apache.org/core/3_0_3/api/all/org/apache/lucene/analysis/KeywordAnalyzer.html
>> >
>> > Ahmet
>> >
>> > On Wednesday, April 15, 2015 3:12 PM, Shay Hummel
>> > <shay.hum...@gmail.com> wrote:
>> > Hi Ahmet,
>> > Thank you for the reply.
>> > That's exactly what I am doing. At the moment, to index a document, I
>> > break it into sentences, and each sentence is analyzed (lemmatizing,
>> > stopword removal etc.)
>> > Now, what I am looking for is a way to create an analyzer (a class which
>> > extends Lucene's Analyzer). This analyzer will be used for index and
>> > query processing. It (like the English analyzer) will receive the text
>> > and produce tokens.
>> > The API of Analyzer requires implementing createComponents, which
>> > does not depend on the text being analyzed. This is problematic since,
>> > as you know, the OpenNLP sentence breaking depends on the text it gets
>> > (OpenNLP uses the model files to provide the spans of each sentence and
>> > then breaks them).
>> > Is there a way around it?
>> >
>> > Shay
>> >
>> > On Wed, Apr 15, 2015 at 3:50 AM Ahmet Arslan <iori...@yahoo.com.invalid>
>> > wrote:
>> >
>> >> Hi Hummel,
>> >>
>> >> You can perform sentence detection outside of Solr, using OpenNLP for
>> >> instance, and then feed the sentences to Solr.
>> >>
>> >> https://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html#tools.sentdetect
>> >>
>> >> Ahmet
>> >>
>> >> On Tuesday, April 14, 2015 8:12 PM, Shay Hummel
>> >> <shay.hum...@gmail.com> wrote:
>> >> Hi
>> >> I would like to create a text-dependent analyzer.
>> >> That is, *given a string*, the analyzer will:
>> >> 1. Read the entire text and break it into sentences.
>> >> 2. Each sentence will then be tokenized, have possessives removed, be
>> >> lowercased, have terms marked, and be stemmed.
>> >>
>> >> The second part is essentially what happens in the English analyzer
>> >> (createComponents). However, this does not depend on the text it
>> >> receives, which is the first part of what I am trying to do.
>> >>
>> >> So ... how can it be achieved?
>> >>
>> >> Thank you,
>> >>
>> >> Shay Hummel
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
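
A minimal sketch of the sentence-splitting step discussed in this thread, done outside Lucene as Ahmet suggests. It assumes the JDK's java.text.BreakIterator as a stand-in for the ICU4J sentence BreakIterator or the OpenNLP sentence detector mentioned above. The class name SentenceSplitter and its split method are hypothetical; they are not part of Lucene or of Rich's repository.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: breaks raw text into sentences before any Lucene analysis.
final class SentenceSplitter {

    // Returns the trimmed, non-empty sentences found in the text.
    // Assumption: English sentence rules from the JDK are good enough here;
    // a model-based detector such as OpenNLP could be swapped in instead.
    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        // Walk consecutive sentence boundaries until the iterator is exhausted.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Each sentence printed here could instead be handed to an ordinary Analyzer.
        for (String sentence : split("Hello world. How are you?")) {
            System.out.println(sentence);
        }
    }
}
```

Each returned sentence can then go through a regular, text-independent Analyzer, so createComponents never needs to see the whole document.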