Re: Text dependent analyzer

Shay Hummel Mon, 20 Apr 2015 07:31:42 -0700

Hi Rich

Thank you very much,
I understand your solution and will try to do something in that spirit.


Shay

On Fri, Apr 17, 2015 at 8:35 PM Rich Cariens <[email protected]> wrote:

> Ahoy, ahoy!
>
> I was playing around with something similar for indexing multi-lingual
> documents, Shay. The code is up on github
> <https://github.com/whateverdood/cross-lingual-search> and needs
> attention, but you're welcome to see if anything in there helps. The basic
> idea is this:
>
>    1. A custom CharFilter uses the ICU4J sentence BreakIterator to get
>    sentences out of the char stream.
>       1. Each sentence is lang-id'd using the cybozu Detector.
>       2. A ThreadLocal (ugh) is updated with the languages and their
>       offsets (where a run of a particular language ends).
>    2. A custom Filter then marks each token with its language (relying
>    on that ThreadLocal) if possible, so the next custom Filter...
>    3. ...checks the token's language and recruits the appropriate stemmer.
>    4. Other Filters like ICUFoldingFilter kick in to do their thing.
>
> Does this help at all?
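
The first step of the pipeline above (sentence segmentation, per-sentence language id, and a record of where each language run ends) can be sketched roughly as follows. This is not the code from the linked repository: it swaps in the JDK's java.text.BreakIterator for the ICU4J one, takes the language detector as a plain function instead of calling the cybozu Detector, and returns the runs as a list rather than stashing them in a ThreadLocal. The names LanguageRunSketch, Run, languageRuns and languageAt are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

final class LanguageRunSketch {

    /** A run of text in one language; end is the exclusive char offset where the run stops. */
    static final class Run {
        final String lang;
        final int end;
        Run(String lang, int end) { this.lang = lang; this.end = end; }
    }

    /**
     * Splits text into sentences, asks the detector (an assumed stand-in for a
     * real language identifier) for each sentence's language, and merges
     * adjacent sentences that share a language into a single run.
     */
    static List<Run> languageRuns(String text, Function<String, String> detector) {
        BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        sentences.setText(text);
        List<Run> runs = new ArrayList<>();
        int start = sentences.first();
        for (int end = sentences.next(); end != BreakIterator.DONE;
                start = end, end = sentences.next()) {
            String lang = detector.apply(text.substring(start, end));
            int last = runs.size() - 1;
            if (last >= 0 && runs.get(last).lang.equals(lang)) {
                runs.set(last, new Run(lang, end)); // same language: extend the current run
            } else {
                runs.add(new Run(lang, end));       // language changed: start a new run
            }
        }
        return runs;
    }

    /** What a downstream filter would do: look up the language for a token's start offset. */
    static String languageAt(List<Run> runs, int offset) {
        for (Run r : runs) {
            if (offset < r.end) {
                return r.lang;
            }
        }
        return null; // offset lies past the last run
    }
}
```

A downstream filter would then call something like languageAt(runs, tokenStartOffset) for each token and pick a stemmer based on the result.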
>
> On Fri, Apr 17, 2015 at 1:06 PM, Benson Margulies <[email protected]>
> wrote:
>
>> If you want tokenization to depend on sentences, and you insist on
>> being inside Lucene, you have to be a Tokenizer. Your tokenizer can
>> set an attribute on the token that ends a sentence. Then, downstream,
>> filters can read ahead to get the full sentence, buffering tokens
>> as needed.
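
The buffering idea above can be sketched without any Lucene classes, just to show the control flow. Everything here is a simplified stand-in: Token is a hypothetical type whose endsSentence flag plays the role of the attribute set by the tokenizer, tokenize is a toy whitespace splitter that treats '.', '!' and '?' as sentence terminators, and groupBySentence plays the role of the downstream filter. A real implementation would live in a Tokenizer and a TokenFilter and use a custom attribute.

```java
import java.util.ArrayList;
import java.util.List;

final class SentenceBufferSketch {

    /** Hypothetical token: its text plus a flag marking the last token of a sentence. */
    static final class Token {
        final String text;
        final boolean endsSentence;
        Token(String text, boolean endsSentence) {
            this.text = text;
            this.endsSentence = endsSentence;
        }
    }

    /** Toy tokenizer: splits on whitespace and flags tokens that end in '.', '!' or '?'. */
    static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        for (String raw : input.trim().split("\\s+")) {
            String word = raw.replaceAll("[.!?]+$", "");
            if (word.isEmpty()) {
                continue; // empty input or bare punctuation: nothing to emit
            }
            tokens.add(new Token(word, word.length() < raw.length()));
        }
        return tokens;
    }

    /** Stand-in for the downstream filter: buffers tokens until the flag, then emits the sentence. */
    static List<List<String>> groupBySentence(List<Token> tokens) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> buffer = new ArrayList<>();
        for (Token t : tokens) {
            buffer.add(t.text);
            if (t.endsSentence) {
                sentences.add(buffer);
                buffer = new ArrayList<>();
            }
        }
        if (!buffer.isEmpty()) {
            sentences.add(buffer); // trailing text with no terminator
        }
        return sentences;
    }
}
```

Once a full sentence is buffered, whatever sentence-level processing is needed can run on it before the tokens are released downstream.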
>>
>>
>>
>> On Fri, Apr 17, 2015 at 1:00 PM, Ahmet Arslan <[email protected]>
>> wrote:
>> > Hi Hummel,
>> >
>> > There was an effort to bring open-nlp capabilities to Lucene:
>> > https://issues.apache.org/jira/browse/LUCENE-2899
>> >
>> > Lance was working on it to keep it up to date. But it looks like it is
>> > not always best to accomplish everything inside Lucene.
>> > I personally would do the sentence detection outside of Lucene.
>> >
>> > By the way, I remember there was a way to consume the entire upstream
>> > token stream.
>> >
>> > I think it consumed all input and injected one huge concatenated
>> > term/token.
>> >
>> > KeywordTokenizer has similar behaviour. It injects a single token.
>> >
>> > http://lucene.apache.org/core/3_0_3/api/all/org/apache/lucene/analysis/KeywordAnalyzer.html
>> >
>> > Ahmet
>> >
>> >
>> > On Wednesday, April 15, 2015 3:12 PM, Shay Hummel <[email protected]> wrote:
>> > Hi Ahmet,
>> > Thank you for the reply.
>> > That's exactly what I am doing. At the moment, to index a document, I
>> > break it into sentences, and each sentence is analyzed (lemmatization,
>> > stopword removal, etc.).
>> > Now, what I am looking for is a way to create an analyzer (a class that
>> > extends Lucene's Analyzer). This analyzer will be used for both index
>> > and query processing. Like the English analyzer, it will receive the
>> > text and produce tokens.
>> > The Analyzer API requires implementing createComponents, which does
>> > not depend on the text being analyzed. This is problematic because, as
>> > you know, OpenNLP sentence breaking depends on the text it gets (OpenNLP
>> > uses the model files to provide the span of each sentence and then
>> > breaks the text at those spans).
>> > Is there a way around it?
>> >
>> > Shay
>> >
>> >
>> > On Wed, Apr 15, 2015 at 3:50 AM Ahmet Arslan <[email protected]> wrote:
>> >
>> >> Hi Hummel,
>> >>
>> >> You can perform sentence detection outside of Solr, using OpenNLP for
>> >> instance, and then feed the sentences to Solr.
>> >>
>> >>
>> >> https://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html#tools.sentdetect
>> >>
>> >> Ahmet
>> >>
>> >>
>> >>
>> >>
>> >> On Tuesday, April 14, 2015 8:12 PM, Shay Hummel <[email protected]> wrote:
>> >> Hi
>> >> I would like to create a text dependent analyzer.
>> >> That is, *given a string*, the analyzer will:
>> >> 1. Read the entire text and break it into sentences.
>> >> 2. Each sentence will then be tokenized, have possessives removed, be
>> >> lowercased, have its terms marked, and be stemmed.
>> >>
>> >> The second part is essentially what happens in the English analyzer
>> >> (createComponents). However, that does not depend on the text it
>> >> receives, which is the first part of what I am trying to do.
>> >>
>> >> So ... How can it be achieved?
>> >>
>> >> Thank you,
>> >>
>> >> Shay Hummel
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: [email protected]
>> >> For additional commands, e-mail: [email protected]
>> >>
>> >>
>
