Re: Text dependent analyzer

2015-04-17 Thread Rich Cariens
Ahoy, ahoy!

I was playing around with something similar for indexing multilingual
documents, Shay. The code is up on GitHub at
https://github.com/whateverdood/cross-lingual-search and needs attention,
but you're welcome to see if anything in there helps. The basic idea is
this:

   1. A custom CharFilter uses the ICU4J sentence BreakIterator to get
   sentences out of the char stream.
   2. Each sentence is language-identified using the cybozu Detector, and a
   ThreadLocal (ugh) is updated with the languages and their offsets
   (where a run of a particular language ends).
   3. A custom Filter then marks each token with its language (relying on
   that ThreadLocal) if possible, so that the next custom Filter...
   4. ...can check the token's language and recruit the appropriate stemmer.
   5. Other Filters like ICUFoldingFilter kick in to do their thing.
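For what it's worth, the ThreadLocal handoff above boils down to recording where each language run ends and then looking each token up by its start offset. A minimal sketch using only the JDK (the class and method names here are mine, not from the repo):

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch of the language-run bookkeeping: map the (exclusive) end
// offset of each language run to its language code, so a token can be
// looked up by its start offset.
public class LanguageRuns {
    // ThreadLocal because the CharFilter and the downstream Filter run
    // on the same thread but hold no reference to each other.
    private static final ThreadLocal<TreeMap<Integer, String>> RUNS =
            ThreadLocal.withInitial(TreeMap::new);

    public static void record(int endOffset, String lang) {
        RUNS.get().put(endOffset, lang);
    }

    // Language of the run covering this offset, or null if none.
    public static String languageAt(int offset) {
        Map.Entry<Integer, String> run = RUNS.get().higherEntry(offset);
        return run == null ? null : run.getValue();
    }

    public static void clear() {
        RUNS.get().clear();
    }
}
```

A Filter would call languageAt(startOffset) per token to pick a stemmer, and the analyzer would need to clear() between documents to avoid leaking state across reuses.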

Does this help at all?

On Fri, Apr 17, 2015 at 1:06 PM, Benson Margulies ben...@basistech.com
wrote:

 If you want tokenization to depend on sentences, and you insist on
 being inside Lucene, you have to be a Tokenizer. Your tokenizer can
 set an attribute on the token that ends a sentence. Then, downstream,
 filters can read ahead to get the full sentence and buffer
 tokens as needed.




 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




Re: Text dependent analyzer

2015-04-17 Thread Ahmet Arslan
Hi Hummel,

There was an effort to bring OpenNLP capabilities to Lucene:
https://issues.apache.org/jira/browse/LUCENE-2899

Lance was working on keeping it up to date, but it looks like it is not
always best to accomplish everything inside Lucene.
I personally would do the sentence detection outside of Lucene.

By the way, I remember there was a way to consume the entire upstream token
stream.

I think it consumed all of the input and injected one huge concatenated
term/token.

KeywordTokenizer has similar behaviour: it injects a single token.
http://lucene.apache.org/core/3_0_3/api/all/org/apache/lucene/analysis/KeywordAnalyzer.html
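The trick being recalled here (drain everything upstream, inject one concatenated token) is easy to picture outside of Lucene. A plain-Java illustration of just the concatenation step (class and method names are made up; a real version would be a TokenFilter buffering inside incrementToken):

```java
import java.util.Iterator;

// Illustration of "consume all input, inject one huge token": drain the
// upstream term sequence and emit a single joined term, much like
// KeywordTokenizer emits its entire input as one token.
public class ConcatenatingFilter {
    public static String concatenate(Iterator<String> upstream, String separator) {
        StringBuilder sb = new StringBuilder();
        while (upstream.hasNext()) {
            if (sb.length() > 0) sb.append(separator);
            sb.append(upstream.next());
        }
        return sb.toString();
    }
}
```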

Ahmet


On Wednesday, April 15, 2015 3:12 PM, Shay Hummel shay.hum...@gmail.com wrote:
Hi Ahmet,
Thank you for the reply.
That's exactly what I am doing. At the moment, to index a document, I break
it into sentences, and each sentence is analyzed (lemmatization, stopword
removal, etc.).
Now, what I am looking for is a way to create an analyzer (a class which
extends Lucene's Analyzer). This analyzer will be used for index and query
processing. Like the EnglishAnalyzer, it will receive the text and
produce tokens.
The Analyzer API requires implementing createComponents, which is not
dependent on the text being analyzed. This is problematic since, as you
know, OpenNLP sentence breaking depends on the text it gets (OpenNLP uses
model files to provide the spans of each sentence and then break them).
Is there a way around it?

Shay


On Wed, Apr 15, 2015 at 3:50 AM Ahmet Arslan iori...@yahoo.com.invalid
wrote:

 Hi Hummel,

 You can perform sentence detection outside of Solr, using OpenNLP for
 instance, and then feed the sentences to Solr.

 https://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html#tools.sentdetect
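OpenNLP's sentence detector needs a trained model file; purely to show the shape of doing the splitting outside Solr, here is a rough stand-in using the JDK's java.text.BreakIterator instead (not a substitute for a proper model; the class and method names are mine):

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Rough sketch: split text into sentences before it ever reaches
// Solr/Lucene, using the JDK's BreakIterator instead of OpenNLP.
public class SentenceSplitter {
    public static List<String> sentences(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                out.add(sentence);
            }
        }
        return out;
    }
}
```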

 Ahmet




 On Tuesday, April 14, 2015 8:12 PM, Shay Hummel shay.hum...@gmail.com
 wrote:
 Hi,
 I would like to create a text dependent analyzer.
 That is, *given a string*, the analyzer will:
 1. Read the entire text and break it into sentences.
 2. Tokenize each sentence: remove possessives, lowercase, mark terms,
 and stem.

 The second part is essentially what happens in the EnglishAnalyzer
 (createComponents). However, that is not dependent on the text it
 receives, which is the first part of what I am trying to do.

 So... how can it be achieved?

 Thank you,

 Shay Hummel




Re: Text dependent analyzer

2015-04-17 Thread Benson Margulies
If you want tokenization to depend on sentences, and you insist on
being inside Lucene, you have to be a Tokenizer. Your tokenizer can
set an attribute on the token that ends a sentence. Then, downstream,
filters can read ahead to get the full sentence and buffer
tokens as needed.
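To make the read-ahead concrete: the downstream filter buffers tokens until it sees the end-of-sentence attribute, hands the complete sentence to whatever per-sentence processing it wants, then replays the tokens. Stripped of the Lucene attribute machinery, the buffering pattern looks roughly like this (plain-Java illustration, not real TokenFilter code; the names are mine):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

// Illustration of the buffering pattern: tokens arrive one at a time,
// each flagged if it ends a sentence; buffer until the flag, then
// release the whole sentence at once.
public class SentenceBuffer {
    public static class Token {
        public final String term;
        public final boolean endsSentence; // the attribute the Tokenizer would set
        public Token(String term, boolean endsSentence) {
            this.term = term;
            this.endsSentence = endsSentence;
        }
    }

    // Groups the upstream token sequence into complete sentences.
    public static List<List<String>> bySentence(Iterator<Token> upstream) {
        List<List<String>> sentences = new ArrayList<>();
        Deque<String> buffer = new ArrayDeque<>();
        while (upstream.hasNext()) {
            Token t = upstream.next();
            buffer.add(t.term);
            if (t.endsSentence) {
                sentences.add(new ArrayList<>(buffer));
                buffer.clear();
            }
        }
        // Flush a trailing partial sentence, if any.
        if (!buffer.isEmpty()) {
            sentences.add(new ArrayList<>(buffer));
        }
        return sentences;
    }
}
```

A real TokenFilter would do the same buffering inside incrementToken, emitting buffered tokens one at a time once the sentence is complete.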


