RE: N-grams with numbers and Shinglefilters

2009-03-02 Thread Steven A Rowe
Hi Raymond,

On 3/1/2009, Raymond Balmès wrote:
 I'm trying to index ( search later) documents that contain tri-grams
 however they have the following form:
 
 string 2 digit 2 digit
 
 Does the ShingleFilter work with numbers in the match ?

Yes, though it is the tokenizer and previous filters in the chain that will be 
the (potential) source of difficulties, not ShingleFilter.

 Another complication, in future features I'd like to add optional
 digits like
 
 [1 digit] string 2 digit 2 digit
 
 I suppose the ShingleFilter won't do it ?

ShingleFilter just pastes together the tokens produced by the previous 
component in the analysis chain, in a sliding window.  As currently written, it 
doesn't provide the sort of functionality you seem to be asking for.

 Any better advice ?

What do your documents look like?  What do you hope to accomplish using 
ShingleFilter?  It's tough to give advice without knowing what you want to do.

Steve


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: N-grams with numbers and Shinglefilters

2009-03-02 Thread Raymond Balmès
Well,

In the mean time I've looked at the details of the implementation and it
gave me an idea for what I'm looking for :

suppose I have a tri-gram, what I want to do is index the tri-gram string
digit1 digit2 as one indexing phrase, and not index each token separately.

In the shingler filter, if I understood it correctly, tokens are separated
by '_' whilst n-grams are separated  by  , that is the mechanism which I
was missing. And of course I need my logic around to filter valid tri-grams
but I don't need help for this, I can easily do that using regex for
instance.

My documents look like regular html or pdf pages although some of them
contains those specific tri-grams.

Thx,


-RB-





On Mon, Mar 2, 2009 at 2:37 PM, Steven A Rowe sar...@syr.edu wrote:

 Hi Raymond,

 On 3/1/2009, Raymond Balmès wrote:
  I'm trying to index ( search later) documents that contain tri-grams
  however they have the following form:
 
  string 2 digit 2 digit
 
  Does the ShingleFilter work with numbers in the match ?

 Yes, though it is the tokenizer and previous filters in the chain that will
 be the (potential) source of difficulties, not ShingleFilter.

  Another complication, in future features I'd like to add optional
  digits like
 
  [1 digit] string 2 digit 2 digit
 
  I suppose the ShingleFilter won't do it ?

 ShingleFilter just pastes together the tokens produced by the previous
 component in the analysis chain, in a sliding window.  As currently written,
 it doesn't provide the sort of functionality you seem to be asking for.

  Any better advice ?

 What do your documents look like?  What do you hope to accomplish using
 ShingleFilter?  It's tough to give advice without knowing what you want to
 do.

 Steve


 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




RE: N-grams with numbers and Shinglefilters

2009-03-02 Thread Steven A Rowe
Hi Raymond,

On 3/2/2009 at 10:09 AM, Raymond Balmès wrote:
 suppose I have a tri-gram, what I want to do is index the tri-gram
 string digit1 digit2 as one indexing phrase, and not index each token
 separately.

As long as you don't want any transformation performed on the phrase or its 
components, you can add your phrase as a keyword, i.e. a non-analyzed string 
that will be indexed as-is.

Unless your phrase field will be the only field on this document (pretty 
unlikely), you'll want to use PerFieldAnalyzerWrapper[1] over 
KeywordAnalyzer[2] for the phrase field, and whatever other analyzer you like 
for the other document field(s).

AFAICT, you don't need ShingleFilter.

Steve

[1] PerFieldAnalyzerWrapper:  
http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/PerFieldAnalyzerWrapper.html
[2] KeywordAnalyzer: 
http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/KeywordAnalyzer.html


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org