Well, in the meantime I've looked at the details of the implementation, and it gave me an idea for what I'm looking for:
Suppose I have a tri-gram. What I want to do is index the tri-gram "string digit1 digit2" as one indexing phrase, rather than indexing each token separately. In ShingleFilter, if I understood it correctly, the tokens within a shingle are joined by " ", while "_" is used as the filler token for gaps; that is the mechanism I was missing. And of course I need my own logic around it to filter valid tri-grams, but I don't need help with that — I can easily do it using a regex, for instance. My documents look like regular HTML or PDF pages, although some of them contain those specific tri-grams.

Thx,

-RB-

On Mon, Mar 2, 2009 at 2:37 PM, Steven A Rowe <sar...@syr.edu> wrote:
> Hi Raymond,
>
> On 3/1/2009, Raymond Balmès wrote:
> > I'm trying to index (& search later) documents that contain tri-grams;
> > however, they have the following form:
> >
> > <string> <2 digit> <2 digit>
> >
> > Does the ShingleFilter work with numbers in the match?
>
> Yes, though it is the tokenizer and previous filters in the chain that will
> be the (potential) source of difficulties, not ShingleFilter.
>
> > Another complication: in future features I'd like to add optional
> > digits, like
> >
> > [<1 digit>] <string> <2 digit> <2 digit>
> >
> > I suppose the ShingleFilter won't do it?
>
> ShingleFilter just pastes together the tokens produced by the previous
> component in the analysis chain, in a sliding window. As currently written,
> it doesn't provide the sort of functionality you seem to be asking for.
>
> > Any better advice?
>
> What do your documents look like? What do you hope to accomplish using
> ShingleFilter? It's tough to give advice without knowing what you want to
> do.
>
> Steve
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
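For what it's worth, the regex-based validation of tri-grams mentioned above could be sketched roughly as follows. This is only an illustration: the class name, method name, and exact pattern are my own assumptions about the "[<1 digit>] <string> <2 digit> <2 digit>" format, not anything Lucene provides.

```java
import java.util.regex.Pattern;

public class TrigramMatcher {
    // Hypothetical pattern for the forms described in the thread:
    // an optional single leading digit, a word, then two 2-digit groups,
    // e.g. "foo 12 34" or "5 foo 12 34".
    private static final Pattern TRIGRAM =
        Pattern.compile("(?:\\d\\s+)?[A-Za-z]+\\s+\\d{2}\\s+\\d{2}");

    // Returns true only if the whole candidate shingle matches the pattern.
    public static boolean isValidTrigram(String candidate) {
        return TRIGRAM.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidTrigram("foo 12 34"));   // true
        System.out.println(isValidTrigram("5 foo 12 34")); // true (optional leading digit)
        System.out.println(isValidTrigram("foo bar"));     // false
    }
}
```

Such a check could run over the shingles ShingleFilter emits (they join their constituent tokens with " "), discarding any shingle that is not a valid tri-gram before it reaches the index.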