Well,

In the meantime I've looked at the details of the implementation, and it
gave me an idea for what I'm looking for:

Suppose I have a tri-gram: what I want to do is index the tri-gram "string
digit1 digit2" as one indexing phrase, not index each token separately.

In the ShingleFilter, if I understood it correctly, tokens are separated
by '_' while n-grams are separated by " "; that is the mechanism I was
missing. And of course I need my own logic around it to filter valid
tri-grams, but I don't need help with that; I can easily do it using a
regex, for instance.
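
Here is the kind of regex logic I mean, as a small TokenFilter wrapped
around the ShingleFilter output. Again only an untested sketch:
TrigramPatternFilter and the pattern are my own placeholders, and the
TermAttribute API again depends on the Lucene version:

import java.io.IOException;
import java.util.regex.Pattern;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

/** Keeps only shingles that look like "<string> <2 digit> <2 digit>". */
public final class TrigramPatternFilter extends TokenFilter {
  private static final Pattern VALID =
      Pattern.compile("\\p{Alpha}+ \\d{2} \\d{2}");
  private final TermAttribute term = addAttribute(TermAttribute.class);

  public TrigramPatternFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    while (input.incrementToken()) { // pull shingles from ShingleFilter
      if (VALID.matcher(term.term()).matches()) {
        return true;                 // keep valid tri-gram phrases
      }
    }
    return false;                    // discard everything else
  }
}

I would just chain it as new TrigramPatternFilter(new ShingleFilter(tokenizer, 3))
in my analyzer.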

My documents look like regular HTML or PDF pages, although some of them
contain those specific tri-grams.

Thx,


-RB-

On Mon, Mar 2, 2009 at 2:37 PM, Steven A Rowe <sar...@syr.edu> wrote:

> Hi Raymond,
>
> On 3/1/2009, Raymond Balmès wrote:
> > I'm trying to index (& search later) documents that contain tri-grams
> > however they have the following form:
> >
> > <string> <2 digit> <2 digit>
> >
> > Does the ShingleFilter work with numbers in the match ?
>
> Yes, though it is the tokenizer and previous filters in the chain that will
> be the (potential) source of difficulties, not ShingleFilter.
>
> > Another complication, in future features I'd like to add optional
> > digits like
> >
> > [<1 digit>] <string> <2 digit> <2 digit>
> >
> > I suppose the ShingleFilter won't do it ?
>
> ShingleFilter just pastes together the tokens produced by the previous
> component in the analysis chain, in a sliding window.  As currently written,
> it doesn't provide the sort of functionality you seem to be asking for.
>
> > Any better advice ?
>
> What do your documents look like?  What do you hope to accomplish using
> ShingleFilter?  It's tough to give advice without knowing what you want to
> do.
>
> Steve
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>
