RE: N-grams with numbers and Shinglefilters
Hi Raymond,

On 3/1/2009, Raymond Balmès wrote:
> I'm trying to index (and search later) documents that contain
> tri-grams however they have the following form:
>
> string 2 digit 2 digit
>
> Does the ShingleFilter work with numbers in the match ?

Yes, though it is the tokenizer and previous filters in the chain that will be the (potential) source of difficulties, not ShingleFilter.

> Another complication, in future features I'd like to add optional
> digits like
>
> [1 digit] string 2 digit 2 digit
>
> I suppose the ShingleFilter won't do it ?

ShingleFilter just pastes together the tokens produced by the previous component in the analysis chain, in a sliding window. As currently written, it doesn't provide the sort of functionality you seem to be asking for.

> Any better advice ?

What do your documents look like? What do you hope to accomplish using ShingleFilter? It's tough to give advice without knowing what you want to do.

Steve

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
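To make the "sliding window" description above concrete, here is a minimal sketch in plain Java. It is not Lucene's ShingleFilter and does not use the Lucene API; the class name ShingleSketch, the method shingles, and the choice of separator are all hypothetical and picked only for illustration. It shows that a window over already-produced tokens treats numeric tokens like any other token, so whether "12" and "34" survive depends on the tokenizer that ran before.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a plain-Java sliding window over tokens.
// This is NOT Lucene's ShingleFilter; all names here are hypothetical.
class ShingleSketch {

    // Joins every run of `size` consecutive tokens with `separator`.
    // The tokens are assumed to come from an earlier tokenization step.
    static List<String> shingles(List<String> tokens, int size, String separator) {
        if (size < 1) {
            throw new IllegalArgumentException("size must be >= 1");
        }
        List<String> out = new ArrayList<>();
        // Slide the window one token at a time; stop when it no longer fits.
        for (int start = 0; start + size <= tokens.size(); start++) {
            out.add(String.join(separator, tokens.subList(start, start + size)));
        }
        return out;
    }
}
```

Calling ShingleSketch.shingles with the tokens foo, 12, 34, bar, a size of 3 and a single space as separator yields two shingles, "foo 12 34" and "12 34 bar". Note that every window is emitted, valid tri-gram or not; nothing in the window logic knows about the "string, 2 digits, 2 digits" shape.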
Re: N-grams with numbers and Shinglefilters
Well,

In the meantime I've looked at the details of the implementation and it gave me an idea for what I'm looking for: suppose I have a tri-gram, what I want to do is index the tri-gram "string digit1 digit2" as one indexing phrase, and not index each token separately.

In the ShingleFilter, if I understood it correctly, tokens are separated by '_' whilst n-grams are separated by , that is the mechanism which I was missing.

And of course I need my own logic around it to filter valid tri-grams, but I don't need help with this; I can easily do that using regex, for instance.

My documents look like regular HTML or PDF pages, although some of them contain those specific tri-grams.

Thx,
-RB-

On Mon, Mar 2, 2009 at 2:37 PM, Steven A Rowe sar...@syr.edu wrote:
> What do your documents look like? What do you hope to accomplish
> using ShingleFilter?
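The regex-based filtering of valid tri-grams mentioned above could look roughly like the following plain-Java sketch. The thread does not define the format precisely, so several things here are assumptions: that "string" means a run of letters, that the parts are separated by whitespace, and that the optional leading part is a single digit. The names TrigramSketch, TRIGRAM and findTrigrams are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: pulls "[1 digit] string 2-digit 2-digit" phrases out of text.
// Assumes "string" is letters only and that parts are whitespace-separated.
class TrigramSketch {

    static final Pattern TRIGRAM = Pattern.compile(
        "(?<![\\p{L}\\p{N}])"      // must not start in the middle of a word or number
        + "(?:(\\d)\\s+)?"         // optional leading single digit
        + "(\\p{L}+)\\s+"          // the string part
        + "(\\d{2})\\s+(\\d{2})"   // two 2-digit numbers
        + "(?![\\p{L}\\p{N}])");   // must not end in the middle of a word or number

    // Returns each matched phrase with runs of whitespace collapsed to one space,
    // so the result can be indexed as a single term.
    static List<String> findTrigrams(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = TRIGRAM.matcher(text);
        while (m.find()) {
            found.add(m.group().replaceAll("\\s+", " "));
        }
        return found;
    }
}
```

The lookbehind and lookahead keep a phrase such as "abc 123 45" from matching on a two-digit prefix of a longer number. Each returned phrase is a single string, which is the form needed if it is to be indexed as one term rather than as three tokens.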
RE: N-grams with numbers and Shinglefilters
Hi Raymond,

On 3/2/2009 at 10:09 AM, Raymond Balmès wrote:
> suppose I have a tri-gram, what I want to do is index the tri-gram
> string digit1 digit2 as one indexing phrase, and not index each
> token separately.

As long as you don't want any transformation performed on the phrase or its components, you can add your phrase as a keyword, i.e. a non-analyzed string that will be indexed as-is.

Unless your phrase field will be the only field on this document (pretty unlikely), you'll want to use PerFieldAnalyzerWrapper[1] over KeywordAnalyzer[2] for the phrase field, and whatever other analyzer you like for the other document field(s).

AFAICT, you don't need ShingleFilter.

Steve

[1] PerFieldAnalyzerWrapper: http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/PerFieldAnalyzerWrapper.html
[2] KeywordAnalyzer: http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/KeywordAnalyzer.html
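The idea of per-field analysis with a keyword field can be modelled in plain Java as below. This is a conceptual sketch only: it does not call Lucene, and PerFieldSketch, keyword, simple, addAnalyzer and analyze are hypothetical names chosen to mirror the roles described above (a default analyzer for most fields, and an as-is "keyword" treatment for the phrase field). For the real classes and their actual signatures, see the Javadoc linked at [1] and [2].

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Conceptual model only, NOT the Lucene API: dispatch analysis by field name.
class PerFieldSketch {

    // "Keyword" treatment: the whole value becomes one term, unchanged.
    static List<String> keyword(String value) {
        return Collections.singletonList(value);
    }

    // A stand-in default analyzer: lowercase, split on anything
    // that is not a letter or a digit, drop empty pieces.
    static List<String> simple(String value) {
        List<String> tokens = new ArrayList<>();
        for (String t : value.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    private final Function<String, List<String>> defaultAnalyzer;
    private final Map<String, Function<String, List<String>>> perField = new HashMap<>();

    PerFieldSketch(Function<String, List<String>> defaultAnalyzer) {
        this.defaultAnalyzer = defaultAnalyzer;
    }

    // Registers a field-specific analyzer that overrides the default.
    void addAnalyzer(String field, Function<String, List<String>> analyzer) {
        perField.put(field, analyzer);
    }

    // Uses the field's own analyzer if one was registered, else the default.
    List<String> analyze(String field, String value) {
        return perField.getOrDefault(field, defaultAnalyzer).apply(value);
    }
}
```

With PerFieldSketch::simple as the default and PerFieldSketch::keyword registered for a field named "phrase", the value "abc 12 34" stays one term in the phrase field, while the same text in any other field is split into abc, 12 and 34. Searching the phrase field then requires an exact match on the whole string, which is the trade-off of indexing it as-is.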