Thanks for the responses. They've given me much food for thought.

-----Original Message-----
From: Steven A Rowe [mailto:sar...@syr.edu]
Sent: 20 Sep 2012 02:19
To: java-user@lucene.apache.org
Subject: RE: Using stop words with snowball analyzer and shingle filter

Hi Martin,

SnowballAnalyzer was deprecated in Lucene 3.0.3 and will be removed in Lucene 5.0.

Looks like you're using Lucene 3.X; here's an (untested) Analyzer, based on Lucene 3.6 EnglishAnalyzer (except substituting SnowballFilter for PorterStemFilter; disabling stopword holes' position increments; and adding ShingleFilter), that should basically do what you want:

------
// Placeholder: substitute the Version constant for your 3.x release.
// Declared final (and before first use) so the anonymous class can capture it.
final Version matchVersion = Version.LUCENE_3X;

String[] stopWords = new String[] { ... };
final Set<?> stopSet = StopFilter.makeStopSet(matchVersion, stopWords);

String[] stemExclusions = new String[] { ... };
// A wildcard type cannot be instantiated or added to, so use Set<String>.
final Set<String> stemExclusionsSet = new HashSet<String>(Arrays.asList(stemExclusions));

Analyzer analyzer = new ReusableAnalyzerBase() {
  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    final Tokenizer source = new StandardTokenizer(matchVersion, reader);
    TokenStream result = new StandardFilter(matchVersion, source);
    // prior to this we get the classic behavior, standardfilter does it for us.
    if (matchVersion.onOrAfter(Version.LUCENE_31))
      result = new EnglishPossessiveFilter(matchVersion, result);
    result = new LowerCaseFilter(matchVersion, result);
    result = new StopFilter(matchVersion, result, stopSet);
    ((StopFilter) result).setEnablePositionIncrements(false); // Disable holes' position increments
    if (stemExclusionsSet.size() > 0) {
      result = new KeywordMarkerFilter(result, stemExclusionsSet);
    }
    result = new SnowballFilter(result, "English");
    // getnGramLength() belongs to the enclosing class; "this" here would
    // refer to the anonymous Analyzer, so the call is left unqualified.
    result = new ShingleFilter(result, getnGramLength());
    return new TokenStreamComponents(source, result);
  }
};
------

Steve

-----Original Message-----
From: Jack Krupansky [mailto:j...@basetechnology.com]
Sent: Wednesday, September 19, 2012 7:16 PM
To: java-user@lucene.apache.org
Subject: Re: Using stop words with snowball analyzer and shingle filter

The underscores are due to the fact that the StopFilter defaults to "enable position increments", so there are no terms at the positions where the stop words appeared in the source text.

Unfortunately, SnowballAnalyzer does not pass that in as a parameter and is "final", so you can't subclass it to override the "createComponents" method that creates the StopFilter. You would essentially have to copy the source for SnowballAnalyzer and then add in the code to invoke StopFilter.setEnablePositionIncrements the way StopFilterFactory does.

-- Jack Krupansky

-----Original Message-----
From: Martin O'Shea
Sent: Wednesday, September 19, 2012 4:24 AM
To: java-user@lucene.apache.org
Subject: Using stop words with snowball analyzer and shingle filter

I'm currently giving the user an option to include stop words or not when filtering a body of text for ngram frequencies. Typically, this is done as follows:

snowballAnalyzer = new SnowballAnalyzer(Version.LUCENE_30, "English", stopWords);
shingleAnalyzer = new ShingleAnalyzerWrapper(snowballAnalyzer, this.getnGramLength());

stopWords is set to either a full list of words to include in ngrams or to remove from them.
this.getnGramLength() simply contains the current ngram length up to a maximum of three.

If I use stopwords in filtering text "satellite is definitely falling to Earth" for trigrams, the output is:

No=1, Key=to, Freq=1
No=2, Key=definitely, Freq=1
No=3, Key=falling to earth, Freq=1
No=4, Key=satellite, Freq=1
No=5, Key=is, Freq=1
No=6, Key=definitely falling to, Freq=1
No=7, Key=definitely falling, Freq=1
No=8, Key=falling, Freq=1
No=9, Key=to earth, Freq=1
No=10, Key=satellite is, Freq=1
No=11, Key=is definitely, Freq=1
No=12, Key=falling to, Freq=1
No=13, Key=is definitely falling, Freq=1
No=14, Key=earth, Freq=1
No=15, Key=satellite is definitely, Freq=1

But if I don't use stopwords for trigrams, the output is this:

No=1, Key=satellite, Freq=1
No=2, Key=falling _, Freq=1
No=3, Key=satellite _ _, Freq=1
No=4, Key=_ earth, Freq=1
No=5, Key=falling, Freq=1
No=6, Key=satellite _, Freq=1
No=7, Key=_ _, Freq=1
No=8, Key=_ falling _, Freq=1
No=9, Key=falling _ earth, Freq=1
No=10, Key=_, Freq=3
No=11, Key=earth, Freq=1
No=12, Key=_ _ falling, Freq=1
No=13, Key=_ falling, Freq=1

Why am I seeing underscores? I would have thought to see simple unigrams, "satellite falling" and "falling earth", and "satellite falling earth"?

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
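
The underscore behaviour discussed in this thread can be illustrated without Lucene. What follows is only a simplified simulation in plain Java, not Lucene's actual StopFilter or ShingleFilter code. The class name ShingleGapSketch, the shingles helper, and the keepHoles flag are all hypothetical, and the stop set {is, definitely, to} is an assumption inferred from the output in the original question. Stemming is left out. With keepHoles set to true, each removed stop word leaves a "_" filler at its position (roughly what happens when position increments are enabled); with keepHoles set to false, the remaining words close up (roughly what setEnablePositionIncrements(false) is meant to achieve).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified simulation only: this is NOT Lucene code, and every name here is hypothetical.
final class ShingleGapSketch {

    // Builds all n-grams of length 1..maxSize after removing stop words.
    // keepHoles = true  : a removed stop word leaves a "_" filler in its slot.
    // keepHoles = false : removed stop words leave no trace, so neighbours become adjacent.
    public static List<String> shingles(List<String> words, Set<String> stopWords,
                                        int maxSize, boolean keepHoles) {
        // Step 1: mimic lower-casing plus stop word removal.
        List<String> slots = new ArrayList<String>();
        for (String word : words) {
            String lower = word.toLowerCase();
            if (!stopWords.contains(lower)) {
                slots.add(lower);
            } else if (keepHoles) {
                slots.add("_");
            }
        }
        // Step 2: from each slot, emit the unigram, then the bigram, up to maxSize.
        List<String> out = new ArrayList<String>();
        for (int start = 0; start < slots.size(); start++) {
            StringBuilder gram = new StringBuilder();
            for (int n = 0; n < maxSize && start + n < slots.size(); n++) {
                if (n > 0) {
                    gram.append(' ');
                }
                gram.append(slots.get(start + n));
                out.add(gram.toString());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList(
                "satellite", "is", "definitely", "falling", "to", "Earth");
        // Assumed stop set, inferred from the question's output.
        Set<String> stops = new HashSet<String>(Arrays.asList("is", "definitely", "to"));

        System.out.println("holes kept:    " + shingles(words, stops, 3, true));
        System.out.println("holes dropped: " + shingles(words, stops, 3, false));
    }
}
```

Under these assumptions, the run with holes kept yields the same keys as the second listing in the question, including three occurrences of "_" on its own, and the run with holes dropped yields the six ngrams the question expected: "satellite", "satellite falling", "satellite falling earth", "falling", "falling earth" and "earth".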