Pawel Rog created LUCENE-6991:
---------------------------------
Summary: WordDelimiterFilter bug
Key: LUCENE-6991
URL: https://issues.apache.org/jira/browse/LUCENE-6991
Project: Lucene - Core
Issue Type: Bug
Reporter: Pawel Rog
Priority: Minor
I was preparing an analyzer that contains WordDelimiterFilter and noticed that
it sometimes gives results different than expected.
I prepared a short test that shows the problem. I haven't used the Lucene test
framework for this, but that doesn't matter for demonstrating the bug.
{code}
// First input: an access-log line whose referrer URL has no opening quote.
String urlIndexed = "144.214.37.14 - - [05/Jun/2013:08:39:27 +0000] \"GET " +
    "/products/key-phrase-extractor/ HTTP/1.1\"" +
    " 200 3437 http://www.google.com/url?sa=t&rct=j&q=&esrc=s&" +
    "source=web&cd=15&cad=rja&ved=0CEgQFjAEOAo&url=http%3A%2F%2Fwww.sematext.com%2Fproducts%2Fkey-" +
    "phrase-extractor%2F&ei=TPOuUbaWM-OKiQfGxIGYDw&usg=AFQjCNGwYAFYg_M3EZnp2eEWJzdvRrVPrg&sig2" +
    "=oYitONI2EIZ0CQar7Ej8HA&bvm=bv.47380653,d.aGc\" \"Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:20.0) " +
    "Gecko/20100101 Firefox/20.0\"";
List<String> tokens1 = new ArrayList<String>();
List<String> tokens2 = new ArrayList<String>();
WhitespaceAnalyzer analyzer = new WhitespaceAnalyzer();
TokenStream tokenStream = analyzer.tokenStream("test", urlIndexed);
tokenStream = new WordDelimiterFilter(tokenStream,
    WordDelimiterFilter.GENERATE_WORD_PARTS |
    WordDelimiterFilter.CATENATE_WORDS |
    WordDelimiterFilter.SPLIT_ON_CASE_CHANGE,
    null);
CharTermAttribute charAttrib =
    tokenStream.addAttribute(CharTermAttribute.class);
tokenStream.reset();
// Collect and print every token produced for the first input.
while (tokenStream.incrementToken()) {
    tokens1.add(charAttrib.toString());
    System.out.println(charAttrib.toString());
}
tokenStream.end();
tokenStream.close();
// Second input: the same log line, except that the referrer URL is preceded by an opening quote.
urlIndexed = "144.214.37.14 - - [05/Jun/2013:08:39:27 +0000] \"GET " +
    "/products/key-phrase-extractor/ HTTP/1.1\"" +
    " 200 3437 \"http://www.google.com/url?sa=t&rct=j&q=&esrc=s&" +
    "source=web&cd=15&cad=rja&ved=0CEgQFjAEOAo&url=http%3A%2F%2Fwww.sematext.com%2Fproducts%2Fkey-" +
    "phrase-extractor%2F&ei=TPOuUbaWM-OKiQfGxIGYDw&usg=AFQjCNGwYAFYg_M3EZnp2eEWJzdvRrVPrg&sig2" +
    "=oYitONI2EIZ0CQar7Ej8HA&bvm=bv.47380653,d.aGc\" \"Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:20.0) " +
    "Gecko/20100101 Firefox/20.0\"";
System.out.println("\n\n====\n\n");
tokenStream = analyzer.tokenStream("test", urlIndexed);
tokenStream = new WordDelimiterFilter(tokenStream,
    WordDelimiterFilter.GENERATE_WORD_PARTS |
    WordDelimiterFilter.CATENATE_WORDS |
    WordDelimiterFilter.SPLIT_ON_CASE_CHANGE,
    null);
charAttrib = tokenStream.addAttribute(CharTermAttribute.class);
tokenStream.reset();
// Collect and print every token produced for the second input.
while (tokenStream.incrementToken()) {
    tokens2.add(charAttrib.toString());
    System.out.println(charAttrib.toString());
}
tokenStream.end();
tokenStream.close();
// Expected: both inputs produce the same token sequence.
assertEquals(Joiner.on(",").join(tokens1), Joiner.on(",").join(tokens2));
{code}
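To narrow down where the two token streams diverge, rather than only failing on the joined strings, a small helper along these lines might be useful. It is a plain-Java sketch and not part of Lucene; the names TokenDiff and firstMismatch are hypothetical.

```java
import java.util.List;

// Hypothetical helper, not part of Lucene: locates the first position
// at which two token lists differ.
final class TokenDiff {

    // Returns the index of the first differing token, or -1 if the lists are equal.
    // If one list is a strict prefix of the other, returns the length of the shorter list.
    static int firstMismatch(List<String> a, List<String> b) {
        int n = Math.min(a.size(), b.size());
        for (int i = 0; i < n; i++) {
            if (!a.get(i).equals(b.get(i))) {
                return i;
            }
        }
        return a.size() == b.size() ? -1 : n;
    }
}
```

Calling TokenDiff.firstMismatch(tokens1, tokens2) after the second loop and printing the tokens around the returned index should show which part of the input is tokenized differently.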