[ 
https://issues.apache.org/jira/browse/LUCENE-6991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15115405#comment-15115405
 ] 

Jack Krupansky commented on LUCENE-6991:
----------------------------------------

Does seem odd and wrong.

I also notice that it is not generating terms for the single letters from the 
%-escapes: %3A, %2F.

It also seems odd that the long token of catenated word parts does not contain 
all of the word parts from the URL. It looks as if a digit not preceded by a 
letter causes a break, while a digit preceded by a letter prevents one.
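
One possible reading of that digit behaviour, written as a toy model and not 
as Lucene's actual code: the test never sets SPLIT_ON_NUMERICS, so assume that 
letter/digit transitions do not split, and assume that a run counts as a word 
part only when its first character is a letter. The class WordPartSketch and 
its method wordParts are hypothetical names made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only. ASSUMPTIONS (not taken from Lucene's source):
//  - non-alphanumeric characters are delimiters,
//  - a lower-case to upper-case change splits a run,
//  - letter/digit changes do NOT split (SPLIT_ON_NUMERICS is not set in the test),
//  - a run is kept as a word part only if its first character is a letter.
public class WordPartSketch {

    static List<String> wordParts(String term) {
        List<String> parts = new ArrayList<>();
        int start = -1; // start index of the current run, -1 when between runs
        for (int i = 0; i <= term.length(); i++) {
            boolean atEnd = i == term.length();
            char c = atEnd ? ' ' : term.charAt(i);
            boolean alnum = !atEnd && Character.isLetterOrDigit(c);
            boolean caseBreak = alnum && start >= 0 && i > start
                    && Character.isLowerCase(term.charAt(i - 1))
                    && Character.isUpperCase(c);
            if (start >= 0 && (!alnum || caseBreak)) {
                // Close the current run; keep it only if it starts with a letter.
                if (Character.isLetter(term.charAt(start))) {
                    parts.add(term.substring(start, i));
                }
                start = -1;
            }
            if (alnum && start < 0) {
                start = i; // open a new run at this character
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(wordParts("%3A%2F"));    // runs "3A" and "2F" start with a digit
        System.out.println(wordParts("a2b-3c"));    // "a2b" starts with a letter, "3c" does not
        System.out.println(wordParts("keyPhrase")); // split on the case change
    }
}
```

Under these assumptions the runs "3A" and "2F" start with a digit and are 
dropped, which would match the missing single-letter terms. Whether 
WordDelimiterFilter really classifies runs this way would need to be checked 
against its source.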

Since you are using the whitespace tokenizer, the WDF sees only one 
space-delimited term at a time. You might try your test with just the URL 
portion itself, both with and without the escaped quote, to see whether that 
affects anything.
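
To find the term worth isolating, a plain-Java sketch along these lines could 
list the space-delimited terms that differ between two inputs. It uses 
String.split("\\s+") as a stand-in for the whitespace tokenizer, which is an 
approximation, and the class name, method names and the example.com inputs are 
made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: compare two inputs term by term after splitting on whitespace.
// ASSUMPTION: String.split("\\s+") approximates what a whitespace tokenizer emits.
public class IsolateTermSketch {

    static String[] whitespaceTerms(String text) {
        return text.trim().split("\\s+");
    }

    // Returns {termFromFirst, termFromSecond} pairs for positions where the terms differ.
    static List<String[]> differingTerms(String first, String second) {
        String[] a = whitespaceTerms(first);
        String[] b = whitespaceTerms(second);
        List<String[]> diffs = new ArrayList<>();
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            if (!a[i].equals(b[i])) {
                diffs.add(new String[] {a[i], b[i]});
            }
        }
        return diffs;
    }

    public static void main(String[] args) {
        // Hypothetical shortened inputs; the second has an escaped quote before the URL.
        String without = "200 3437 http://example.com/url?u=http%3A%2F%2Fa.b\" \"Mozilla/5.0";
        String with = "200 3437 \"http://example.com/url?u=http%3A%2F%2Fa.b\" \"Mozilla/5.0";
        for (String[] pair : differingTerms(without, with)) {
            System.out.println(pair[0] + "  <->  " + pair[1]);
        }
    }
}
```

Each term it prints could then be fed to the filter on its own, once with and 
once without the leading quote.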


> WordDelimiterFilter bug
> -----------------------
>
>                 Key: LUCENE-6991
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6991
>             Project: Lucene - Core
>          Issue Type: Bug
>    Affects Versions: 4.10.4, 5.3.1
>            Reporter: Pawel Rog
>            Priority: Minor
>
> I was preparing an analyzer that contains WordDelimiterFilter and I realized 
> it sometimes gives results different than expected.
> I prepared a short test which shows the problem. I haven't used Lucene tests 
> for this, but that doesn't matter for showing the bug.
> {code}
>     String urlIndexed = "144.214.37.14 - - [05/Jun/2013:08:39:27 +0000] \"GET /products/key-phrase-extractor/ HTTP/1.1\"" +
>             " 200 3437 http://www.google.com/url?sa=t&rct=j&q=&esrc=s&" +
>             "source=web&cd=15&cad=rja&ved=0CEgQFjAEOAo&url=http%3A%2F%2Fwww.sematext.com%2Fproducts%2Fkey-" +
>             "phrase-extractor%2F&ei=TPOuUbaWM-OKiQfGxIGYDw&usg=AFQjCNGwYAFYg_M3EZnp2eEWJzdvRrVPrg&sig2" +
>             "=oYitONI2EIZ0CQar7Ej8HA&bvm=bv.47380653,d.aGc\" \"Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:20.0) " +
>             "Gecko/20100101 Firefox/20.0\"";
>     List<String> tokens1 = new ArrayList<String>();
>     List<String> tokens2 = new ArrayList<String>();
>     WhitespaceAnalyzer analyzer = new WhitespaceAnalyzer();
>     TokenStream tokenStream = analyzer.tokenStream("test", urlIndexed);
>     tokenStream = new WordDelimiterFilter(tokenStream,
>             WordDelimiterFilter.GENERATE_WORD_PARTS |
>             WordDelimiterFilter.CATENATE_WORDS |
>             WordDelimiterFilter.SPLIT_ON_CASE_CHANGE,
>         null);
>     CharTermAttribute charAttrib = tokenStream.addAttribute(CharTermAttribute.class);
>     tokenStream.reset();
>     while(tokenStream.incrementToken()) {
>       tokens1.add(charAttrib.toString());
>       System.out.println(charAttrib.toString());
>     }
>     tokenStream.end();
>     tokenStream.close();
>     urlIndexed = "144.214.37.14 - - [05/Jun/2013:08:39:27 +0000] \"GET /products/key-phrase-extractor/ HTTP/1.1\"" +
>         " 200 3437 \"http://www.google.com/url?sa=t&rct=j&q=&esrc=s&" +
>         "source=web&cd=15&cad=rja&ved=0CEgQFjAEOAo&url=http%3A%2F%2Fwww.sematext.com%2Fproducts%2Fkey-" +
>         "phrase-extractor%2F&ei=TPOuUbaWM-OKiQfGxIGYDw&usg=AFQjCNGwYAFYg_M3EZnp2eEWJzdvRrVPrg&sig2" +
>         "=oYitONI2EIZ0CQar7Ej8HA&bvm=bv.47380653,d.aGc\" \"Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:20.0) " +
>         "Gecko/20100101 Firefox/20.0\"";
>     System.out.println("\n\n====\n\n");
>     tokenStream = analyzer.tokenStream("test", urlIndexed);
>     tokenStream = new WordDelimiterFilter(tokenStream,
>             WordDelimiterFilter.GENERATE_WORD_PARTS |
>             WordDelimiterFilter.CATENATE_WORDS |
>             WordDelimiterFilter.SPLIT_ON_CASE_CHANGE,
>         null);
>     charAttrib = tokenStream.addAttribute(CharTermAttribute.class);
>     tokenStream.reset();
>     while(tokenStream.incrementToken()) {
>       tokens2.add(charAttrib.toString());
>       System.out.println(charAttrib.toString());
>     }
>     tokenStream.end();
>     tokenStream.close();
>     assertEquals(Joiner.on(",").join(tokens1), Joiner.on(",").join(tokens2));
> {code}



