[jira] [Commented] (LUCENE-6991) WordDelimiterFilter bug
[ https://issues.apache.org/jira/browse/LUCENE-6991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115405#comment-15115405 ] Jack Krupansky commented on LUCENE-6991:
---
Does seem odd and wrong. I also notice that it is not generating terms for the single letters from the %-escapes: %3A, %2F. It also seems odd that that long token of catenated word parts does not contain all of the word parts from the URL. It looks as though a digit not preceded by a letter causes a break, while a digit preceded by a letter prevents one. Since you are using the whitespace tokenizer, the WDF only sees one space-delimited term at a time. You might try your test with just the URL portion itself, both with and without the escaped quote, to see whether that affects anything.

> WordDelimiterFilter bug
> ---
>
> Key: LUCENE-6991
> URL: https://issues.apache.org/jira/browse/LUCENE-6991
> Project: Lucene - Core
> Issue Type: Bug
> Affects Versions: 4.10.4, 5.3.1
> Reporter: Pawel Rog
> Priority: Minor
>
> I was preparing an analyzer which contains WordDelimiterFilter and I realized it
> sometimes gives results different than expected.
> I prepared a short test which shows the problem. I haven't used the Lucene test
> framework for this, but that doesn't matter for showing the bug.
> {code}
> String urlIndexed = "144.214.37.14 - - [05/Jun/2013:08:39:27 +] \"GET /products/key-phrase-extractor/ HTTP/1.1\"" +
>     " 200 3437 http://www.google.com/url?sa=t=j==s&" +
>     "source=web=15=rja=0CEgQFjAEOAo=http%3A%2F%2Fwww.sematext.com%2Fproducts%2Fkey-" +
>     "phrase-extractor%2F=TPOuUbaWM-OKiQfGxIGYDw=AFQjCNGwYAFYg_M3EZnp2eEWJzdvRrVPrg" +
>     "=oYitONI2EIZ0CQar7Ej8HA=bv.47380653,d.aGc\" \"Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:20.0) " +
>     "Gecko/20100101 Firefox/20.0\"";
> List<String> tokens1 = new ArrayList<>();
> List<String> tokens2 = new ArrayList<>();
> WhitespaceAnalyzer analyzer = new WhitespaceAnalyzer();
> TokenStream tokenStream = analyzer.tokenStream("test", urlIndexed);
> tokenStream = new WordDelimiterFilter(tokenStream,
>     WordDelimiterFilter.GENERATE_WORD_PARTS |
>     WordDelimiterFilter.CATENATE_WORDS |
>     WordDelimiterFilter.SPLIT_ON_CASE_CHANGE,
>     null);
> CharTermAttribute charAttrib = tokenStream.addAttribute(CharTermAttribute.class);
> tokenStream.reset();
> while (tokenStream.incrementToken()) {
>     tokens1.add(charAttrib.toString());
>     System.out.println(charAttrib.toString());
> }
> tokenStream.end();
> tokenStream.close();
> // Same log line, but with an escaped quote before "http"
> urlIndexed = "144.214.37.14 - - [05/Jun/2013:08:39:27 +] \"GET /products/key-phrase-extractor/ HTTP/1.1\"" +
>     " 200 3437 \"http://www.google.com/url?sa=t=j==s&" +
>     "source=web=15=rja=0CEgQFjAEOAo=http%3A%2F%2Fwww.sematext.com%2Fproducts%2Fkey-" +
>     "phrase-extractor%2F=TPOuUbaWM-OKiQfGxIGYDw=AFQjCNGwYAFYg_M3EZnp2eEWJzdvRrVPrg" +
>     "=oYitONI2EIZ0CQar7Ej8HA=bv.47380653,d.aGc\" \"Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:20.0) " +
>     "Gecko/20100101 Firefox/20.0\"";
> System.out.println("\n\n\n\n");
> tokenStream = analyzer.tokenStream("test", urlIndexed);
> tokenStream = new WordDelimiterFilter(tokenStream,
>     WordDelimiterFilter.GENERATE_WORD_PARTS |
>     WordDelimiterFilter.CATENATE_WORDS |
>     WordDelimiterFilter.SPLIT_ON_CASE_CHANGE,
>     null);
> charAttrib = tokenStream.addAttribute(CharTermAttribute.class);
> tokenStream.reset();
> while (tokenStream.incrementToken()) {
>     tokens2.add(charAttrib.toString());
>     System.out.println(charAttrib.toString());
> }
> tokenStream.end();
> tokenStream.close();
> assertEquals(Joiner.on(",").join(tokens1), Joiner.on(",").join(tokens2));
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
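Jack's hypothesis about where the splits occur can be sanity-checked outside Lucene. The sketch below is a rough Python emulation (assumed semantics, not Lucene's actual implementation) of GENERATE_WORD_PARTS plus SPLIT_ON_CASE_CHANGE: it splits each whitespace-delimited token on non-alphanumeric delimiters and on lowercase-to-uppercase transitions, and deliberately does not split on digits, since SPLIT_ON_NUMERICS is not among the flags used in the reported test:

```python
import re

def word_parts(token: str) -> list[str]:
    """Rough emulation (assumed semantics) of WordDelimiterFilter's
    GENERATE_WORD_PARTS | SPLIT_ON_CASE_CHANGE: split on non-alphanumeric
    delimiters, then on a lowercase-to-uppercase case change. Digits do
    not trigger splits here, mirroring SPLIT_ON_NUMERICS being unset."""
    parts = []
    for subword in re.split(r"[^0-9A-Za-z]+", token):
        if subword:
            # break "camelCase" between a lowercase letter and the
            # uppercase letter that follows it
            parts.extend(re.split(r"(?<=[a-z])(?=[A-Z])", subword))
    return parts

print(word_parts("key-phrase-extractor"))    # ['key', 'phrase', 'extractor']
print(word_parts("OKiQfGxIGYDw"))            # ['OKi', 'Qf', 'Gx', 'IGYDw']
print(word_parts("oYitONI2EIZ0CQar7Ej8HA"))  # ['o', 'Yit', 'ONI2EIZ0CQar7Ej8HA']
```

The last two calls reproduce the word parts visible in the token dumps further down this thread, which is consistent with case changes, not digits, driving the splits under these flags.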
[jira] [Commented] (LUCENE-6991) WordDelimiterFilter bug
[ https://issues.apache.org/jira/browse/LUCENE-6991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115414#comment-15115414 ] Pawel Rog commented on LUCENE-6991:
---
Thanks for the suggestion. When I changed the whitespace tokenizer to the keyword tokenizer, the test passes.
[jira] [Commented] (LUCENE-6991) WordDelimiterFilter bug
[ https://issues.apache.org/jira/browse/LUCENE-6991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115029#comment-15115029 ] Pawel Rog commented on LUCENE-6991:
---
Below are the tokens produced by the first token stream and by the second token stream:
{code}
Jun GET products productskeyphraseextractor key phrase extractor HTTP 200 3437 http httpwwwgooglecomurlsatrctjqesrcssourcewebcd www google com url sa t rct j q esrc s source web cd cad cadrjaved rja ved QFj QFjAEOAourlhttp AEOAo url http sematext sematextcom com phrase phraseextractor extractor ei eiTPOuUbaWMOKiQfGxIGYDwusgAFQjCNGwYAFYgM3EZnp2eEWJzdvRrVPrgsig2oYitONI2EIZ0CQar7Ej8HAbv TPOu Uba WM OKi Qf Gx IGYDw usg AFQj CNGw YAFYg M3EZnp2e EWJzdv Rr VPrg sig2 o Yit ONI2EIZ0CQar7Ej8HA bv m mbv bv d daGc a Gc Mozilla X11 Ubuntu Linux i686 rv Gecko Firefox
{code}
{code}
Jun GET products productskeyphraseextractor key phrase extractor HTTP 200 3437 http httpwwwgooglecomurlsatrctjqesrcssourcewebcd www google com url sa t rct j q esrc s source web cd cad cadrjaved rja ved QFj QFjAEOAourlhttp AEOAo url http sematext sematextcom com phrase phraseextractor extractor ei eiTPOuUbaWMOKiQfGxIGYDwusgAFQjCNGwYAFYgM3EZnp2eEWJzdvRrVPrgsig2oYitONI2EIZ0CQar7Ej8HAb TPOu Uba WM OKi Qf Gx IGYDw usg AFQj CNGw YAFYg M3EZnp2e EWJzdv Rr VPrg sig2 o Yit ONI2EIZ0CQar7Ej8HA b vm vmbv bv d daGc a Gc Mozilla X11 Ubuntu Linux i686 rv Gecko Firefox
{code}
The only difference in the input string is the quotation mark before "http".
The difference in the output is in a few terms:
eiTPOuUbaWMOKiQfGxIGYDwusgAFQjCNGwYAFYgM3EZnp2eEWJzdvRrVPrgsig2oYitONI2EIZ0CQar7Ej8HAbv
vs.
eiTPOuUbaWMOKiQfGxIGYDwusgAFQjCNGwYAFYgM3EZnp2eEWJzdvRrVPrgsig2oYitONI2EIZ0CQar7Ej8HAb
and mbv vs. vmbv.
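To pinpoint exactly where two such streams diverge, the token lists can be compared element-wise rather than joined into one comma-separated string for assertEquals. A small generic helper, sketched in Python (not part of the original test):

```python
def first_divergence(tokens1, tokens2):
    """Return (index, token1, token2) for the first position where the
    two token lists differ, or None if they are identical."""
    for i, (a, b) in enumerate(zip(tokens1, tokens2)):
        if a != b:
            return (i, a, b)
    if len(tokens1) != len(tokens2):
        # one list is a strict prefix of the other
        i = min(len(tokens1), len(tokens2))
        return (i,
                tokens1[i] if i < len(tokens1) else None,
                tokens2[i] if i < len(tokens2) else None)
    return None

# Shortened stand-ins for the diverging tail tokens quoted above
print(first_divergence(["HAbv", "m", "mbv", "bv"],
                       ["HAb", "vm", "vmbv", "bv"]))  # (0, 'HAbv', 'HAb')
```

Reporting the index and the offending pair makes it obvious which whitespace chunk produced the differing catenated token, instead of a single giant string mismatch.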