[jira] [Commented] (LUCENE-6991) WordDelimiterFilter bug

2016-01-25 Thread Jack Krupansky (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15115405#comment-15115405
 ] 

Jack Krupansky commented on LUCENE-6991:


Does seem odd and wrong.

I also notice that it is not generating terms for the single letters from the 
%-escapes: %3A, %2F.

It also seems odd that the long token of catenated word parts does not include 
all of the word parts from the URL. It looks like a digit not preceded by a 
letter causes a break, while a digit preceded by a letter does not.
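A toy sketch of that hypothesis (plain Java, my own simplification, not Lucene's actual WordDelimiterFilter logic): delimiters are dropped and letters keep catenating, but a digit that does not directly follow a letter breaks the catenation run.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the hypothesised catenation rule -- NOT Lucene code.
class CatenationSketch {
    static List<String> catenateRuns(String s) {
        List<String> runs = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        char prev = 0;
        for (char c : s.toCharArray()) {
            // keep letters, and digits that directly follow a letter
            boolean keep = Character.isLetter(c)
                    || (Character.isDigit(c) && Character.isLetter(prev));
            if (keep) {
                run.append(c);
            } else if (Character.isDigit(c) && run.length() > 0) {
                // a digit not preceded by a letter breaks the run
                runs.add(run.toString());
                run.setLength(0);
            }
            // other delimiters (&, =, /, ...) are skipped: catenation continues
            prev = c;
        }
        if (run.length() > 0) runs.add(run.toString());
        return runs;
    }

    public static void main(String[] args) {
        // "cd=15": '1' follows '=', so the run breaks after "...cd"
        System.out.println(catenateRuns("source=web&cd=15&cad=rja")); // [sourcewebcd, cadrja]
        // "sig2": '2' follows a letter, so the run continues
        System.out.println(catenateRuns("usg=x&sig2=y"));             // [usgxsig2y]
    }
}
```

This reproduces the shape of the reported output (a catenated token ending at "...webcd", then "cad..." starting a fresh run), which is consistent with the digit-boundary theory.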

Since you are using the whitespace tokenizer, the WDF only sees one 
space-delimited term at a time. You might try your test with just the URL 
portion itself, both with and without the escaped quote, to see whether that 
affects anything.
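To make the pipeline concrete: with a whitespace tokenizer in front, the WDF never sees the whole log line, only one chunk at a time. A minimal sketch of that pre-tokenization (plain Java, no Lucene; the mini log line is hypothetical):

```java
import java.util.Arrays;
import java.util.List;

// Sketch of what a whitespace tokenizer hands downstream: each
// space-delimited chunk is one token, so the entire URL -- quote or
// no quote -- arrives at the WordDelimiterFilter as a single term.
class WhitespaceView {
    static List<String> whitespaceTokens(String line) {
        return Arrays.asList(line.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        String noQuote   = "GET http://example.com/url?a=1 HTTP/1.1";
        String withQuote = "GET \"http://example.com/url?a=1 HTTP/1.1";
        System.out.println(whitespaceTokens(noQuote).get(1));   // http://example.com/url?a=1
        System.out.println(whitespaceTokens(withQuote).get(1)); // "http://example.com/url?a=1
    }
}
```

The two URL tokens differ only in the leading quote character, which is exactly the difference the test case isolates.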


> WordDelimiterFilter bug
> ---
>
> Key: LUCENE-6991
> URL: https://issues.apache.org/jira/browse/LUCENE-6991
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 4.10.4, 5.3.1
>Reporter: Pawel Rog
>Priority: Minor
>
> I was preparing an analyzer which contains a WordDelimiterFilter and I realized 
> it sometimes gives results different than expected.
> I prepared a short test which shows the problem. I haven't used the Lucene test 
> framework for this, but that doesn't matter for showing the bug.
> {code}
> String urlIndexed = "144.214.37.14 - - [05/Jun/2013:08:39:27 +] \"GET /products/key-phrase-extractor/ HTTP/1.1\"" +
>     " 200 3437 http://www.google.com/url?sa=t&rct=j&q=&esrc=s&" +
>     "source=web&cd=15&cad=rja&ved=0CEgQFjAEOAo&url=http%3A%2F%2Fwww.sematext.com%2Fproducts%2Fkey-" +
>     "phrase-extractor%2F&ei=TPOuUbaWM-OKiQfGxIGYDw&usg=AFQjCNGwYAFYg_M3EZnp2eEWJzdvRrVPrg" +
>     "&sig2=oYitONI2EIZ0CQar7Ej8HA&bvm=bv.47380653,d.aGc\" \"Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:20.0) " +
>     "Gecko/20100101 Firefox/20.0\"";
> List<String> tokens1 = new ArrayList<>();
> List<String> tokens2 = new ArrayList<>();
> WhitespaceAnalyzer analyzer = new WhitespaceAnalyzer();
> TokenStream tokenStream = analyzer.tokenStream("test", urlIndexed);
> tokenStream = new WordDelimiterFilter(tokenStream,
>     WordDelimiterFilter.GENERATE_WORD_PARTS |
>     WordDelimiterFilter.CATENATE_WORDS |
>     WordDelimiterFilter.SPLIT_ON_CASE_CHANGE,
>     null);
> CharTermAttribute charAttrib = tokenStream.addAttribute(CharTermAttribute.class);
> tokenStream.reset();
> while (tokenStream.incrementToken()) {
>   tokens1.add(charAttrib.toString());
>   System.out.println(charAttrib.toString());
> }
> tokenStream.end();
> tokenStream.close();
> // Identical input except for the escaped quote before "http":
> urlIndexed = "144.214.37.14 - - [05/Jun/2013:08:39:27 +] \"GET /products/key-phrase-extractor/ HTTP/1.1\"" +
>     " 200 3437 \"http://www.google.com/url?sa=t&rct=j&q=&esrc=s&" +
>     "source=web&cd=15&cad=rja&ved=0CEgQFjAEOAo&url=http%3A%2F%2Fwww.sematext.com%2Fproducts%2Fkey-" +
>     "phrase-extractor%2F&ei=TPOuUbaWM-OKiQfGxIGYDw&usg=AFQjCNGwYAFYg_M3EZnp2eEWJzdvRrVPrg" +
>     "&sig2=oYitONI2EIZ0CQar7Ej8HA&bvm=bv.47380653,d.aGc\" \"Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:20.0) " +
>     "Gecko/20100101 Firefox/20.0\"";
> System.out.println("\n\n\n\n");
> tokenStream = analyzer.tokenStream("test", urlIndexed);
> tokenStream = new WordDelimiterFilter(tokenStream,
>     WordDelimiterFilter.GENERATE_WORD_PARTS |
>     WordDelimiterFilter.CATENATE_WORDS |
>     WordDelimiterFilter.SPLIT_ON_CASE_CHANGE,
>     null);
> charAttrib = tokenStream.addAttribute(CharTermAttribute.class);
> tokenStream.reset();
> while (tokenStream.incrementToken()) {
>   tokens2.add(charAttrib.toString());
>   System.out.println(charAttrib.toString());
> }
> tokenStream.end();
> tokenStream.close();
> assertEquals(Joiner.on(",").join(tokens1), Joiner.on(",").join(tokens2));
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6991) WordDelimiterFilter bug

2016-01-25 Thread Pawel Rog (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15115414#comment-15115414
 ] 

Pawel Rog commented on LUCENE-6991:
---

Thanks for the suggestion. When I changed the whitespace tokenizer to the 
keyword tokenizer, the test passes.







[jira] [Commented] (LUCENE-6991) WordDelimiterFilter bug

2016-01-25 Thread Pawel Rog (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15115029#comment-15115029
 ] 

Pawel Rog commented on LUCENE-6991:
---

Below are the tokens produced by the first token stream and by the second 
token stream.

{code}
Jun

GET
products
productskeyphraseextractor
key
phrase
extractor
HTTP
200
3437
http
httpwwwgooglecomurlsatrctjqesrcssourcewebcd
www
google
com
url
sa
t
rct
j
q
esrc
s
source
web
cd
cad
cadrjaved
rja
ved
QFj
QFjAEOAourlhttp
AEOAo
url
http
sematext
sematextcom
com
phrase
phraseextractor
extractor
ei
eiTPOuUbaWMOKiQfGxIGYDwusgAFQjCNGwYAFYgM3EZnp2eEWJzdvRrVPrgsig2oYitONI2EIZ0CQar7Ej8HAbv
TPOu
Uba
WM
OKi
Qf
Gx
IGYDw
usg
AFQj
CNGw
YAFYg
M3EZnp2e
EWJzdv
Rr
VPrg
sig2
o
Yit
ONI2EIZ0CQar7Ej8HA
bv
m
mbv
bv
d
daGc
a
Gc
Mozilla
X11
Ubuntu
Linux
i686
rv
Gecko
Firefox
{code}

{code}
Jun

GET
products
productskeyphraseextractor
key
phrase
extractor
HTTP
200
3437
http
httpwwwgooglecomurlsatrctjqesrcssourcewebcd
www
google
com
url
sa
t
rct
j
q
esrc
s
source
web
cd
cad
cadrjaved
rja
ved
QFj
QFjAEOAourlhttp
AEOAo
url
http
sematext
sematextcom
com
phrase
phraseextractor
extractor
ei
eiTPOuUbaWMOKiQfGxIGYDwusgAFQjCNGwYAFYgM3EZnp2eEWJzdvRrVPrgsig2oYitONI2EIZ0CQar7Ej8HAb
TPOu
Uba
WM
OKi
Qf
Gx
IGYDw
usg
AFQj
CNGw
YAFYg
M3EZnp2e
EWJzdv
Rr
VPrg
sig2
o
Yit
ONI2EIZ0CQar7Ej8HA
b
vm
vmbv
bv
d
daGc
a
Gc
Mozilla
X11
Ubuntu
Linux
i686
rv
Gecko
Firefox
{code}


The only difference in the input string is the quotation mark before "http". 
The difference in the output shows up in a few terms:

eiTPOuUbaWMOKiQfGxIGYDwusgAFQjCNGwYAFYgM3EZnp2eEWJzdvRrVPrgsig2oYitONI2EIZ0CQar7Ej8HAbv
 vs 
eiTPOuUbaWMOKiQfGxIGYDwusgAFQjCNGwYAFYgM3EZnp2eEWJzdvRrVPrgsig2oYitONI2EIZ0CQar7Ej8HAb

and mbv vs. vmbv.
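As an aside, a small hypothetical helper (not part of the test case above) makes this kind of divergence easier to locate than comparing the comma-joined strings:

```java
import java.util.List;

// Report the first index at which two token lists diverge
// (-1 if they are identical), to pinpoint diffs like mbv vs. vmbv.
class TokenDiff {
    static int firstDivergence(List<String> a, List<String> b) {
        int n = Math.min(a.size(), b.size());
        for (int i = 0; i < n; i++) {
            if (!a.get(i).equals(b.get(i))) return i;
        }
        return a.size() == b.size() ? -1 : n;
    }

    public static void main(String[] args) {
        List<String> t1 = List.of("bv", "m", "mbv");
        List<String> t2 = List.of("bv", "vm", "vmbv");
        System.out.println(firstDivergence(t1, t2)); // 1
    }
}
```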



