[ https://issues.apache.org/jira/browse/LUCENE-6991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15115029#comment-15115029 ]
Pawel Rog commented on LUCENE-6991:
-----------------------------------
Below are the tokens produced by the first token stream and by the second
token stream, in that order:
{code}
Jun
0000
GET
products
productskeyphraseextractor
key
phrase
extractor
HTTP
200
3437
http
httpwwwgooglecomurlsatrctjqesrcssourcewebcd
www
google
com
url
sa
t
rct
j
q
esrc
s
source
web
cd
cad
cadrjaved
rja
ved
QFj
QFjAEOAourlhttp
AEOAo
url
http
sematext
sematextcom
com
phrase
phraseextractor
extractor
ei
eiTPOuUbaWMOKiQfGxIGYDwusgAFQjCNGwYAFYgM3EZnp2eEWJzdvRrVPrgsig2oYitONI2EIZ0CQar7Ej8HAbv
TPOu
Uba
WM
OKi
Qf
Gx
IGYDw
usg
AFQj
CNGw
YAFYg
M3EZnp2e
EWJzdv
Rr
VPrg
sig2
o
Yit
ONI2EIZ0CQar7Ej8HA
bv
m
mbv
bv
d
daGc
a
Gc
Mozilla
X11
Ubuntu
Linux
i686
rv
Gecko
Firefox
{code}
{code}
Jun
0000
GET
products
productskeyphraseextractor
key
phrase
extractor
HTTP
200
3437
http
httpwwwgooglecomurlsatrctjqesrcssourcewebcd
www
google
com
url
sa
t
rct
j
q
esrc
s
source
web
cd
cad
cadrjaved
rja
ved
QFj
QFjAEOAourlhttp
AEOAo
url
http
sematext
sematextcom
com
phrase
phraseextractor
extractor
ei
eiTPOuUbaWMOKiQfGxIGYDwusgAFQjCNGwYAFYgM3EZnp2eEWJzdvRrVPrgsig2oYitONI2EIZ0CQar7Ej8HAb
TPOu
Uba
WM
OKi
Qf
Gx
IGYDw
usg
AFQj
CNGw
YAFYg
M3EZnp2e
EWJzdv
Rr
VPrg
sig2
o
Yit
ONI2EIZ0CQar7Ej8HA
b
vm
vmbv
bv
d
daGc
a
Gc
Mozilla
X11
Ubuntu
Linux
i686
rv
Gecko
Firefox
{code}
The only difference between the two input strings is the quotation mark before
"http" in the second one. The outputs differ in a few terms:
eiTPOuUbaWMOKiQfGxIGYDwusgAFQjCNGwYAFYgM3EZnp2eEWJzdvRrVPrgsig2oYitONI2EIZ0CQar7Ej8HAbv
vs
eiTPOuUbaWMOKiQfGxIGYDwusgAFQjCNGwYAFYgM3EZnp2eEWJzdvRrVPrgsig2oYitONI2EIZ0CQar7Ej8HAb
and
mbv vs vmbv
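
To make the mismatch easier to spot than by reading the two listings side by side, the lists can be compared position by position. The following is only a sketch: the class name TokenDiff and the method diff are hypothetical helpers and not part of Lucene, and the sample data is just the tail of the two listings above (the tokens around "bv"), not the full output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Lucene: reports every position at which
// two token lists disagree.
final class TokenDiff {

    // Returns one "index: left vs right" entry per differing position.
    // If one list is shorter, the missing side is shown as "<missing>".
    static List<String> diff(List<String> first, List<String> second) {
        List<String> out = new ArrayList<>();
        int n = Math.max(first.size(), second.size());
        for (int i = 0; i < n; i++) {
            String left = i < first.size() ? first.get(i) : "<missing>";
            String right = i < second.size() ? second.get(i) : "<missing>";
            if (!left.equals(right)) {
                out.add(i + ": " + left + " vs " + right);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Tail of the first and second token listings from this comment.
        List<String> tokens1 = Arrays.asList("ONI2EIZ0CQar7Ej8HA", "bv", "m", "mbv", "bv", "d");
        List<String> tokens2 = Arrays.asList("ONI2EIZ0CQar7Ej8HA", "b", "vm", "vmbv", "bv", "d");
        for (String line : diff(tokens1, tokens2)) {
            System.out.println(line);
        }
        // Prints:
        // 1: bv vs b
        // 2: m vs vm
        // 3: mbv vs vmbv
    }
}
```

In the test from the issue description, the same helper could be called with tokens1 and tokens2 in place of the single assertEquals on the joined strings, so that a failure lists the affected positions.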
> WordDelimiterFilter bug
> -----------------------
>
> Key: LUCENE-6991
> URL: https://issues.apache.org/jira/browse/LUCENE-6991
> Project: Lucene - Core
> Issue Type: Bug
> Affects Versions: 4.10.4, 5.3.1
> Reporter: Pawel Rog
> Priority: Minor
>
> I was preparing an analyzer which contains WordDelimiterFilter and I realized
> it sometimes gives results different than expected.
> I prepared a short test which shows the problem. I haven't used the Lucene
> test framework for this, but that doesn't matter for showing the bug.
> {code}
> String urlIndexed = "144.214.37.14 - - [05/Jun/2013:08:39:27 +0000] \"GET /products/key-phrase-extractor/ HTTP/1.1\"" +
> " 200 3437 http://www.google.com/url?sa=t&rct=j&q=&esrc=s&" +
> "source=web&cd=15&cad=rja&ved=0CEgQFjAEOAo&url=http%3A%2F%2Fwww.sematext.com%2Fproducts%2Fkey-" +
> "phrase-extractor%2F&ei=TPOuUbaWM-OKiQfGxIGYDw&usg=AFQjCNGwYAFYg_M3EZnp2eEWJzdvRrVPrg&sig2" +
> "=oYitONI2EIZ0CQar7Ej8HA&bvm=bv.47380653,d.aGc\" \"Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:20.0) " +
> "Gecko/20100101 Firefox/20.0\"";
> List<String> tokens1 = new ArrayList<String>();
> List<String> tokens2 = new ArrayList<String>();
> WhitespaceAnalyzer analyzer = new WhitespaceAnalyzer();
> TokenStream tokenStream = analyzer.tokenStream("test", urlIndexed);
> tokenStream = new WordDelimiterFilter(tokenStream,
> WordDelimiterFilter.GENERATE_WORD_PARTS |
> WordDelimiterFilter.CATENATE_WORDS |
> WordDelimiterFilter.SPLIT_ON_CASE_CHANGE,
> null);
> CharTermAttribute charAttrib =
> tokenStream.addAttribute(CharTermAttribute.class);
> tokenStream.reset();
> while(tokenStream.incrementToken()) {
> tokens1.add(charAttrib.toString());
> System.out.println(charAttrib.toString());
> }
> tokenStream.end();
> tokenStream.close();
> urlIndexed = "144.214.37.14 - - [05/Jun/2013:08:39:27 +0000] \"GET /products/key-phrase-extractor/ HTTP/1.1\"" +
> " 200 3437 \"http://www.google.com/url?sa=t&rct=j&q=&esrc=s&" +
> "source=web&cd=15&cad=rja&ved=0CEgQFjAEOAo&url=http%3A%2F%2Fwww.sematext.com%2Fproducts%2Fkey-" +
> "phrase-extractor%2F&ei=TPOuUbaWM-OKiQfGxIGYDw&usg=AFQjCNGwYAFYg_M3EZnp2eEWJzdvRrVPrg&sig2" +
> "=oYitONI2EIZ0CQar7Ej8HA&bvm=bv.47380653,d.aGc\" \"Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:20.0) " +
> "Gecko/20100101 Firefox/20.0\"";
> System.out.println("\n\n====\n\n");
> tokenStream = analyzer.tokenStream("test", urlIndexed);
> tokenStream = new WordDelimiterFilter(tokenStream,
> WordDelimiterFilter.GENERATE_WORD_PARTS |
> WordDelimiterFilter.CATENATE_WORDS |
> WordDelimiterFilter.SPLIT_ON_CASE_CHANGE,
> null);
> charAttrib = tokenStream.addAttribute(CharTermAttribute.class);
> tokenStream.reset();
> while(tokenStream.incrementToken()) {
> tokens2.add(charAttrib.toString());
> System.out.println(charAttrib.toString());
> }
> tokenStream.end();
> tokenStream.close();
> assertEquals(Joiner.on(",").join(tokens1), Joiner.on(",").join(tokens2));
> {code}