[ https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517371 ]

Stanislaw Osinski commented on LUCENE-966:
------------------------------------------

OK -- only now have I realized that I made a really silly mistake :) When using 
Mark's examples, I somehow took the ",type" substring as part of the token 
image, which made the JavaCC tokenizer look "buggy". Apologies for the 
confusion; tomorrow morning I'll correct my tests and see what's happening.

One more important clarification -- the tokenizer from the last patch 
(jflex-analyzer-r561693-compatibility.txt) has a completely different 
definition of the <NUM> token -- it allows digits in any segment, hence the 
totally different results. If we want to be compatible with the 
StandardAnalyzer, we should forget about that patch.

Mark -- have you tried the jflex-analyzer-r560135-patch.txt patch with your 
Wikipedia diff test? That's the early one whose grammar was translated "dot 
for dot" from the original JavaCC spec -- in the later patches I made some 
"optimizations", which seem to have broken compatibility.

Incidentally, what was the motivation for requiring the <NUM> token to have 
digits in every second segment rather than in any segment?
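
To make the two readings of <NUM> concrete, here is a rough Java sketch of the 
difference (not the actual JavaCC or JFlex grammar). The class and method names 
are made up for illustration, and the separator set (_ - / . ,) is an 
assumption about which punctuation joins the segments. The sketch only checks 
where digits occur; it does not validate that segments are alphanumeric.

```java
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only: compares the two <NUM> rules
// discussed above on a token that has already been cut out of the input.
final class NumTokenSketch {
    // Assumed separator characters between segments.
    private static final Pattern SEPARATOR = Pattern.compile("[_\\-/.,]");

    static boolean hasDigit(String segment) {
        return segment.chars().anyMatch(Character::isDigit);
    }

    // Splits into segments; returns null if there are fewer than two
    // segments or if any segment is empty.
    static String[] segments(String token) {
        String[] parts = SEPARATOR.split(token, -1);
        if (parts.length < 2) {
            return null;
        }
        for (String part : parts) {
            if (part.isEmpty()) {
                return null;
            }
        }
        return parts;
    }

    // Stricter rule: every second segment contains a digit, counting
    // from either the first or the second segment.
    static boolean digitInEverySecondSegment(String token) {
        String[] parts = segments(token);
        return parts != null && (allHaveDigit(parts, 0) || allHaveDigit(parts, 1));
    }

    // Looser rule: at least one segment, anywhere, contains a digit.
    static boolean digitInAnySegment(String token) {
        String[] parts = segments(token);
        if (parts == null) {
            return false;
        }
        for (String part : parts) {
            if (hasDigit(part)) {
                return true;
            }
        }
        return false;
    }

    private static boolean allHaveDigit(String[] parts, int start) {
        for (int i = start; i < parts.length; i += 2) {
            if (!hasDigit(parts[i])) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        for (String token : new String[] {"a.1.b.2", "a.b.1.c", "192.168.0.1", "a.b"}) {
            System.out.println(token
                + " everySecond=" + digitInEverySecondSegment(token)
                + " any=" + digitInAnySegment(token));
        }
    }
}
```

Under these assumptions a token such as "a.b.1.c" passes the looser rule but 
not the stricter one, which is the kind of input where the two tokenizers 
would be expected to disagree.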



> A faster JFlex-based replacement for StandardAnalyzer
> -----------------------------------------------------
>
>                 Key: LUCENE-966
>                 URL: https://issues.apache.org/jira/browse/LUCENE-966
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Stanislaw Osinski
>             Fix For: 2.3
>
>         Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt, 
> jflex-analyzer-r560135-patch.txt, jflex-analyzer-r561292-patch.txt, 
> jflex-analyzer-r561693-compatibility.txt
>
>
> JFlex (http://www.jflex.de/) can be used to generate a faster (up to several 
> times) replacement for StandardAnalyzer. I will add a patch and some simple 
> benchmark code shortly.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

