[
https://issues.apache.org/jira/browse/LUCENE-6993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15281171#comment-15281171
]
Steve Rowe edited comment on LUCENE-6993 at 5/12/16 4:03 AM:
-------------------------------------------------------------
Hi Mike, my review of your latest patch:
* All the on-or-after tests in analysis factories should switch to 6.1.0 (since
that's where this will likely land)
* I agree with Robert that ClassicTokenizer should stay on Unicode 3.0 - you
changed it to 7.0. That means it doesn't need version-specific behavior, so
the factory changes and the build.xml targets aren't required.
* WikipediaTokenizer is in the same boat as ClassicTokenizer - it's essentially
a cloned ClassicTokenizer with some modifications for Wiki syntax. I vote for
keeping it at Unicode 3.0 and reverting the factory changes and the build.xml
targets. Performing an effective upgrade would probably mean cloning the
current StandardTokenizer and then re-layering the wiki syntax.
* In generateJavaUnicodeWordBreakTest.pl, you added the test for the double
quote code point: {{push @tokens, '"'.join('', map \{ $_ ~~ /0022/ ? "\\\"" :
"⧹⧹u$\_" } @chars).'"';}} - why use the smartmatch operator here? No recursion
required or unknown types here. Just {{/0022/}} instead of {{$_ ~~ /0022/}}
would work, right?
* TestStandardAnalyzer and TestUAX29URLEmailTokenizer refer to
WordBreakTestUnicode_8_0_0, but that should be _7_0_0.
* In the run-jflex-and-disable-buffer-expansion target in build.xml, you
removed the comment with the link to the relevant JIRA - why?
* You said:
bq. I looked at the new generated jflex code and I think it takes care of the
buffer expansion issue.\
+1 - LGTM
* Robert said:
bq. TestStandardAnalyzer and TestUAX29URLEmailAnalyzer also have a
testBackcompat40 which calls setVersion and ensures it works.
but AFAICT you didn't put any backcompat tests in place?
* you said:
bq. ant jflex-legacy # For some reason this needs to be run separately from the
jflex command. I could never figure out why.
What happens if you make it a dependency of the jflex target?
was (Author: steve_rowe):
Hi Mike, my review of your latest patch:
* All the on-or-after tests in analysis factories should switch to 6.1.0 (since
that's where this will likely land)
* I agree with Robert that ClassicTokenizer should stay on Unicode 3.0 - you
changed it to 7.0. That means it doesn't need version-specific behavior, so
the factory changes and the build.xml targets aren't required.
* WikipediaTokenizer is in the same boat as ClassicTokenizer - it's essentially
a cloned ClassicTokenizer with some modifications for Wiki syntax. I vote for
keeping it at Unicode 3.0 and reverting the factory changes and the build.xml
targets. Performing an effective upgrade would probably mean cloning the
current StandardTokenizer and then re-layering the wiki syntax.
* In generateJavaUnicodeWordBreakTest.pl, you added the test for the double
quote code point: {{push @tokens, '"'.join('', map \{ $_ ~~ /0022/ ? "\\\"" :
"\ \u$\_" } @chars).'"';}} - why use the smartmatch operator here? No recursion
required or unknown types here. Just {{/0022/}} instead of {{$_ ~~ /0022/}}
would work, right?
* TestStandardAnalyzer and TestUAX29URLEmailTokenizer refer to
WordBreakTestUnicode_8_0_0, but that should be _7_0_0.
* In the run-jflex-and-disable-buffer-expansion target in build.xml, you
removed the comment with the link to the relevant JIRA - why?
* You said:
bq. I looked at the new generated jflex code and I think it takes care of the
buffer expansion issue.\
+1 - LGTM
* Robert said:
bq. TestStandardAnalyzer and TestUAX29URLEmailAnalyzer also have a
testBackcompat40 which calls setVersion and ensures it works.
but AFAICT you didn't put any backcompat tests in place?
* you said:
bq. ant jflex-legacy # For some reason this needs to be run separately from the
jflex command. I could never figure out why.
What happens if you make it a dependency of the jflex target?
> Update UAX29URLEmailTokenizer TLDs to latest list, and upgrade all
> JFlex-based tokenizers to support Unicode 8.0
> ----------------------------------------------------------------------------------------------------------------
>
> Key: LUCENE-6993
> URL: https://issues.apache.org/jira/browse/LUCENE-6993
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/analysis
> Reporter: Mike Drob
> Assignee: Robert Muir
> Fix For: 6.x
>
> Attachments: LUCENE-6993.patch, LUCENE-6993.patch, LUCENE-6993.patch,
> LUCENE-6993.patch, LUCENE-6993.patch, LUCENE-6993.patch, LUCENE-6993.patch,
> LUCENE-6993.patch
>
>
> We did this once before in LUCENE-5357, but it might be time to update the
> list of TLDs again. Comparing our old list with a new list indicates 800+ new
> domains, so it would be nice to include them.
> Also the JFlex tokenizer grammars should be upgraded to support Unicode 8.0.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]