[jira] [Comment Edited] (LUCENE-6993) Update UAX29URLEmailTokenizer TLDs to latest list, and upgrade all JFlex-based tokenizers to support Unicode 8.0

Steve Rowe (JIRA) Wed, 11 May 2016 21:04:35 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-6993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15281171#comment-15281171
 ]


Steve Rowe edited comment on LUCENE-6993 at 5/12/16 4:03 AM:
-------------------------------------------------------------

Hi Mike, my review of your latest patch:

* All the on-or-after tests in analysis factories should switch to 6.1.0 (since 
that's where this will likely land)
* I agree with Robert that ClassicTokenizer should stay on Unicode 3.0  - you 
changed it to 7.0.  That means it doesn't need version-specific behavior, so 
the factory changes and the build.xml targets aren't required.
* WikipediaTokenizer is in the same boat as ClassicTokenizer - it's essentially 
a cloned ClassicTokenizer with some modifications for Wiki syntax.  I vote for 
keeping it at Unicode 3.0 and reverting the factory changes and the build.xml 
targets.  Performing an effective upgrade would probably mean cloning the 
current StandardTokenizer and then re-layering the wiki syntax.
* In generateJavaUnicodeWordBreakTest.pl, you added the test for the double 
quote code point: {{push @tokens, '"'.join('', map \{ $_ ~~ /0022/ ? "\\\"" : 
"⧹⧹u$\_" } @chars).'"';}} - why use the smartmatch operator here? No recursion 
required or unknown types here. Just {{/0022/}} instead of {{$_ ~~ /0022/}} 
would work, right?
* TestStandardAnalyzer and TestUAX29URLEmailTokenizer refer to 
WordBreakTestUnicode_8_0_0, but that should be _7_0_0.
* In the run-jflex-and-disable-buffer-expansion target in build.xml, you 
removed the comment with the link to the relevant JIRA - why?
* You said:
bq. I looked at the new generated jflex code and I think it takes care of the 
buffer expansion issue.\
+1 - LGTM
* Robert said:
bq. TestStandardAnalyzer and TestUAX29URLEmailAnalyzer also have a 
testBackcompat40 which calls setVersion and ensures it works.
but AFAICT you didn't put any backcompat tests in place?
* you said:
bq. ant jflex-legacy # For some reason this needs to be run separately from the 
jflex command. I could never figure out why.
What happens if you make it a dependency of the jflex target?


was (Author: steve_rowe):
Hi Mike, my review of your latest patch:

* All the on-or-after tests in analysis factories should switch to 6.1.0 (since 
that's where this will likely land)
* I agree with Robert that ClassicTokenizer should stay on Unicode 3.0  - you 
changed it to 7.0.  That means it doesn't need version-specific behavior, so 
the factory changes and the build.xml targets aren't required.
* WikipediaTokenizer is in the same boat as ClassicTokenizer - it's essentially 
a cloned ClassicTokenizer with some modifications for Wiki syntax.  I vote for 
keeping it at Unicode 3.0 and reverting the factory changes and the build.xml 
targets.  Performing an effective upgrade would probably mean cloning the 
current StandardTokenizer and then re-layering the wiki syntax.
* In generateJavaUnicodeWordBreakTest.pl, you added the test for the double 
quote code point: {{push @tokens, '"'.join('', map \{ $_ ~~ /0022/ ? "\\\"" : 
"\ \u$\_" } @chars).'"';}} - why use the smartmatch operator here? No recursion 
required or unknown types here. Just {{/0022/}} instead of {{$_ ~~ /0022/}} 
would work, right?
* TestStandardAnalyzer and TestUAX29URLEmailTokenizer refer to 
WordBreakTestUnicode_8_0_0, but that should be _7_0_0.
* In the run-jflex-and-disable-buffer-expansion target in build.xml, you 
removed the comment with the link to the relevant JIRA - why?
* You said:
bq. I looked at the new generated jflex code and I think it takes care of the 
buffer expansion issue.\
+1 - LGTM
* Robert said:
bq. TestStandardAnalyzer and TestUAX29URLEmailAnalyzer also have a 
testBackcompat40 which calls setVersion and ensures it works.
but AFAICT you didn't put any backcompat tests in place?
* you said:
bq. ant jflex-legacy # For some reason this needs to be run separately from the 
jflex command. I could never figure out why.
What happens if you make it a dependency of the jflex target?

> Update UAX29URLEmailTokenizer TLDs to latest list, and upgrade all 
> JFlex-based tokenizers to support Unicode 8.0
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-6993
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6993
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Mike Drob
>            Assignee: Robert Muir
>             Fix For: 6.x
>
>         Attachments: LUCENE-6993.patch, LUCENE-6993.patch, LUCENE-6993.patch, 
> LUCENE-6993.patch, LUCENE-6993.patch, LUCENE-6993.patch, LUCENE-6993.patch, 
> LUCENE-6993.patch
>
>
> We did this once before in LUCENE-5357, but it might be time to update the 
> list of TLDs again. Comparing our old list with a new list indicates 800+ new 
> domains, so it would be nice to include them.
> Also the JFlex tokenizer grammars should be upgraded to support Unicode 8.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Comment Edited] (LUCENE-6993) Update UAX29URLEmailTokenizer TLDs to latest list, and upgrade all JFlex-based tokenizers to support Unicode 8.0

Reply via email to