[jira] Commented: (LUCENE-2167) Implement StandardTokenizer with the UAX#29 Standard

Steven Rowe (JIRA) Thu, 27 May 2010 06:54:50 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12872221#action_12872221 ]


Steven Rowe commented on LUCENE-2167:
-------------------------------------

bq. Must it be a Maven plugin? [...] It's much easier from Ant to invoke a Java
method as a build step.

Lucene's build could be converted to Maven, though, and this plugin could then
be a home for build-related tooling.

Maven Ant Tasks lets Ant run full Maven builds without a Maven installation:
http://maven.apache.org/ant-tasks/examples/mvn.html

bq. From what I see, the same code could be done as a simple Java class with
main(), like Robert's ICU converter. [snip]

I hadn't seen Robert's ICU converter - I'll take a look.

bq. A python script may also work well.

Perl is my scripting language of choice, not Python, but yes, a script would 
likely do the trick, assuming there are no external (Java) dependencies.  (And 
as you pointed out, HttpComponents, the only dependency of the Maven plugin, 
does not need to be a dependency.)

bq. You use the HTML root zone database from IANA. The format of this file is
hard to parse and may change suddenly. BIND administrators know that the root
zone file is also available for BIND in the standardized named format at
http://www.internic.net/zones/root.zone (ASCII only, as DNS is ASCII only).

I think I'll stick with the HTML version for now - there are no decoded 
versions of the internationalized TLDs and no descriptive information in the 
named-format version.  I agree the HTML format is not ideal, but it took me 
just a little while to put together the regexes to parse it; when the format 
changes, the effort to fix will likely be similarly small.
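
For comparison, here is a minimal sketch of how TLD labels could be pulled out
of named-format zone lines like those in root.zone. It is not the code in the
patch. The class name RootZoneTlds, the method extract(), and the sample
records are hypothetical and for illustration only; the sketch assumes fully
qualified owner names (ending in ".") and treats any owner with exactly one
label as a TLD.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical helper, not part of the LUCENE-2167 patch.
public class RootZoneTlds {

    /** Collects owner names that consist of a single label, e.g. "com." */
    public static SortedSet<String> extract(List<String> zoneLines) {
        SortedSet<String> tlds = new TreeSet<String>();
        for (String line : zoneLines) {
            String trimmed = line.trim();
            // Skip blank lines and zone-file comments.
            if (trimmed.isEmpty() || trimmed.startsWith(";")) {
                continue;
            }
            String owner = trimmed.split("\\s+")[0];
            // Assumption: owners are fully qualified. This also skips
            // continuation lines, whose first field is a TTL or class.
            if (!owner.endsWith(".")) {
                continue;
            }
            String name = owner.substring(0, owner.length() - 1);
            // Skip the root itself (".") and multi-label names such as glue.
            if (name.isEmpty() || name.indexOf('.') >= 0) {
                continue;
            }
            tlds.add(name.toLowerCase(Locale.ROOT));
        }
        return tlds;
    }

    public static void main(String[] args) {
        // Illustrative sample records only.
        List<String> sample = Arrays.asList(
            ".\t86400\tIN\tNS\ta.root-servers.net.",
            "com.\t172800\tIN\tNS\ta.gtld-servers.net.",
            "a.gtld-servers.net.\t172800\tIN\tA\t192.5.6.30",
            "xn--p1ai.\t172800\tIN\tNS\ta.dns.ripn.net.");
        System.out.println(extract(sample)); // prints [com, xn--p1ai]
    }
}
```

As noted above, this route yields only the ASCII (xn--) forms and none of the
descriptive information that the HTML version carries.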

bq. You can also write the date from the HTTP Last-Modified header
(HttpURLConnection.getLastModified()) into the generated file.

Excellent idea. I searched the HTML page source for this kind of information,
but it wasn't there.
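
A minimal sketch of that suggestion, assuming the generator has the source URL
at hand. The class name LastModifiedHeader, the methods toComment() and
fetchComment(), and the comment format are hypothetical; only
HttpURLConnection.getLastModified() comes from the suggestion itself. The call
returns 0 when the server sends no Last-Modified header, so the sketch handles
that case explicitly.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper, not part of the LUCENE-2167 patch.
public class LastModifiedHeader {

    /** Formats a Last-Modified timestamp as a comment line for the generated file. */
    public static String toComment(String sourceUrl, long lastModifiedMillis) {
        // getLastModified() reports 0 when the header is absent.
        if (lastModifiedMillis == 0L) {
            return "// Generated from " + sourceUrl + " (Last-Modified not reported)";
        }
        SimpleDateFormat fmt =
            new SimpleDateFormat("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.US);
        fmt.setTimeZone(TimeZone.getTimeZone("GMT"));
        return "// Generated from " + sourceUrl
            + " (Last-Modified: " + fmt.format(new Date(lastModifiedMillis)) + ")";
    }

    /** Issues a HEAD request and builds the comment line from the response. */
    public static String fetchComment(String sourceUrl) throws IOException {
        HttpURLConnection conn =
            (HttpURLConnection) new URL(sourceUrl).openConnection();
        try {
            conn.setRequestMethod("HEAD");
            return toComment(sourceUrl, conn.getLastModified());
        } finally {
            conn.disconnect();
        }
    }
}
```

The generator would call fetchComment() with whatever URL it downloads the
database from and write the returned line at the top of the generated file.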

bq. The database only contains the Punycode-encoded DNS names. But users use
the non-encoded variants, so you should decode Punycode, too [we need ICU for
that :( ] and create patterns for those as well.

I agree.  However, I looked into what's required to do internationalized domain 
names properly, and it's quite complicated.  I plan on doing what you suggest 
eventually, both for TLDs and all other domain labels, but I'd rather finish 
the ASCII implementation and deal with IRIs in a separate follow-on issue.
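
For the TLD part alone, decoding a single Punycode label can be sketched as
follows. This assumes a Java 6 or later runtime, where java.net.IDN is
available in the JDK and implements IDNA2003; whether that is an option
depends on the minimum Java version the build targets, and ICU remains the
alternative. The class name PunycodeTlds and the method decode() are
hypothetical. Full handling of internationalized domain names involves much
more than this one call.

```java
import java.net.IDN;

// Hypothetical helper, not part of the LUCENE-2167 patch.
public class PunycodeTlds {

    /**
     * Returns the Unicode form of an ASCII-compatible ("xn--") label.
     * Labels that are not ACE-encoded come back unchanged.
     */
    public static String decode(String asciiLabel) {
        return IDN.toUnicode(asciiLabel);
    }

    public static void main(String[] args) {
        System.out.println(decode("xn--p1ai")); // the Cyrillic TLD U+0440 U+0444
        System.out.println(decode("com"));      // prints com
    }
}
```

A generator could then emit both the encoded and the decoded form of each
internationalized TLD as alternatives in the pattern.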

bq. About changes in analyzer syntax because of regeneration: This should not
be a problem, as IANA only adds new zones to the file and very seldom removes
any (like the old Yugoslavian zones). As e-mail and web addresses should not
appear in tokenized text before their TLDs are in the zone file, it's no
problem that they are later suddenly marked as "URL/eMail" (as they cannot
appear before). So in my opinion we can update the zone database even in minor
Lucene releases without breaking analyzers.

+1


> Implement StandardTokenizer with the UAX#29 Standard
> ----------------------------------------------------
>
>                 Key: LUCENE-2167
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2167
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/analyzers
>    Affects Versions: 3.1
>            Reporter: Shyamal Prasad
>            Assignee: Steven Rowe
>            Priority: Minor
>         Attachments: LUCENE-2167-lucene-buildhelper-maven-plugin.patch, 
> LUCENE-2167.benchmark.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> It would be really nice for StandardTokenizer to adhere straight to the 
> standard as much as we can with jflex. Then its name would actually make 
> sense.
> Such a transition would involve renaming the old StandardTokenizer to 
> EuropeanTokenizer, as its javadoc claims:
> bq. This should be a good tokenizer for most European-language documents
> The new StandardTokenizer could then say
> bq. This should be a good tokenizer for most languages.
> All the English/Euro-centric stuff like the acronym/company/apostrophe stuff 
> can stay with that EuropeanTokenizer, and it could be used by the European 
> analyzers.
