[jira] Commented: (LUCENE-1166) A tokenfilter to decompose compound words

Grant Ingersoll (JIRA) Tue, 29 Apr 2008 18:24:51 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12593198#action_12593198 ]


Grant Ingersoll commented on LUCENE-1166:
-----------------------------------------

This looks pretty good, Thomas.  I think the last thing worth adding is a 
start-to-finish usage example in the package docs, much like the one in the 
test case.  You might also want to explain a little about where to get the 
hyphenation files, etc. (if I am understanding them correctly). 

I think if we can finish that up, we can look to commit.

The other interesting thing here, as an aside, is that the Ternary Tree might 
be worth pulling up to a "util" package (no need to do so now, just thinking 
out loud), as it could be used for other interesting things.  For instance, see 
http://www.javaworld.com/javaworld/jw-02-2001/jw-0216-ternary.html.  The 
version we have needs a little work, but I have been thinking about how it 
might be used to improve spelling, etc.
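
For readers unfamiliar with the data structure, here is a minimal sketch of a 
ternary search tree with exact and prefix lookup, the kind of operation a 
spelling helper might build on.  It is illustrative only: it is not the 
TernaryTree class from the patch, and the class and method names below 
(TernarySearchTreeSketch, insert, contains, wordsWithPrefix) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative ternary search tree; NOT the TernaryTree class from the patch. */
class TernarySearchTreeSketch {
    private static final class Node {
        final char splitChar;
        Node low, equal, high;   // children for smaller, next, and larger characters
        boolean endOfWord;
        Node(char c) { splitChar = c; }
    }

    private Node root;

    /** Adds a non-empty word to the tree. */
    void insert(String word) {
        if (word == null || word.isEmpty()) return;
        root = insert(root, word, 0);
    }

    private Node insert(Node node, String word, int i) {
        char c = word.charAt(i);
        if (node == null) node = new Node(c);
        if (c < node.splitChar) node.low = insert(node.low, word, i);
        else if (c > node.splitChar) node.high = insert(node.high, word, i);
        else if (i < word.length() - 1) node.equal = insert(node.equal, word, i + 1);
        else node.endOfWord = true;
        return node;
    }

    /** Walks down to the node matching the last character of key, or returns null. */
    private Node find(String key) {
        Node node = root;
        int i = 0;
        while (node != null) {
            char c = key.charAt(i);
            if (c < node.splitChar) node = node.low;
            else if (c > node.splitChar) node = node.high;
            else if (i == key.length() - 1) return node;
            else { i++; node = node.equal; }
        }
        return null;
    }

    /** True only if the exact word was inserted. */
    boolean contains(String word) {
        if (word == null || word.isEmpty()) return false;
        Node node = find(word);
        return node != null && node.endOfWord;
    }

    /** Returns all stored words that start with prefix, in sorted order. */
    List<String> wordsWithPrefix(String prefix) {
        List<String> out = new ArrayList<>();
        if (prefix == null || prefix.isEmpty()) return out;
        Node node = find(prefix);
        if (node == null) return out;
        if (node.endOfWord) out.add(prefix);
        collect(node.equal, new StringBuilder(prefix), out);
        return out;
    }

    // In-order traversal: low subtree, this character, then high subtree.
    private void collect(Node node, StringBuilder sb, List<String> out) {
        if (node == null) return;
        collect(node.low, sb, out);
        sb.append(node.splitChar);
        if (node.endOfWord) out.add(sb.toString());
        collect(node.equal, sb, out);
        sb.setLength(sb.length() - 1);
        collect(node.high, sb, out);
    }

    public static void main(String[] args) {
        TernarySearchTreeSketch tree = new TernarySearchTreeSketch();
        tree.insert("schiff");
        tree.insert("schule");
        tree.insert("dampf");
        System.out.println(tree.wordsWithPrefix("sch")); // prints [schiff, schule]
    }
}
```

Prefix enumeration like this is one way such a tree could feed candidate 
suggestions; a real spelling component would need more (edit distance, 
ranking), which this sketch does not attempt.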

> A tokenfilter to decompose compound words
> -----------------------------------------
>
>                 Key: LUCENE-1166
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1166
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Analysis
>            Reporter: Thomas Peuss
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: CompoundTokenFilter.patch, CompoundTokenFilter.patch, 
> CompoundTokenFilter.patch, CompoundTokenFilter.patch, 
> CompoundTokenFilter.patch, CompoundTokenFilter.patch, 
> CompoundTokenFilter.patch, CompoundTokenFilter.patch, de.xml, hyphenation.dtd
>
>
> A tokenfilter to decompose the compound words found in many Germanic 
> languages (like German, Swedish, ...) into single tokens.
> An example: Donaudampfschiff would be decomposed to Donau, dampf, schiff so 
> that you can find the word even when you only enter "Schiff".
> I use the hyphenation code from the Apache XML project FOP 
> (http://xmlgraphics.apache.org/fop/) to do the first step of decomposition. 
> Currently I use the FOP jars directly. I only use a handful of classes from 
> the FOP project.
> My question now:
> Would it be OK to copy these classes over to the Lucene project (renaming the 
> packages of course) or should I stick with the dependency on the FOP jars? 
> The FOP code uses the ASF V2 license as well.
> What do you think?
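
To make the Donaudampfschiff example in the description concrete, here is a 
minimal sketch of dictionary-based decomposition.  It is an assumption-laden 
illustration, not the patch: the patch uses the FOP hyphenation code as its 
first step, whereas this sketch swaps in a plain greedy longest-match against 
a word set.  The class name, method name, and minPartLength parameter are 
hypothetical, and parts are returned lowercased.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Illustrative greedy compound splitter; NOT the hyphenation-based code in the patch. */
class CompoundSplitSketch {

    /**
     * Splits word into dictionary parts using greedy longest match from the left.
     * Characters that start no dictionary word are skipped. Parts are lowercased.
     */
    static List<String> decompose(String word, Set<String> dictionary, int minPartLength) {
        String lower = word.toLowerCase(Locale.ROOT);
        List<String> parts = new ArrayList<>();
        int start = 0;
        while (start < lower.length()) {
            int end = -1;
            // Try the longest candidate first, shrinking down to the minimum length.
            for (int e = lower.length(); e >= start + minPartLength; e--) {
                if (dictionary.contains(lower.substring(start, e))) {
                    end = e;
                    break;
                }
            }
            if (end < 0) {
                start++;          // no dictionary word starts here; move on
            } else {
                parts.add(lower.substring(start, end));
                start = end;
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(Arrays.asList("donau", "dampf", "schiff"));
        // prints [donau, dampf, schiff]
        System.out.println(decompose("Donaudampfschiff", dictionary, 3));
    }
}
```

Indexing the parts alongside the original token is what lets a query for 
"Schiff" match the compound.  Greedy matching can pick the wrong split when 
dictionary words overlap, which is one reason a hyphenation-driven first step 
is attractive.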


