[ 
https://issues.apache.org/jira/browse/SOLR-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798268#action_12798268
 ] 

Robert Muir commented on SOLR-1710:
-----------------------------------

Chris, yeah, it's supposed to be similar to 
http://java.sun.com/j2se/1.4.2/docs/api/java/text/BreakIterator.html#next%28%29

I started by mimicking this API somewhat. I guess a future improvement would be 
to make this a real BreakIterator.
Then you could, say, create a RuleBasedBreakIterator or 
DictionaryBasedBreakIterator (which are fast compiled DFAs) and customize how 
words are delimited.
Currently, you can only do this by customizing the charTypeTable, which 
cannot take any context into account, so it's rather limited.
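
For anyone not familiar with the API being mimicked, here is a minimal sketch 
of the standard java.text.BreakIterator first()/next() loop. It is not the 
WordDelimiterIterator from the patch; the class name BreakIteratorSketch and 
the helper words() are made up for illustration, and the letter-or-digit 
filter is just one simple way to skip whitespace and punctuation spans.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration class, not part of the SOLR-1710 patch.
public class BreakIteratorSketch {

    // Walks boundaries with first()/next() and keeps the spans that
    // contain at least one letter or digit.
    static List<String> words(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        // next() returns the following boundary, or DONE when exhausted.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String span = text.substring(start, end);
            if (span.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(span);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("hello world"));
    }
}
```

The point of the pattern is that the caller only sees boundary offsets, so 
the logic that decides where a boundary falls stays separate from the logic 
that consumes the pieces.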

All of the above is really just theoretical and not anything we should worry 
about. For practical purposes I mimicked the BreakIterator API (but diverged 
somewhat), just because I am used to working with it and found it was one way 
to separate out a lot of the logic.


> convert worddelimiterfilter to new tokenstream API
> --------------------------------------------------
>
>                 Key: SOLR-1710
>                 URL: https://issues.apache.org/jira/browse/SOLR-1710
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>            Reporter: Robert Muir
>         Attachments: SOLR-1710.patch, SOLR-1710.patch
>
>
> This one was a doozy. Attached is a patch to convert it to the new 
> tokenstream API.
> Some of the logic was split into WordDelimiterIterator (which exposes a 
> BreakIterator-like API for iterating subwords).
> The filter is much more efficient now: no cloning.
> Before applying the patch, copy the existing WordDelimiterFilter to 
> OriginalWordDelimiterFilter.
> The patch includes a test case (TestWordDelimiterBWComp) which generates 
> random strings from various subword combinations.
> For each random string, it compares output against the existing 
> WordDelimiterFilter for all 512 combinations of boolean parameters.
> NOTE: due to bugs found (SOLR-1706), this currently only tests 256 of these 
> combinations. The bugs discovered in SOLR-1706 are fixed here.
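
As a rough illustration of the exhaustive part of that test: 512 is 2^9, so 
covering every combination amounts to counting a 9-bit mask and decoding each 
bit into one boolean. The sketch below is not the code in 
TestWordDelimiterBWComp; the class name FlagCombinations and the method all() 
are hypothetical, and it does not say which flag each bit stands for. Holding 
one flag fixed leaves 2^8 = 256 combinations, which is one plausible (assumed) 
way to arrive at the reduced count mentioned in the NOTE.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class, not part of the SOLR-1710 patch.
public class FlagCombinations {

    // Returns every assignment of n boolean flags, decoded from the
    // bits of a counter running from 0 to 2^n - 1.
    static List<boolean[]> all(int n) {
        List<boolean[]> out = new ArrayList<>();
        for (int mask = 0; mask < (1 << n); mask++) {
            boolean[] flags = new boolean[n];
            for (int bit = 0; bit < n; bit++) {
                flags[bit] = (mask & (1 << bit)) != 0;
            }
            out.add(flags);
        }
        return out;
    }

    public static void main(String[] args) {
        // Number of combinations for nine independent boolean flags.
        System.out.println(all(9).size());
    }
}
```

A backwards-compatibility test in this style would build the old and the new 
filter from the same flags array and compare their output for each random 
input string.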

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
