[
https://issues.apache.org/jira/browse/SOLR-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798268#action_12798268
]
Robert Muir commented on SOLR-1710:
-----------------------------------
Chris, yeah, it's supposed to be similar to
http://java.sun.com/j2se/1.4.2/docs/api/java/text/BreakIterator.html#next%28%29
I started by mimicking this API somewhat. I guess a future improvement would be
for this to truly be a real BreakIterator.
Then you could, say, create a RuleBasedBreakIterator or
DictionaryBasedBreakIterator (which are fast compiled DFAs) and customize how
words are delimited.
Currently, you can only do this by customizing the charTypeTable, which
cannot take any context into account, so it's rather limited.
All of the above is really just theoretical and not anything we should worry
about. For practical purposes I mimicked the BreakIterator API (but diverged
somewhat), just because I am used to working with it and found it was one way
to separate out a lot of the logic.
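
For anyone not familiar with the API being mimicked, here is a minimal sketch
of the standard java.text.BreakIterator iteration pattern (first(), then
next() until DONE). This is plain JDK code, not the WordDelimiterIterator from
the patch; the class and method names below are made up for illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class; only java.text.BreakIterator is a real API here.
class BreakIteratorSketch {

    // Walks the boundaries reported by a word BreakIterator and keeps the
    // pieces that contain at least one letter or digit.
    static List<String> words(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        // next() returns the following boundary, or DONE when exhausted.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end);
            if (piece.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(piece);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Power shot 500"));
    }
}
```

The caller only sees boundary offsets, so the delimiting rules live entirely
inside the iterator. That is the separation of logic mentioned above.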
> convert worddelimiterfilter to new tokenstream API
> --------------------------------------------------
>
> Key: SOLR-1710
> URL: https://issues.apache.org/jira/browse/SOLR-1710
> Project: Solr
> Issue Type: Improvement
> Components: Schema and Analysis
> Reporter: Robert Muir
> Attachments: SOLR-1710.patch, SOLR-1710.patch
>
>
> This one was a doozy; attached is a patch to convert it to the new
> tokenstream API.
> Some of the logic was split into WordDelimiterIterator (which exposes a
> BreakIterator-like API for iterating over subwords).
> The filter is much more efficient now: no cloning.
> Before applying the patch, copy the existing WordDelimiterFilter to
> OriginalWordDelimiterFilter.
> The patch includes a test case (TestWordDelimiterBWComp) which generates
> random strings from various subword combinations.
> For each random string, it compares output against the existing
> WordDelimiterFilter for all 512 combinations of boolean parameters.
> NOTE: due to bugs found (SOLR-1706), this currently only tests 256 of these
> combinations. The bugs discovered in SOLR-1706 are fixed here.
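
As a rough illustration of the exhaustive comparison described above: 512 is
2^9, which suggests nine boolean parameters (an inference from the number, not
something stated in the issue). One way to enumerate every combination is to
treat an int as a bitmask. This is only a sketch; the class and method names
are hypothetical and are not taken from TestWordDelimiterBWComp.

```java
// Hypothetical helper; not part of the patch.
class FlagCombinations {

    // Number of distinct settings for n independent boolean flags.
    static int count(int n) {
        return 1 << n;
    }

    // Decodes bit i of mask into flags[i].
    static boolean[] decode(int mask, int n) {
        boolean[] flags = new boolean[n];
        for (int i = 0; i < n; i++) {
            flags[i] = (mask & (1 << i)) != 0;
        }
        return flags;
    }

    public static void main(String[] args) {
        int n = 9; // assumed number of boolean parameters (2^9 = 512)
        int visited = 0;
        for (int mask = 0; mask < count(n); mask++) {
            boolean[] flags = decode(mask, n);
            // A real test would build the old and new filters with these
            // flags and compare their output token by token.
            if (flags.length == n) {
                visited++;
            }
        }
        System.out.println(visited + " combinations visited");
    }
}
```

Pinning one flag to a fixed value halves the loop, which matches the 256
combinations mentioned in the NOTE.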