[ https://issues.apache.org/jira/browse/LUCENE-5211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13772174#comment-13772174 ]
ASF subversion and git services commented on LUCENE-5211:
---------------------------------------------------------

Commit 1524809 from hoss...@apache.org in branch 'dev/trunk'
[ https://svn.apache.org/r1524809 ]

LUCENE-5211: Better javadocs and error checking of 'format' option in StopFilterFactory, as well as comments in all snowball formatted files about specifying format option

> StopFilterFactory docs do not advertise/explain the "format" option
> -------------------------------------------------------------------
>
>                 Key: LUCENE-5211
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5211
>             Project: Lucene - Core
>          Issue Type: Bug
>    Affects Versions: 4.2
>            Reporter: Hayden Muhl
>            Assignee: Hoss Man
>            Priority: Minor
>         Attachments: LUCENE-5211.code.patch, LUCENE-5211.stopfilecomments.patch
>
>
> StopFilterFactory supports a "format" option for controlling whether
> "getWordSet" or "getSnowballWordSet" is used to parse the file, but this
> option is not advertised. People can be confused by the example stopword
> files included in the releases (some of which are in the snowball format
> with "|" comments): they try to use them without explicitly specifying
> {{format="snowball"}} and silently get useless stopwords (which include the
> "| comments" as literal portions of the stopwords).
> We need to better document the use of "format" and consider updating all of
> the example stopword files we ship that are in the snowball format with a
> note about the need to use {{format="snowball"}} with those files.
> {panel:title=Initial Bug Report}
> The StopFilterFactory builds a CharArraySet directly from the raw lines of
> the supplied words file. This causes a problem when using the stop word files
> supplied with the Solr/Lucene distribution. In particular, the comments in
> those files get added to the CharArraySet. A line like this...
> ceci | this
> Should result in the string "ceci" being added to the CharArraySet, but
> "ceci | this" is what actually gets added.
> Workaround: Remove all comments from stop word files you are using.
> Suggested fix: The StopFilterFactory should strip any comments, then strip
> trailing whitespace. The stop word files supplied with the distribution
> should be edited to conform to the supported comment format.
> {panel}
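
To illustrate the difference described in the report, here is a minimal, self-contained Java sketch of the two parsing styles. It is not Lucene's code: the class StopwordFormatSketch and the methods parsePlain and parseSnowball are hypothetical names, and the sketch assumes only what the issue states (a plain parser keeps each raw line as one entry, while a snowball-style parser treats "|" as the start of a comment). It ignores any other comment handling the real loaders may perform.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not the Lucene implementation.
final class StopwordFormatSketch {

    // Plain style: each non-empty line, trimmed, becomes one entry,
    // so a trailing "| comment" ends up inside the stopword itself.
    static Set<String> parsePlain(List<String> lines) {
        Set<String> words = new LinkedHashSet<>();
        for (String line : lines) {
            String word = line.trim();
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    // Snowball style: drop everything from the first '|' onward,
    // then split what remains on whitespace.
    static Set<String> parseSnowball(List<String> lines) {
        Set<String> words = new LinkedHashSet<>();
        for (String line : lines) {
            int comment = line.indexOf('|');
            String content = (comment >= 0 ? line.substring(0, comment) : line).trim();
            if (content.isEmpty()) {
                continue; // comment-only or blank line
            }
            for (String word : content.split("\\s+")) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(" | French stop words", "ceci | this", "cela | that");
        System.out.println(parsePlain(lines));    // entries still contain the "| ..." text
        System.out.println(parseSnowball(lines)); // only "ceci" and "cela" remain
    }
}
```

Feeding the same file through both methods shows why a snowball-format file read without the snowball parser yields entries that can never match a real token.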