[ https://issues.apache.org/jira/browse/LUCENENET-414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13032982#comment-13032982 ]
Digy commented on LUCENENET-414:
Hi Vincent,
I changed the CharArraySet implementation.
Can you take a look at the 2.9.4g branch?
(https://svn.apache.org/repos/asf/incubator/lucene.net/branches/Lucene.Net_2_9_4g)
DIGY
The definition of CharArraySet is dangerously confusing and leads to bugs
when used.
Key: LUCENENET-414
URL: https://issues.apache.org/jira/browse/LUCENENET-414
Project: Lucene.Net
Issue Type: Bug
Components: Lucene.Net Core
Affects Versions: Lucene.Net 2.9.2
Environment: Irrelevant
Reporter: Vincent Van Den Berghe
Priority: Minor
Fix For: Lucene.Net 2.9.2
Right now, CharArraySet derives from System.Collections.Hashtable, but
doesn't actually use this base type for storing elements.
However, the StandardAnalyzer.STOP_WORDS_SET is exposed as a
System.Collections.Hashtable. The trivial code to build your own stopword set
using the StandardAnalyzer.STOP_WORDS_SET and adding your own set of
stopwords like this:
CharArraySet stopWords = new CharArraySet(StandardAnalyzer.STOP_WORDS_SET,
    ignoreCase: false);
foreach (string domainSpecificStopWord in DomainSpecificStopWords)
    stopWords.Add(domainSpecificStopWord);
... will fail silently: the CharArraySet constructor accepts an ICollection
and is passed the Hashtable instance of STOP_WORDS_SET, which holds no
elements because CharArraySet does not store anything in its base type. The
resulting stopWords will only contain the DomainSpecificStopWords, and not
those from STOP_WORDS_SET.
One workaround would be to replace the first line with this:
CharArraySet stopWords = new CharArraySet(
    StandardAnalyzer.STOP_WORDS_SET.Count + DomainSpecificStopWords.Length,
    ignoreCase: false);
foreach (string standardStopWord in
    (CharArraySet)StandardAnalyzer.STOP_WORDS_SET)
    stopWords.Add(standardStopWord);
... but this relies on an implementation detail (the STOP_WORDS_SET is
really an UnmodifiableCharArraySet, which is itself a CharArraySet). It works
because the cast forces the foreach to use the correct
CharArraySet.GetEnumerator(), which is declared as a "new" method that hides
the Hashtable one rather than overriding it (this has a bad code smell to it).
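
The behaviour described above can be reproduced without Lucene.Net. The following is a minimal sketch, not the actual CharArraySet code: HidingSet and Demo are hypothetical names, and the class only imitates the pattern reported here (deriving from Hashtable, storing elements elsewhere, and hiding GetEnumerator() with "new").

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Hypothetical stand-in for CharArraySet: derives from Hashtable but keeps
// its elements in a private list, so the Hashtable base stays empty.
class HidingSet : Hashtable
{
    private readonly List<string> items = new List<string>();

    public void Add(string item) { items.Add(item); }

    // 'new' hides Hashtable.GetEnumerator() instead of overriding it.
    public new IEnumerator GetEnumerator() { return items.GetEnumerator(); }
}

static class Demo
{
    // Enumerates through the interface, as a constructor taking an
    // ICollection would. This reaches Hashtable's own enumerator.
    public static int CountViaInterface(ICollection source)
    {
        int n = 0;
        foreach (object o in source) n++;
        return n;
    }

    // Enumerates through the derived type, so foreach binds to the
    // hiding GetEnumerator() and sees the privately stored elements.
    public static int CountViaDerivedType(HidingSet source)
    {
        int n = 0;
        foreach (object o in source) n++;
        return n;
    }

    static void Main()
    {
        var set = new HidingSet();
        set.Add("foo");
        set.Add("bar");
        Console.WriteLine(CountViaInterface(set));   // 0
        Console.WriteLine(CountViaDerivedType(set)); // 2
    }
}
```

The same object yields a different number of elements depending only on the static type it is enumerated through, which is why the cast in the workaround changes the result.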
At least 2 possibilities exist to solve this problem:
- Make CharArraySet use the Hashtable instance and a custom comparator,
instead of its own implementation.
- Make CharArraySet use HashSet<char[]>, defined in .NET 4.0.
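
As a rough sketch of the second possibility: a HashSet<char[]> needs a custom equality comparer, because arrays are otherwise compared by reference. CharArrayComparer and ComparerDemo below are hypothetical names introduced for illustration, and the invariant-culture lowercasing is an assumption, not how Lucene.Net handles ignoreCase.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical comparer: treats two char[] as equal when their contents
// match, optionally ignoring case.
sealed class CharArrayComparer : IEqualityComparer<char[]>
{
    private readonly bool ignoreCase;

    public CharArrayComparer(bool ignoreCase) { this.ignoreCase = ignoreCase; }

    public bool Equals(char[] x, char[] y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null || x.Length != y.Length) return false;
        for (int i = 0; i < x.Length; i++)
            if (Normalize(x[i]) != Normalize(y[i])) return false;
        return true;
    }

    public int GetHashCode(char[] chars)
    {
        if (chars == null) return 0;
        int hash = 17;
        // Hash the normalized characters so equal arrays share a hash code.
        foreach (char c in chars) hash = unchecked(hash * 31 + Normalize(c));
        return hash;
    }

    // Assumption: case folding via the invariant culture.
    private char Normalize(char c)
    {
        return ignoreCase ? char.ToLowerInvariant(c) : c;
    }
}

static class ComparerDemo
{
    static void Main()
    {
        var stopWords = new HashSet<char[]>(new CharArrayComparer(true));
        stopWords.Add("the".ToCharArray());
        Console.WriteLine(stopWords.Contains("THE".ToCharArray())); // True

        // Without a comparer, arrays are compared by reference.
        var byReference = new HashSet<char[]>();
        byReference.Add("the".ToCharArray());
        Console.WriteLine(byReference.Contains("the".ToCharArray())); // False
    }
}
```

With this approach the set is enumerated through a single, non-hidden GetEnumerator(), so copying it through ICollection or IEnumerable<char[]> sees every element.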
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira