On Wed, Nov 10, 2010 at 12:10:51PM -0500, Robert Muir wrote:
> One more note that I forgot to mention: in snowball's svn (but i think not
> in the libstemmer pkg) there is actually vocabulary test data: input files
> containing a sample vocabulary for each language, expected output, and
> combined files called 'diffs' that show what the stemmer changes.
> 
> these provide pretty good coverage for tests to ensure your
> integration is working... when they make a change to the algorithms
> these are updated too (though it seems not always in the same commit):
> 
> example: 
> http://svn.tartarus.org/snowball/trunk/data/german/diffs.txt?r1=527&r2=526&pathrev=527

I used this sample data to prepare tests for the Lingua::Stem::Snowball CPAN
distribution.  Now that we are bundling the Snowball C libraries, we no longer
benefit by proxy from that test suite, so we should roll our own tests.

Yesterday, I adapted the update_snowstem.pl script in
<https://issues.apache.org/jira/browse/LUCY-125> to work from an svn
checkout of Snowball; I committed the patches and closed the issue this
morning.

Next, I'll add test data generation to update_snowstem.pl and add test files
for each language to validate that our stemmers work properly.
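
For illustration only, here is a rough sketch of what such a per-language
check might look like.  It is written in Python for brevity and is not what
update_snowstem.pl generates.  The helper names (read_lines, find_mismatches)
and the `stem` callable are hypothetical, and it assumes the vocabulary and
expected-output files are parallel, with one word per line.

```python
from typing import Callable, List, Sequence, Tuple


def read_lines(path: str) -> List[str]:
    """Read a UTF-8 word list, one entry per line (assumed file layout)."""
    with open(path, encoding="utf-8") as fh:
        return [line.rstrip("\n") for line in fh]


def find_mismatches(
    words: Sequence[str],
    expected: Sequence[str],
    stem: Callable[[str], str],
) -> List[Tuple[str, str, str]]:
    """Return (word, expected_stem, actual_stem) for each wrong result.

    `stem` is a stand-in for whatever binding wraps the bundled stemmer.
    """
    if len(words) != len(expected):
        raise ValueError("vocabulary and expected output differ in length")
    mismatches = []
    for word, want in zip(words, expected):
        got = stem(word)
        if got != want:
            mismatches.append((word, want, got))
    return mismatches
```

An empty list back from find_mismatches means the stemmer agrees with the
sample data for that language; anything else pinpoints the words that diverge.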

Thanks for bringing it up!

Marvin Humphrey
