which Snowball stemmers are in PyLucene?
Is there a programmatic way to figure out whether the Snowball stemmer for a particular language X is supported in a particular installation of PyLucene?

Bill
Re: which Snowball stemmers are in PyLucene?
On Oct 28, 2009, at 12:09, Bill Janssen <jans...@parc.com> wrote:

> Andi Vajda <va...@apache.org> wrote:
>> The snowball JAR comes from this statement in the Makefile:
>>
>>     SNOWBALL_JAR=$(LUCENE)/build/contrib/snowball/lucene-snowball-$(LUCENE_VER).jar
>>
>> Which means that it's whatever corresponds to the Lucene version checked
>> out. For PyLucene 2.9.0, that is:
>>     http://svn.apache.org/repos/asf/lucene/java/tags/lucene_2_9_0
>> In other words, this is a question best asked on the
>> java-u...@lucene.apache.org mailing list, as PyLucene doesn't do anything
>> different (at least intentionally).
>
> I've looked through that set of APIs and don't see anything useful. This
> was more of a brainstorming question for the list... What could we do in
> Python to enumerate the list?
>
>     import lucene
>     lucene.initVM(classpath=lucene.CLASSPATH)
>     for n, v in lucene.__dict__.items():
>         if n.endswith('Stemmer'):
>             print n, lucene.SnowballProgram.instance_(v)

That is checking whether a class is an instance of SnowballProgram, which is
probably not what you want. Use isAssignableFrom() maybe?

There may be an API in the Snowball library to do this enumeration. I don't
know, and that's why I suggested asking java-user. Nothing wrong with
brainstorming here, of course.

Andi..

> ItalianStemmer False
> FrenchStemmer False
> HungarianStemmer False
> LovinsStemmer False
> RussianStemmer False
> FinnishStemmer False
> PortugueseStemmer False
> KpStemmer False
> BrazilianStemmer False
> DanishStemmer False
> TurkishStemmer False
> DutchStemmer False
> SwedishStemmer False
> German2Stemmer False
> EnglishStemmer False
> GermanStemmer False
> RomanianStemmer False
> PorterStemmer False
> NorwegianStemmer False
> SpanishStemmer False
>
> Seems to me that this should give different results. Am I using the JCC
> instance_ method improperly? Bill
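The uniform False results fit Andi's diagnosis: the values in `lucene.__dict__` are the wrapper classes themselves, not instances, so an instance check against SnowballProgram fails for every one of them. A minimal pure-Python analogue of the distinction he is pointing at (the class names below are stand-ins, no PyLucene required):

```python
# Pure-Python analogue: a class object is not an *instance* of its base class.
class SnowballProgram:                   # stand-in for the Java SnowballProgram base class
    pass

class EnglishStemmer(SnowballProgram):   # stand-in for a generated stemmer class
    pass

# What the instance_() call was effectively asked:
# "is this class object an instance of SnowballProgram?"
print(isinstance(EnglishStemmer, SnowballProgram))    # False: it's a class, not an instance

# What isAssignableFrom() asks:
# "is this class a subclass of SnowballProgram?"
print(issubclass(EnglishStemmer, SnowballProgram))    # True

# An actual instance does pass the instance check:
print(isinstance(EnglishStemmer(), SnowballProgram))  # True
```

In PyLucene terms that suggests either instantiating each candidate class before calling `instance_()`, or comparing the underlying `java.lang.Class` objects with `isAssignableFrom()` as Andi suggests.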
Re: which Snowball stemmers are in PyLucene?
On Wed, 28 Oct 2009, Marvin Humphrey wrote:

> On Wed, Oct 28, 2009 at 12:20:55PM -0700, Andi Vajda wrote:
>> There may be an API in the Snowball library to do this enumeration.
>
> There's this, from libstemmer.h:
>
>     /** Returns an array of the names of the available stemming algorithms.
>      *  Note that these are the canonical names - aliases (ie, other names for
>      *  the same algorithm) will not be included in the list.
>      *  The list is terminated with a null pointer.
>      *
>      *  The list must not be modified in any way.
>      */
>     const char ** sb_stemmer_list(void);

Sorry, I should have said in the Java library.

Andi..
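For what it's worth, a NULL-terminated `const char **` like the one `sb_stemmer_list()` returns can be walked from Python with `ctypes`. The loop below is the real pattern; since libstemmer may not be installed, this sketch feeds it a fake array built in Python instead of calling the C function (the `CDLL` lines in the comment are an assumption about how one might load the real library):

```python
import ctypes

def names_from_null_terminated(arr):
    """Collect strings from a NULL-terminated char** (the sb_stemmer_list layout)."""
    names = []
    i = 0
    while arr[i] is not None:        # ctypes maps a NULL char* to None
        names.append(arr[i].decode('ascii'))
        i += 1
    return names

# Fake stand-in for sb_stemmer_list()'s result. The real call would look
# something like (library name/path varies by platform; untested assumption):
#   lib = ctypes.CDLL('libstemmer.so')
#   lib.sb_stemmer_list.restype = ctypes.POINTER(ctypes.c_char_p)
#   algorithms = names_from_null_terminated(lib.sb_stemmer_list())
fake_list = (ctypes.c_char_p * 4)(b"danish", b"english", b"french", None)
print(names_from_null_terminated(fake_list))  # ['danish', 'english', 'french']
```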
Re: Pylucene and JCC 2.4.1
On Wed, 28 Oct 2009, Andi Vajda wrote:

> On Oct 28, 2009, at 2:45, Manolo Padron Martinez <manol...@gmail.com> wrote:
>>> What is the version of your gcc? I did the same build today on Ubuntu
>>> Gutsy 64 bits without any problem.
>>
>> gcc (Debian 4.3.2-1.1) 4.3.2
>> g++ (Debian 4.3.2-1.1) 4.3.2
>
> Here are a few things you could try, in no particular order:
> - gcc 4.2

So I installed Debian 5 (Lenny) myself onto a virtual machine and built JCC
and PyLucene 2.9.1 from the trunk of the 2.9 branch.

Using gcc 4.2 worked fine. I had to ensure that the /usr/bin/g++ and
/usr/bin/gcc links point at /usr/bin/g++-4.2 and /usr/bin/gcc-4.2,
respectively. I then moved the links back to the 4.3 versions and rebuilt JCC
and PyLucene; that build succeeded as well. In both cases the tests passed
(make test). In other words, I can't reproduce the problem.

I did notice that both compilers use large amounts of memory when compiling
these large C++ files. Maybe you don't have enough memory on your system: I
gave my virtual machine 512 MB and it swapped like mad; with 1 GB of RAM the
builds completed fine. If that's indeed the problem, you can get JCC to
generate smaller but more numerous files by increasing NUM_FILES. Just
guessing here as to what the problem might be on your side.

Andi..