[ 
https://issues.apache.org/jira/browse/PYLUCENE-32?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174781#comment-14174781
 ] 

Alex commented on PYLUCENE-32:
------------------------------

Thanks Andi, but I am using PyLucene version 3.6.2. I think the problem has to 
do with JVM instantiation, caused by Java-Python array incompatibilities, but I 
don't know how to solve this. Below are the Java classes I added to the Lucene 
core; perhaps you will have a better understanding of what the issue is:

The lemmatizer:
/*
 * Lemmatizing library for Lucene
 * Copyright (C) 2010 Lars Buitinck
 *
 * This program is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program.  If not, see <http://www.gnu.org/licenses/>.
 */

package englishlemma;

import java.io.*;
import edu.stanford.nlp.tagger.maxent.MaxentTagger;
import edu.stanford.nlp.tagger.maxent.TaggerConfig;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;

/**
 * An analyzer that uses an {@link EnglishLemmaTokenizer}.
 *
 * @author  Lars Buitinck
 * @version 2010.1006
 */
public class EnglishLemmaAnalyzer extends Analyzer {
    private MaxentTagger posTagger;

    /**
     * Construct an analyzer with a tagger using the given model file.
     */
    public EnglishLemmaAnalyzer(String posModelFile) throws Exception {
        this(makeTagger(posModelFile));
    }

    /**
     * Construct an analyzer using the given tagger.
     */
    public EnglishLemmaAnalyzer(MaxentTagger tagger) {
        posTagger = tagger;
    }

    /**
     * Factory method for loading a POS tagger.
     */
    public static MaxentTagger makeTagger(String modelFile) throws Exception {
        TaggerConfig config = new TaggerConfig("-model", modelFile);
        // The final argument suppresses a "loading" message on stderr.
        return new MaxentTagger(modelFile, config, false);
    }

    @Override
    public TokenStream tokenStream(String fieldName, Reader input) {
        return new EnglishLemmaTokenizer(input, posTagger);
    }
}


The tokenizer for the lemmatizer:
/*
 * Lemmatizing library for Lucene
 * Copyright (c) 2010-2011 Lars Buitinck
 *
 * This program is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program.  If not, see <http://www.gnu.org/licenses/>.
 */

package englishlemma;

import java.io.*;
import java.util.*;
import java.util.regex.*;
import com.google.common.collect.Iterables;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.process.Morphology;
import edu.stanford.nlp.tagger.maxent.MaxentTagger;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

/**
 * A tokenizer that retrieves the lemmas (base forms) of English words.
 * Relies internally on the sentence splitter and tokenizer supplied with
 * the Stanford POS tagger.
 *
 * @author  Lars Buitinck
 * @version 2011.0122
 */
public class EnglishLemmaTokenizer extends TokenStream {
    private Iterator<TaggedWord> tagged;
    private PositionIncrementAttribute posIncr;
    private TaggedWord currentWord;
    private TermAttribute termAtt;
    private boolean lemmaNext;

    /**
     * Construct a tokenizer processing the given input and a tagger
     * using the given model file.
     */
    public EnglishLemmaTokenizer(Reader input, String posModelFile)
            throws Exception {
        this(input, EnglishLemmaAnalyzer.makeTagger(posModelFile));
    }

    /**
     * Construct a tokenizer processing the given input using the given tagger.
     */
    public EnglishLemmaTokenizer(Reader input, MaxentTagger tagger) {
        super();

        lemmaNext = false;
        posIncr = addAttribute(PositionIncrementAttribute.class);
        termAtt = addAttribute(TermAttribute.class);

        List<List<HasWord>> tokenized =
            MaxentTagger.tokenizeText(input);
        tagged = Iterables.concat(tagger.process(tokenized)).iterator();
    }

    /**
     * Consumers use this method to advance the stream to the next token.
     * The token stream emits inflected forms and lemmas interleaved (form1,
     * lemma1, form2, lemma2, etc.), giving lemmas and their inflected forms
     * the same PositionAttribute.
     */
    @Override
    public final boolean incrementToken() throws IOException {
        if (lemmaNext) {
            // Emit a lemma
            posIncr.setPositionIncrement(1);
            String tag  = currentWord.tag();
            String form = currentWord.word();
            termAtt.setTermBuffer(Morphology.stemStatic(form, tag).word());
        } else {
            // Emit inflected form, if not filtered out.

            // 0 because the lemma will come in the same position
            int increment = 0;
            for (;;) {
                if (!tagged.hasNext())
                    return false;
                currentWord = tagged.next();
                if (!unwantedPOS(currentWord.tag()))
                    break;
                increment++;
            }

            posIncr.setPositionIncrement(increment);
            termAtt.setTermBuffer(currentWord.word());
        }

        lemmaNext = !lemmaNext;
        return true;
    }

    private static final Pattern unwantedPosRE = Pattern.compile(
      "^(CC|DT|[LR]RB|MD|POS|PRP|UH|WDT|WP|WP\\$|WRB|\\$|\\#|\\.|\\,|:)$"
    );

    /**
     * Determines if words with a given POS tag should be omitted from the
     * index. Defaults to filtering out punctuation and function words
     * (pronouns, prepositions, "the", "a", etc.).
     *
     * @see <a href="http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/CQP-HTMLDemo/PennTreebankTS.html">The
     *      Penn Treebank tag set</a> used by Stanford NLP
     */
    protected boolean unwantedPOS(String tag) {
        return unwantedPosRE.matcher(tag).matches();
    }
}

Meanwhile, the tokenizer depends on Google Guava for its iterable/array 
handling (com.google.common.collect.Iterables), while the analyzer depends on 
the Stanford POS tagger.
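Given those extra dependencies, a NoClassDefFoundError at initVM() time usually means the jars are not on the classpath the JVM is started with. A minimal sketch of assembling a classpath string to pass to lucene.initVM() (the jar names and directory here are hypothetical; adjust them to your setup):

```python
import os

# Hypothetical jar names/locations -- adjust to wherever the Guava and
# Stanford tagger jars actually live on your machine.
EXTRA_JARS = ["guava.jar", "stanford-postagger.jar"]

def build_classpath(jar_dir, jars):
    """Join jar paths with the platform's classpath separator
    (';' on Windows, ':' elsewhere)."""
    return os.pathsep.join(os.path.join(jar_dir, j) for j in jars)

# The resulting string can be combined with PyLucene's own jars and
# passed to lucene.initVM(classpath=...), so that the JVM can resolve
# com.google.common.collect.Iterables and the tagger classes, e.g.:
#
#   import lucene
#   lucene.initVM(classpath=os.pathsep.join(
#       [lucene.CLASSPATH, build_classpath("C:\\jars", EXTRA_JARS)]))
print(build_classpath("/opt/jars", EXTRA_JARS))
```

This only adds the jars to the classpath at JVM startup; it does not regenerate any JCC wrappers, so the custom classes remain reachable only through the generic Java APIs.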

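Before starting the JVM, it can also help to verify that a class such as org.apache.lucene.analysis.CharArraySet is actually present in the jars you intend to put on the classpath. A diagnostic sketch (jar paths are hypothetical), relying only on the fact that jars are plain zip archives:

```python
import zipfile

def jar_contains(jar_path, class_name):
    """Check whether a compiled class is present inside a jar
    (jars are zip archives of .class files)."""
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()

# e.g. jar_contains("lucene-core-3.6.2.jar",
#                   "org.apache.lucene.analysis.CharArraySet")
```

If this returns False for every jar on the classpath, the ClassNotFoundException is expected and the fix is a classpath/build problem rather than anything in the analyzer code itself.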
Thanks.


> pylucene CharArraySet jvm error
> -------------------------------
>
>                 Key: PYLUCENE-32
>                 URL: https://issues.apache.org/jira/browse/PYLUCENE-32
>             Project: PyLucene
>          Issue Type: Question
>         Environment: I added a customized Lucene analyzer class to the Lucene 
> core in PyLucene. This class has Google Guava as a dependency, because of the 
> array-handling functions available in com.google.common.collect.Iterables. 
> When I tried to index using this analyzer, I got the following error:
>     Traceback (most recent call last):
>       File "C:\IndexFiles.py", line 78, in 
>         lucene.initVM()
>     JavaError: java.lang.NoClassDefFoundError: org/apache/lucene/analysis/CharArraySet
>     Java stacktrace:
>     java.lang.NoClassDefFoundError: org/apache/lucene/analysis/CharArraySet
>     Caused by: java.lang.ClassNotFoundException: org.apache.lucene.analysis.CharArraySet
>         at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>         at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>         at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>         at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>         at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> Even the example indexing code from Lucene in Action, which I tried earlier 
> and which worked, returns the same error after I added this class. I am not 
> too familiar with the CharArraySet class, but I can see the problem comes 
> from it. How do I handle this? Thanks
>            Reporter: Alex
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
