You should be able to add an -Xmx####m option when you start the jar that
does the training. For instance, my machine has 8GB of RAM, so I usually
start the process with -Xmx7000m and run nothing else. For instance:

java -Xmx2048m -jar foo.jar

I think you may be stuck with a max of 2GB if you are running a 32-bit JVM.
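If you're not sure which JVM you have, you can check with

java -version

A 64-bit JVM will say something like "64-Bit Server VM" in its output.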
Another option is to use the Java API to train. Below is an example of
using the train API. If you put this in a NetBeans project, change the
paths in the code to point to your data, then right-click the project and
go to Properties -> Run; you can add the -Xmx option there in the VM
Options box. Then when you right-click the project and select "Run", it
will use the RAM you told it to.
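For reference, the training file (en-ner-person.train in the code below)
should be in the name finder training format: one sentence per line,
whitespace-tokenized, with each entity wrapped in <START:type> ... <END>
tags, along the lines of

<START:person> Pierre Vinken <END> , 61 years old , will join the board as
a nonexecutive director Nov. 29 .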
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.charset.Charset;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

try {
    System.out.println("\t\treading training data...");
    Charset charset = Charset.forName("UTF-8"); // might need to change this

    // path is the directory holding your training data -- change it
    ObjectStream<String> lineStream = new PlainTextByLineStream(
            new FileInputStream(path + "en-ner-person.train"), charset);
    ObjectStream<NameSample> sampleStream = new NameSampleDataStream(lineStream);

    // might need to change the language and the entity type
    TokenNameFinderModel model =
            NameFinderME.train("en", "your type here", sampleStream, null);
    sampleStream.close();

    // serialize the trained model to disk
    OutputStream modelOut = new BufferedOutputStream(
            new FileOutputStream(new File(path + "en-ner-person.train.model")));
    model.serialize(modelOut);
    modelOut.close();
    System.out.println("\tmodel generated");
} catch (Exception e) {
    e.printStackTrace();
}
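Once training finishes, here's a minimal sketch of loading the serialized
model back in and running it (assuming the same path variable as above; the
tokens array is just a made-up, pre-tokenized sentence):

import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.Span;

try {
    // load the serialized model from disk
    InputStream modelIn = new FileInputStream(path + "en-ner-person.train.model");
    TokenNameFinderModel model = new TokenNameFinderModel(modelIn);
    modelIn.close();

    NameFinderME nameFinder = new NameFinderME(model);

    // the name finder works on one pre-tokenized sentence at a time
    String[] tokens = {"Mr", ".", "Jones", "lives", "in", "Ohio", "."};
    Span[] names = nameFinder.find(tokens);
    for (Span s : names) {
        System.out.println(s); // prints token offsets and type, e.g. [0..3) person
    }
} catch (Exception e) {
    e.printStackTrace();
}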
hope this helps, good luck
Mark G
On Mon, Oct 7, 2013 at 7:53 AM, Jeffrey Zemerick <[email protected]> wrote:
> Hi,
>
> I'm new to OpenNLP (and NLP in general) and I'm trying to train the
> NameFinder on a large corpus (nearly 1 GB). After a few hours it will fail
> with a GC overhead limit exception. Do you have any suggestions on how I
> might accomplish this? Is it possible to train the model on parts of
> the input at a time? I tried increasing the memory available but that
> seemed to just prolong the exception.
>
> Thanks for any help.
>
> Jeff
>