You should be able to add an -Xmx####m option when you start the jar that
does the training. For instance, my machine has 8GB of RAM, so I usually
start the process with -Xmx7000m and run nothing else. For instance:

java -Xmx2048m -jar foo.jar

I think you may be stuck with a max of 2GB if you are running a 32-bit JVM.
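If you're not sure which JVM you have, you can check with

java -version

A 64-bit JVM will say something like "64-Bit Server VM" in its output.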
Another option is to use the Java API to train. Below is an example of
using the train API. If you put this in a NetBeans project, change the
paths in the code to point to your data, then right-click the project and
go to Properties -> Run; you can add the -Xmx option there in the VM
Options box. Then when you right-click the project and select "Run", it
will use the RAM you told it to.
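For reference, the training file (en-ner-person.train in the code below)
should be in the name finder training format: one sentence per line,
whitespace-tokenized, with each entity wrapped in <START:type> ... <END>
tags, along the lines of

<START:person> Pierre Vinken <END> , 61 years old , will join the board as
a nonexecutive director Nov. 29 .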
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.charset.Charset;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

try {
    System.out.println("\t\treading training data...");
    Charset charset = Charset.forName("UTF-8"); // might need to change this

    // path is the directory holding your training data -- change it
    ObjectStream<String> lineStream = new PlainTextByLineStream(
            new FileInputStream(path + "en-ner-person.train"), charset);
    ObjectStream<NameSample> sampleStream = new NameSampleDataStream(lineStream);

    // might need to change the language and the entity type
    TokenNameFinderModel model =
            NameFinderME.train("en", "your type here", sampleStream, null);
    sampleStream.close();

    // serialize the trained model to disk
    OutputStream modelOut = new BufferedOutputStream(
            new FileOutputStream(new File(path + "en-ner-person.train.model")));
    model.serialize(modelOut);
    modelOut.close();
    System.out.println("\tmodel generated");
} catch (Exception e) {
    e.printStackTrace();
}
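Once training finishes, here's a minimal sketch of loading the serialized
model back in and running it (assuming the same path variable as above; the
tokens array is just a made-up, pre-tokenized sentence):

import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.Span;

try {
    // load the serialized model from disk
    InputStream modelIn = new FileInputStream(path + "en-ner-person.train.model");
    TokenNameFinderModel model = new TokenNameFinderModel(modelIn);
    modelIn.close();

    NameFinderME nameFinder = new NameFinderME(model);

    // the name finder works on one pre-tokenized sentence at a time
    String[] tokens = {"Mr", ".", "Jones", "lives", "in", "Ohio", "."};
    Span[] names = nameFinder.find(tokens);
    for (Span s : names) {
        System.out.println(s); // prints token offsets and type, e.g. [0..3) person
    }
} catch (Exception e) {
    e.printStackTrace();
}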
hope this helps, good luck
Mark G
On Mon, Oct 7, 2013 at 7:53 AM, Jeffrey Zemerick <[email protected]> wrote:
> Hi,
>
> I'm new to OpenNLP (and NLP in general) and I'm trying to train the
> NameFinder on a large corpus (nearly 1 GB). After a few hours it will fail
> with a GC overhead limit exception. Do you have any suggestions on how I
> might accomplish this? Is it possible to train the model on parts of
> the input at a time? I tried increasing the memory available but that
> seemed to just prolong the exception.
>
> Thanks for any help.
>
> Jeff
>