The Two Pass Data Indexer is the default; if you have a machine with enough
memory you might want to try the One Pass Data Indexer.
Anyway, it would be nice to get a jstack to see where it is spending its time;
maybe there is an I/O issue?

The training can take a very long time, but the data indexing should work.

To change the indexer you can set this parameter:
DataIndexer=OnePass
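
Since you are training via the API, here is a minimal sketch (illustrative
only, based on the opennlp.tools.util.TrainingParameters class; the exact
train(...) signature of your trainer depends on the OpenNLP version) of how
that parameter, together with the cutoff and iteration settings from your
setup, can be passed to the trainer:

    // Hedged sketch: only the "DataIndexer"/"OnePass" pair comes from this thread;
    // the cutoff and iteration values mirror the setup quoted below.
    import opennlp.tools.util.TrainingParameters;

    public class IndexerParams {
        public static TrainingParameters onePassParams() {
            TrainingParameters params = new TrainingParameters();
            params.put("DataIndexer", "OnePass");                   // default is the two-pass indexer
            params.put(TrainingParameters.CUTOFF_PARAM, "6");       // feature cutoff
            params.put(TrainingParameters.ITERATIONS_PARAM, "200"); // training iterations
            return params;
        }
    }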

HTH,
Jörn

On 04/26/2013 01:17 PM, Svetoslav Marinov wrote:
I prefer the API as it gives me more flexibility and fits the overall
architecture of our components. But here is part of my set-up:

Cutoff 6
Iterations 200
A custom feature generator that looks at the 4 previous and 2 subsequent
tokens.
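
My feature generator is a custom class, but as a rough, hedged sketch (not
the actual code), the windowing part is comparable to what the built-in
WindowFeatureGenerator in opennlp.tools.util.featuregen does:

    // Illustrative only -- the generator actually in use is a custom implementation.
    import opennlp.tools.util.featuregen.AdaptiveFeatureGenerator;
    import opennlp.tools.util.featuregen.CachedFeatureGenerator;
    import opennlp.tools.util.featuregen.TokenFeatureGenerator;
    import opennlp.tools.util.featuregen.WindowFeatureGenerator;

    public class WindowFeatures {
        public static AdaptiveFeatureGenerator create() {
            // Token features for the current token plus the 4 previous and 2 subsequent tokens.
            return new CachedFeatureGenerator(
                    new WindowFeatureGenerator(new TokenFeatureGenerator(), 4, 2));
        }
    }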

So, I gave it a whole night and I saw the process was dead in the morning.
But I'll give it another try and will let you know.

Thank you!

Svetoslav


On 2013-04-26 12:42, "Jörn Kottmann" <[email protected]> wrote:

I always edit the opennlp script and change it to what I need.

Anyway, we have a Two Pass Data Indexer which writes the features to disk
to save memory during indexing. Depending on how you train, you might
have a cutoff=5, which probably eliminates a lot of your features and
therefore saves a lot of memory.
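
For reference, a rough sketch of that indexing step at the maxent level
(assuming the pre-2.x opennlp.model / opennlp.maxent classes; the event file
name and format are made up for illustration):

    // Hedged sketch, not from this thread: TwoPassDataIndexer writes the events
    // to a temporary file to save memory during indexing, and the cutoff (5 here)
    // drops features seen fewer than 5 times before the model is trained.
    import java.io.FileReader;
    import opennlp.maxent.BasicEventStream;
    import opennlp.maxent.GIS;
    import opennlp.maxent.PlainTextByLineDataStream;
    import opennlp.model.AbstractModel;
    import opennlp.model.EventStream;
    import opennlp.model.TwoPassDataIndexer;

    public class IndexingSketch {
        public static AbstractModel train() throws Exception {
            // "train.events" is a hypothetical file with one training event per line.
            EventStream events = new BasicEventStream(
                    new PlainTextByLineDataStream(new FileReader("train.events")));
            return GIS.trainModel(100, new TwoPassDataIndexer(events, 5));
        }
    }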

The indexing might just need a bit of time; how long did you wait?

Jörn

On 04/26/2013 12:33 PM, William Colen wrote:
From the command line you can specify memory using

MAVEN_OPTS="-Xmx4048m"

You can also set it as a JVM argument if you are using it from the API:

java -Xmx4048m ...



On Fri, Apr 26, 2013 at 4:30 AM, Svetoslav Marinov <[email protected]> wrote:

I use the API. Can one specify the memory size via the command line? I
think the default there is 1024M? With 8G of memory it failed during
"Computing event counts...", and with 16G during indexing:
"Computing event counts...  done. 50153300 events
Indexing..."

Svetoslav

On 2013-04-26 09:12, "Jörn Kottmann" <[email protected]> wrote:

On 04/26/2013 09:06 AM, Svetoslav Marinov wrote:
I'm wondering what is the max size (if such exists) for training a NER
model? I have a corpus of 2 600 000 sentences annotated with just one
category, 310M in size. However, the training never finishes: 8G of memory
resulted in a Java out of memory exception, and when I increased it to 16G
it just died with no error message.

Do you use the command line interface or the API for the training?
At which stage of the training did you get the out of memory exception?
Where did it just die when you used 16G of memory (maybe do a jstack)?

Jörn



