Size of training data

2013-04-26 Thread Svetoslav Marinov
Hi all,

I'm wondering whether there is a maximum size (if such exists) for the data used to
train a NER model. I have a corpus of 2 600 000 sentences annotated with just one
category, 310 MB in size. However, the training never finishes: with 8 GB of memory it
ended in a Java out-of-memory exception, and when I increased the heap to 16 GB the
process just died with no error message.

Any help about this would be highly appreciated.

Svetoslav



Re: Size of training data

2013-04-26 Thread Jörn Kottmann

On 04/26/2013 09:06 AM, Svetoslav Marinov wrote:

I'm wondering whether there is a maximum size (if such exists) for the data used to
train a NER model. I have a corpus of 2 600 000 sentences annotated with just one
category, 310 MB in size. However, the training never finishes: with 8 GB of memory it
ended in a Java out-of-memory exception, and when I increased the heap to 16 GB the
process just died with no error message.


Do you use the command line interface or the API for the training?
At which stage of the training did you get the out-of-memory exception?
And where did it die when you used 16 GB of memory (maybe take a jstack)?

Jörn


Re: Size of training data

2013-04-26 Thread William Colen
From the command line you can specify the memory using

MAVEN_OPTS=-Xmx4048m

You can also set it as a JVM argument if you are using the API:

java -Xmx4048m ...
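
If you launch the training from your own main class via the API, one quick way to
confirm the -Xmx setting actually reached the process is to print the configured
maximum heap at startup. This is just a generic JVM call, not something from the
thread; the class name is a placeholder:

    // Hypothetical sanity check: print the maximum heap the JVM was started with,
    // so you can confirm the -Xmx flag reached the training process.
    public class HeapCheck {
        public static void main(String[] args) {
            long maxBytes = Runtime.getRuntime().maxMemory();
            System.out.println("Max heap: " + (maxBytes / (1024 * 1024)) + " MB");
        }
    }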



On Fri, Apr 26, 2013 at 4:30 AM, Svetoslav Marinov 
svetoslav.mari...@findwise.com wrote:

 I use the API. Can one specify the memory size via the command line? I
 think the default there is 1024M? With 8G it died during "Computing event
 counts...", with 16G during indexing, after:
 Computing event counts... done. 50153300 events
 Indexing…

 Svetoslav





Re: Size of training data

2013-04-26 Thread Jörn Kottmann

I always edit the opennlp script and change it to what I need.

Anyway, we have a Two Pass Data Indexer which writes the features to disk
to save memory during indexing. Depending on how you train, you might have
a cutoff of 5, which probably eliminates a lot of your features and therefore
saves a lot of memory.

The indexing might just need a bit of time; how long did you wait?

Jörn
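
For reference, the cutoff Jörn mentions is an ordinary training parameter. A minimal
sketch of setting it through the 1.5.x-era API (names as I recall them; check your
version) might look like this:

    import opennlp.tools.util.TrainingParameters;

    // Sketch only: a cutoff of 5 drops features seen fewer than 5 times,
    // which can eliminate many features and save memory during indexing.
    public class CutoffParams {
        public static TrainingParameters create() {
            TrainingParameters params = new TrainingParameters();
            params.put(TrainingParameters.ITERATIONS_PARAM, "100");
            params.put(TrainingParameters.CUTOFF_PARAM, "5");
            return params;
        }
    }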







Re: Size of training data

2013-04-26 Thread Svetoslav Marinov
I prefer the API as it gives me more flexibility and fits the overall
architecture of our components. But here is part of my set-up:

Cutoff 6
Iterations 200
CustomFeatureGenerator that looks at the 4 previous and 2 subsequent tokens

So, I gave it a whole night and the process was dead in the morning.
But I'll give it another try and let you know.

Thank you!

Svetoslav
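
A set-up like the one Svetoslav describes might be wired together roughly as follows.
This is only a sketch against the 1.5.x-style API; the file name is a placeholder, and
the window-based generator stands in for his CustomFeatureGenerator, which is not
shown in the thread:

    import java.io.FileInputStream;
    import java.util.Collections;

    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.namefind.NameSample;
    import opennlp.tools.namefind.NameSampleDataStream;
    import opennlp.tools.namefind.TokenNameFinderModel;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;
    import opennlp.tools.util.TrainingParameters;
    import opennlp.tools.util.featuregen.AdaptiveFeatureGenerator;
    import opennlp.tools.util.featuregen.CachedFeatureGenerator;
    import opennlp.tools.util.featuregen.TokenFeatureGenerator;
    import opennlp.tools.util.featuregen.WindowFeatureGenerator;

    public class NerTraining {
        public static TokenNameFinderModel train() throws Exception {
            // Placeholder training file in the OpenNLP name finder format.
            ObjectStream<NameSample> samples = new NameSampleDataStream(
                new PlainTextByLineStream(new FileInputStream("train.txt"), "UTF-8"));

            TrainingParameters params = new TrainingParameters();
            params.put(TrainingParameters.CUTOFF_PARAM, "6");
            params.put(TrainingParameters.ITERATIONS_PARAM, "200");

            // Stand-in for the CustomFeatureGenerator: token features over a window
            // of 4 previous and 2 subsequent tokens around the current token.
            AdaptiveFeatureGenerator featureGen = new CachedFeatureGenerator(
                new WindowFeatureGenerator(new TokenFeatureGenerator(), 4, 2));

            return NameFinderME.train("en", "default", samples, params,
                featureGen, Collections.<String, Object>emptyMap());
        }
    }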







Re: Size of training data

2013-04-26 Thread Jörn Kottmann

The Two Pass Data Indexer is the default; if you have a machine with enough
memory you might want to try the One Pass Data Indexer.
Anyway, it would be nice to get a jstack to see where it is spending its time;
maybe there is an I/O issue?

The training can take very long, but the data indexing should work.

To change the indexer you can set this parameter:
DataIndexer=OnePass

HTH,
Jörn
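
When training via the API, the same parameter can be supplied through
TrainingParameters. A sketch, with the key passed as a plain string (the thread only
gives the name/value pair DataIndexer=OnePass; the class and method here are made up
for illustration):

    import opennlp.tools.util.TrainingParameters;

    public class IndexerChoice {
        // "TwoPass" is the default and writes features to disk during indexing;
        // "OnePass" keeps everything in memory and needs a machine with enough RAM.
        public static TrainingParameters onePassParams() {
            TrainingParameters params = new TrainingParameters();
            params.put("DataIndexer", "OnePass");
            return params;
        }
    }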
