Re: Size of training data

2013-04-29 Thread Svetoslav Marinov
Hi again, 

Below is a jstack output. It is now the third day it has been running, and it
seems the process has hung somewhere. I still haven't changed the
indexer to one pass, so it is still two pass.

I just wonder how long I should wait?

Thanks!

Svetoslav

--

Indexing events using cutoff of 6

Computing event counts...  2013-04-26 14:37:22
Full thread dump OpenJDK 64-Bit Server VM (20.0-b12 mixed mode):

Low Memory Detector daemon prio=10 tid=0x7f31d009d800 nid=0xe272
runnable [0x]
   java.lang.Thread.State: RUNNABLE

C2 CompilerThread1 daemon prio=10 tid=0x7f31d009b000 nid=0xe271
waiting on condition [0x]
   java.lang.Thread.State: RUNNABLE

C2 CompilerThread0 daemon prio=10 tid=0x7f31d0098800 nid=0xe270
waiting on condition [0x]
   java.lang.Thread.State: RUNNABLE

Signal Dispatcher daemon prio=10 tid=0x7f31d008a000 nid=0xe26f
waiting on condition [0x]
   java.lang.Thread.State: RUNNABLE

Finalizer daemon prio=10 tid=0x7f31d0078000 nid=0xe26e in
Object.wait() [0x7f31ca3db000]
   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on 0x000400b94808 (a
java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:133)
- locked 0x000400b94808 (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:149)
at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:177)

Reference Handler daemon prio=10 tid=0x7f31d0076000 nid=0xe26d in
Object.wait() [0x7f31ca4dc000]
   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on 0x000400b947a0 (a java.lang.ref.Reference$Lock)
at java.lang.Object.wait(Object.java:502)
at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:133)
- locked 0x000400b947a0 (a java.lang.ref.Reference$Lock)

main prio=10 tid=0x7f31d0007800 nid=0xe267 runnable
[0x7f31d8923000]
   java.lang.Thread.State: RUNNABLE
at java.nio.ByteBuffer.wrap(ByteBuffer.java:367)
at java.nio.ByteBuffer.wrap(ByteBuffer.java:390)
at java.lang.StringCoding$StringEncoder.encode(StringCoding.java:254)
at java.lang.StringCoding.encode(StringCoding.java:289)
at java.lang.String.getBytes(String.java:954)
at opennlp.model.HashSumEventStream.next(HashSumEventStream.java:55)
at opennlp.model.TwoPassDataIndexer.computeEventCounts(TwoPassDataIndexer.java:127)
at opennlp.model.TwoPassDataIndexer.init(TwoPassDataIndexer.java:81)
at opennlp.model.TrainUtil.train(TrainUtil.java:173)
at opennlp.tools.namefind.NameFinderME.train(NameFinderME.java:366)
at opennlptrainer.OpenNLPTrainer.main(OpenNLPTrainer.java:53)

VM Thread prio=10 tid=0x7f31d0071000 nid=0xe26c runnable

GC task thread#0 (ParallelGC) prio=10 tid=0x7f31d0012800 nid=0xe268
runnable 

GC task thread#1 (ParallelGC) prio=10 tid=0x7f31d0014800 nid=0xe269
runnable 

GC task thread#2 (ParallelGC) prio=10 tid=0x7f31d0016000 nid=0xe26a
runnable 

GC task thread#3 (ParallelGC) prio=10 tid=0x7f31d0018000 nid=0xe26b
runnable 

VM Periodic Task Thread prio=10 tid=0x7f31d00a nid=0xe273
waiting on condition

JNI global references: 1139

Heap
 PSYoungGen  total 2581440K, used 2388216K [0x0006aaab,
0x00074b59, 0x0008)
  eden space 2530304K, 94% used
[0x0006aaab,0x00073c6ee120,0x0007451b)
  from space 51136K, 0% used
[0x0007451b,0x0007451b,0x0007483a)
  to   space 48512K, 0% used
[0x00074863,0x00074863,0x00074b59)
 PSOldGen    total 167168K, used 167167K [0x0004,
0x00040a34, 0x0006aaab)
  object space 167168K, 99% used
[0x0004,0x00040a33fff0,0x00040a34)
 PSPermGen   total 21248K, used 4039K [0x0003f5a0,
0x0003f6ec, 0x0004)
  object space 21248K, 19% used
[0x0003f5a0,0x0003f5df1fe8,0x0003f6ec)

2013-04-26 14:39:09
Full thread dump OpenJDK 64-Bit Server VM (20.0-b12 mixed mode):


Low Memory Detector daemon prio=10 tid=0x7f31d009d800 nid=0xe272
runnable [0x]
   java.lang.Thread.State: RUNNABLE

C2 CompilerThread1 daemon prio=10 tid=0x7f31d009b000 nid=0xe271
waiting on condition [0x]
   java.lang.Thread.State: RUNNABLE

C2 CompilerThread0 daemon prio=10 tid=0x7f31d0098800 nid=0xe270
waiting on condition [0x]
   java.lang.Thread.State: RUNNABLE

Signal Dispatcher daemon prio=10 tid=0x7f31d008a000 nid=0xe26f
waiting on condition [0x]
   java.lang.Thread.State: RUNNABLE

Finalizer daemon prio=10 tid=0x7f31d0078000 nid=0xe26e in
Object.wait() 

Size of training data

2013-04-26 Thread Svetoslav Marinov
Hi all,

I'm wondering what is the max size (if such exists) for training a NER model? I 
have a corpus of 2 600 000 sentences annotated with just one category, 310M in 
size. However, the training never finishes – 8G of memory resulted in a Java
out-of-memory exception, and when I increased it to 16G it just died with no
error message.

Any help about this would be highly appreciated.

Svetoslav



Re: Size of training data

2013-04-26 Thread Jörn Kottmann

On 04/26/2013 09:06 AM, Svetoslav Marinov wrote:

I'm wondering what is the max size (if such exists) for training a NER model? I 
have a corpus of 2 600 000 sentences annotated with just one category, 310M in 
size. However, the training never finishes – 8G memory resulted in java out of 
memory exception, and when I increased it to 16G it just died with no error 
message.


Do you use the command line interface or the API for the training?
At which stage of the training did you get the out of memory exception?
Where did it just die when you used 16G of memory (maybe do a jstack) ?

Jörn


Re: Size of training data

2013-04-26 Thread William Colen
From command line you can specify memory using

MAVEN_OPTS=-Xmx4048m

You can also set it as JVM arguments if you are using from the API:

java -Xmx4048m ...



On Fri, Apr 26, 2013 at 4:30 AM, Svetoslav Marinov 
svetoslav.mari...@findwise.com wrote:

 I use the API. Can one specify the memory size via the command line? I
 think the default there is 1024M? At 8G memory during computing event
 counts..., at 16G during indexing: Computing event counts...  done.
 50153300 events
 Indexing…

 Svetoslav

 On 2013-04-26 09:12, Jörn Kottmann kottm...@gmail.com wrote:

 On 04/26/2013 09:06 AM, Svetoslav Marinov wrote:
  I'm wondering what is the max size (if such exists) for training a NER
 model? I have a corpus of 2 600 000 sentences annotated with just one
 category, 310M in size. However, the training never finishes – 8G memory
 resulted in java out of memory exception, and when I increased it to 16G
 it just died with no error message.
 
 Do you use the command line interface or the API for the training?
 At which stage of the training did you get the out of memory exception?
 Where did it just die when you used 16G of memory (maybe do a jstack) ?
 
 Jörn
 





Re: Size of training data

2013-04-26 Thread Jörn Kottmann

I always edit the opennlp script and change it to what I need.

Anyway, we have a Two Pass Data Indexer which writes the features to disk
to save memory during indexing. Depending on how you train, you might
have a cutoff=5, which probably eliminates a lot of your features and
therefore saves a lot of memory.

The indexing might just need a bit of time, how long did you wait?

Jörn

On 04/26/2013 12:33 PM, William Colen wrote:

 From command line you can specify memory using

MAVEN_OPTS=-Xmx4048m

You can also set it as JVM arguments if you are using from the API:

java -Xmx4048m ...



On Fri, Apr 26, 2013 at 4:30 AM, Svetoslav Marinov 
svetoslav.mari...@findwise.com wrote:


I use the API. Can one specify the memory size via the command line? I
think the default there is 1024M? At 8G memory during computing event
counts..., at 16G during indexing: Computing event counts...  done.
50153300 events
 Indexing…

Svetoslav

On 2013-04-26 09:12, Jörn Kottmann kottm...@gmail.com wrote:


On 04/26/2013 09:06 AM, Svetoslav Marinov wrote:

I'm wondering what is the max size (if such exists) for training a NER
model? I have a corpus of 2 600 000 sentences annotated with just one
category, 310M in size. However, the training never finishes – 8G memory
resulted in java out of memory exception, and when I increased it to 16G
it just died with no error message.

Do you use the command line interface or the API for the training?
At which stage of the training did you get the out of memory exception?
Where did it just die when you used 16G of memory (maybe do a jstack) ?

Jörn








Re: Size of training data

2013-04-26 Thread Svetoslav Marinov
I prefer the API as it gives me more flexibility and fits the overall
architecture of our components. But here is part of my set-up:

Cutoff 6
Iterations 200
A CustomFeatureGenerator looking at the 4 previous and 2 subsequent
tokens.

So, I gave it a whole night and I saw the process was dead in the morning.
But I'll give it another try and will let you know.

Thank you!

Svetoslav
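
[Editor's note: for readers following along, the set-up above would look roughly
like the sketch below against the 1.5.x-era API. "samples" (an
ObjectStream<NameSample> over the training corpus) and "customFeatureGenerator"
are placeholder names, not taken from this thread, and the exact train(...)
signature differs between OpenNLP versions.]

```
import java.util.Collections;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.TrainingParameters;

// Sketch only: mirrors the cutoff/iterations mentioned in this message.
TrainingParameters params = new TrainingParameters();
params.put(TrainingParameters.ALGORITHM_PARAM, "MAXENT");
params.put(TrainingParameters.ITERATIONS_PARAM, "200");
params.put(TrainingParameters.CUTOFF_PARAM, "6");

TokenNameFinderModel model = NameFinderME.train(
    "en", "default", samples, params,
    customFeatureGenerator,  // window: 4 previous and 2 subsequent tokens
    Collections.<String, Object>emptyMap());
```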


On 2013-04-26 12:42, Jörn Kottmann kottm...@gmail.com wrote:

I always edit the opennlp script and change it to what I need.

Anyway, we have a Two Pass Data Indexer which writes the features to disk
to save memory during indexing, depending on how you train you might
have a cutoff=5 which eliminates probably a lot of your features and
therefore
saves a lot of memory.

The indexing might just need a bit of time, how long did you wait?

Jörn

On 04/26/2013 12:33 PM, William Colen wrote:
  From command line you can specify memory using

 MAVEN_OPTS=-Xmx4048m

 You can also set it as JVM arguments if you are using from the API:

 java -Xmx4048m ...



 On Fri, Apr 26, 2013 at 4:30 AM, Svetoslav Marinov 
 svetoslav.mari...@findwise.com wrote:

 I use the API. Can one specify the memory size via the command line? I
 think the default there is 1024M? At 8G memory during computing event
 counts..., at 16G during indexing: Computing event counts...  done.
 50153300 events
  Indexing…

 Svetoslav

 On 2013-04-26 09:12, Jörn Kottmann kottm...@gmail.com wrote:

 On 04/26/2013 09:06 AM, Svetoslav Marinov wrote:
 I'm wondering what is the max size (if such exists) for training a
NER
 model? I have a corpus of 2 600 000 sentences annotated with just one
 category, 310M in size. However, the training never finishes – 8G
memory
 resulted in java out of memory exception, and when I increased it to
16G
 it just died with no error message.
 Do you use the command line interface or the API for the training?
 At which stage of the training did you get the out of memory
exception?
 Where did it just die when you used 16G of memory (maybe do a jstack)
?

 Jörn








Re: Size of training data

2013-04-26 Thread Jörn Kottmann

The Two Pass Data Indexer is the default; if you have a machine with enough
memory you might want to try the One Pass Data Indexer.
Anyway, it would be nice to get a jstack to see where it is spending its time.
Maybe there is an I/O issue?

The training can take very long, but the data indexing should work.

To change the indexer you can set this parameter:
DataIndexer=OnePass
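
[Editor's note: these training options can be collected in a plain parameters
file (the CLI trainer accepts one via -params in the 1.5.x releases, and the
same keys can be set through TrainingParameters in the API). A sketch using the
1.5.x-era key names, with the cutoff and iteration values from earlier in the
thread:]

```
# Sketch of a training parameters file; keys are the 1.5.x-era names.
Algorithm=MAXENT
Iterations=200
Cutoff=6
# Switch from the default TwoPass indexer:
DataIndexer=OnePass
```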

HTH,
Jörn

On 04/26/2013 01:17 PM, Svetoslav Marinov wrote:

I prefer the API as it gives me more flexibility and fits the overall
architecture of our components. But here is part of my set-up:

Cutoff 6
Iterations 200
CustomFeatureGenerator with looking at the 4 previous and 2 subsequent
tokens.

So, I gave it a whole night and I saw the process was dead in the morning.
But I'll give it another try and will let you know.

Thank you!

Svetoslav


On 2013-04-26 12:42, Jörn Kottmann kottm...@gmail.com wrote:


I always edit the opennlp script and change it to what I need.

Anyway, we have a Two Pass Data Indexer which writes the features to disk
to save memory during indexing, depending on how you train you might
have a cutoff=5 which eliminates probably a lot of your features and
therefore
saves a lot of memory.

The indexing might just need a bit of time, how long did you wait?

Jörn

On 04/26/2013 12:33 PM, William Colen wrote:

  From command line you can specify memory using

MAVEN_OPTS=-Xmx4048m

You can also set it as JVM arguments if you are using from the API:

java -Xmx4048m ...



On Fri, Apr 26, 2013 at 4:30 AM, Svetoslav Marinov 
svetoslav.mari...@findwise.com wrote:


I use the API. Can one specify the memory size via the command line? I
think the default there is 1024M? At 8G memory during computing event
counts..., at 16G during indexing: Computing event counts...  done.
50153300 events
  Indexing…

Svetoslav

On 2013-04-26 09:12, Jörn Kottmann kottm...@gmail.com wrote:


On 04/26/2013 09:06 AM, Svetoslav Marinov wrote:

I'm wondering what is the max size (if such exists) for training a
NER
model? I have a corpus of 2 600 000 sentences annotated with just one
 category, 310M in size. However, the training never finishes – 8G
memory
resulted in java out of memory exception, and when I increased it to
16G
it just died with no error message.

Do you use the command line interface or the API for the training?
At which stage of the training did you get the out of memory
exception?
Where did it just die when you used 16G of memory (maybe do a jstack)
?

Jörn