Re: Size of training data
Hi again,

Below is a jstack output. It is now the third day it is running and it seems the process has hung up somewhere. I still haven't changed the indexer to be one pass, so it is still two pass. I just wonder how long I should wait?

Thanks!

Svetoslav

--

Indexing events using cutoff of 6

	Computing event counts...

2013-04-26 14:37:22
Full thread dump OpenJDK 64-Bit Server VM (20.0-b12 mixed mode):

"Low Memory Detector" daemon prio=10 tid=0x7f31d009d800 nid=0xe272 runnable [0x]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread1" daemon prio=10 tid=0x7f31d009b000 nid=0xe271 waiting on condition [0x]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread0" daemon prio=10 tid=0x7f31d0098800 nid=0xe270 waiting on condition [0x]
   java.lang.Thread.State: RUNNABLE

"Signal Dispatcher" daemon prio=10 tid=0x7f31d008a000 nid=0xe26f waiting on condition [0x]
   java.lang.Thread.State: RUNNABLE

"Finalizer" daemon prio=10 tid=0x7f31d0078000 nid=0xe26e in Object.wait() [0x7f31ca3db000]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	- waiting on <0x000400b94808> (a java.lang.ref.ReferenceQueue$Lock)
	at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:133)
	- locked <0x000400b94808> (a java.lang.ref.ReferenceQueue$Lock)
	at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:149)
	at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:177)

"Reference Handler" daemon prio=10 tid=0x7f31d0076000 nid=0xe26d in Object.wait() [0x7f31ca4dc000]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	- waiting on <0x000400b947a0> (a java.lang.ref.Reference$Lock)
	at java.lang.Object.wait(Object.java:502)
	at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:133)
	- locked <0x000400b947a0> (a java.lang.ref.Reference$Lock)

"main" prio=10 tid=0x7f31d0007800 nid=0xe267 runnable [0x7f31d8923000]
   java.lang.Thread.State: RUNNABLE
	at java.nio.ByteBuffer.wrap(ByteBuffer.java:367)
	at java.nio.ByteBuffer.wrap(ByteBuffer.java:390)
	at java.lang.StringCoding$StringEncoder.encode(StringCoding.java:254)
	at java.lang.StringCoding.encode(StringCoding.java:289)
	at java.lang.String.getBytes(String.java:954)
	at opennlp.model.HashSumEventStream.next(HashSumEventStream.java:55)
	at opennlp.model.TwoPassDataIndexer.computeEventCounts(TwoPassDataIndexer.java:127)
	at opennlp.model.TwoPassDataIndexer.<init>(TwoPassDataIndexer.java:81)
	at opennlp.model.TrainUtil.train(TrainUtil.java:173)
	at opennlp.tools.namefind.NameFinderME.train(NameFinderME.java:366)
	at opennlptrainer.OpenNLPTrainer.main(OpenNLPTrainer.java:53)

"VM Thread" prio=10 tid=0x7f31d0071000 nid=0xe26c runnable

"GC task thread#0 (ParallelGC)" prio=10 tid=0x7f31d0012800 nid=0xe268 runnable
"GC task thread#1 (ParallelGC)" prio=10 tid=0x7f31d0014800 nid=0xe269 runnable
"GC task thread#2 (ParallelGC)" prio=10 tid=0x7f31d0016000 nid=0xe26a runnable
"GC task thread#3 (ParallelGC)" prio=10 tid=0x7f31d0018000 nid=0xe26b runnable

"VM Periodic Task Thread" prio=10 tid=0x7f31d00a nid=0xe273 waiting on condition

JNI global references: 1139

Heap
 PSYoungGen      total 2581440K, used 2388216K [0x0006aaab, 0x00074b59, 0x0008)
  eden space 2530304K, 94% used [0x0006aaab,0x00073c6ee120,0x0007451b)
  from space 51136K, 0% used [0x0007451b,0x0007451b,0x0007483a)
  to   space 48512K, 0% used [0x00074863,0x00074863,0x00074b59)
 PSOldGen        total 167168K, used 167167K [0x0004, 0x00040a34, 0x0006aaab)
  object space 167168K, 99% used [0x0004,0x00040a33fff0,0x00040a34)
 PSPermGen       total 21248K, used 4039K [0x0003f5a0, 0x0003f6ec, 0x0004)
  object space 21248K, 19% used [0x0003f5a0,0x0003f5df1fe8,0x0003f6ec)

2013-04-26 14:39:09
Full thread dump OpenJDK 64-Bit Server VM (20.0-b12 mixed mode):

"Low Memory Detector" daemon prio=10 tid=0x7f31d009d800 nid=0xe272 runnable [0x]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread1" daemon prio=10 tid=0x7f31d009b000 nid=0xe271 waiting on condition [0x]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread0" daemon prio=10 tid=0x7f31d0098800 nid=0xe270 waiting on condition [0x]
   java.lang.Thread.State: RUNNABLE

"Signal Dispatcher" daemon prio=10 tid=0x7f31d008a000 nid=0xe26f waiting on condition [0x]
   java.lang.Thread.State: RUNNABLE

"Finalizer" daemon prio=10 tid=0x7f31d0078000 nid=0xe26e in Object.wait()
Size of training data
Hi all,

I'm wondering what the maximum size (if such a limit exists) of the training data for a NER model is. I have a corpus of 2,600,000 sentences annotated with just one category, 310 MB in size. However, the training never finishes: with 8 GB of memory it resulted in a Java out-of-memory exception, and when I increased it to 16 GB it just died with no error message.

Any help with this would be highly appreciated.

Svetoslav
Re: Size of training data
On 04/26/2013 09:06 AM, Svetoslav Marinov wrote:
> I'm wondering what the maximum size (if such a limit exists) of the
> training data for a NER model is. I have a corpus of 2,600,000 sentences
> annotated with just one category, 310 MB in size. However, the training
> never finishes: 8 GB of memory resulted in a Java out-of-memory
> exception, and when I increased it to 16 GB it just died with no error
> message.

Do you use the command line interface or the API for the training? At which stage of the training did you get the out-of-memory exception? Where did it die when you used 16 GB of memory (maybe do a jstack)?

Jörn
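The jstack that Jörn suggests can be taken with the standard JDK tools while the trainer is running; a minimal sketch (the PID and output file name are placeholders):

```shell
# List running JVMs with their main classes to find the trainer's PID
jps -l

# Dump all thread stacks of that JVM to a file (12345 stands in for the real PID)
jstack 12345 > trainer-jstack.txt
```

Repeating the dump a minute or two apart, as in the output later in this thread, shows whether the main thread is making progress or is stuck in the same frame.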
Re: Size of training data
From the command line you can specify the memory using MAVEN_OPTS=-Xmx4048m. You can also set it as a JVM argument if you are using the API:

java -Xmx4048m ...

On Fri, Apr 26, 2013 at 4:30 AM, Svetoslav Marinov <svetoslav.mari...@findwise.com> wrote:
> I use the API. Can one specify the memory size via the command line? I
> think the default there is 1024M?
>
> At 8 GB of memory it died during "Computing event counts...", at 16 GB
> during indexing:
>
> Computing event counts... done. 50153300 events
> Indexing…
>
> Svetoslav
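The two ways of raising the heap that William describes can be sketched as follows (the jar path and the main class from this thread's stack trace are used for illustration only):

```shell
# Option 1: when training via the opennlp/Maven wrapper scripts,
# raise the JVM heap through the environment before invoking them
export MAVEN_OPTS="-Xmx4048m"

# Option 2: when running your own trainer class against the API,
# pass -Xmx directly to the java launcher
java -Xmx4048m -cp opennlp-tools.jar:. opennlptrainer.OpenNLPTrainer
```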
Re: Size of training data
I always edit the opennlp script and change it to what I need.

Anyway, we have a Two Pass Data Indexer which writes the features to disk to save memory during indexing. Depending on how you train you might have a cutoff=5, which probably eliminates a lot of your features and therefore saves a lot of memory. The indexing might just need a bit of time; how long did you wait?

Jörn

On 04/26/2013 12:33 PM, William Colen wrote:
> From the command line you can specify the memory using
> MAVEN_OPTS=-Xmx4048m. You can also set it as a JVM argument if you are
> using the API:
>
> java -Xmx4048m ...
Re: Size of training data
I prefer the API as it gives me more flexibility and fits the overall architecture of our components. But here is part of my set-up:

Cutoff: 6
Iterations: 200
CustomFeatureGenerator looking at the 4 previous and 2 subsequent tokens.

So, I gave it a whole night and the process was dead in the morning. But I'll give it another try and will let you know.

Thank you!

Svetoslav

On 2013-04-26 12:42, Jörn Kottmann <kottm...@gmail.com> wrote:
> I always edit the opennlp script and change it to what I need. Anyway,
> we have a Two Pass Data Indexer which writes the features to disk to
> save memory during indexing. Depending on how you train you might have
> a cutoff=5, which probably eliminates a lot of your features and
> therefore saves a lot of memory. The indexing might just need a bit of
> time; how long did you wait?
>
> Jörn
Re: Size of training data
The Two Pass Data Indexer is the default; if you have a machine with enough memory you might want to try the One Pass Data Indexer. Anyway, it would be nice to get a jstack to see where it is spending its time; maybe there is an I/O issue? The training can take very long, but the data indexing should work.

To change the indexer you can set this parameter:
DataIndexer=OnePass

HTH,
Jörn

On 04/26/2013 01:17 PM, Svetoslav Marinov wrote:
> I prefer the API as it gives me more flexibility and fits the overall
> architecture of our components. But here is part of my set-up:
>
> Cutoff: 6
> Iterations: 200
> CustomFeatureGenerator looking at the 4 previous and 2 subsequent
> tokens.
>
> So, I gave it a whole night and the process was dead in the morning.
> But I'll give it another try and will let you know.
>
> Thank you!
>
> Svetoslav
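The DataIndexer=OnePass parameter Jörn mentions, together with the cutoff and iteration counts from this thread, can also be supplied as a training parameters file. A sketch for the command-line trainer, assuming an OpenNLP 1.5.x installation (file, data, and model names are illustrative):

```shell
# train-params.txt: key=value training parameters, read via -params
cat > train-params.txt <<'EOF'
Algorithm=MAXENT
Iterations=200
Cutoff=6
DataIndexer=OnePass
EOF

# Train the name finder with those parameters
# (train.txt / ner-model.bin are placeholder paths)
opennlp TokenNameFinderTrainer -lang en -params train-params.txt \
    -data train.txt -model ner-model.bin
```

When training through the API instead, the same keys can be set on a TrainingParameters object passed to NameFinderME.train.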