Hi again,
Below is a jstack output. It is now the third day it has been running, and
it seems the process has hung somewhere. I still haven't changed the
indexer to one pass, so it is still two pass.
I just wonder how long I should wait?
Thanks!
Svetoslav
------------------------------
Indexing events using cutoff of 6
Computing event counts... 2013-04-26 14:37:22
Full thread dump OpenJDK 64-Bit Server VM (20.0-b12 mixed mode):
"Low Memory Detector" daemon prio=10 tid=0x00007f31d009d800 nid=0xe272
runnable [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"C2 CompilerThread1" daemon prio=10 tid=0x00007f31d009b000 nid=0xe271
waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"C2 CompilerThread0" daemon prio=10 tid=0x00007f31d0098800 nid=0xe270
waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"Signal Dispatcher" daemon prio=10 tid=0x00007f31d008a000 nid=0xe26f
waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"Finalizer" daemon prio=10 tid=0x00007f31d0078000 nid=0xe26e in
Object.wait() [0x00007f31ca3db000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x0000000400b94808> (a
java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:133)
- locked <0x0000000400b94808> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:149)
at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:177)
"Reference Handler" daemon prio=10 tid=0x00007f31d0076000 nid=0xe26d in
Object.wait() [0x00007f31ca4dc000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x0000000400b947a0> (a java.lang.ref.Reference$Lock)
at java.lang.Object.wait(Object.java:502)
at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:133)
- locked <0x0000000400b947a0> (a java.lang.ref.Reference$Lock)
"main" prio=10 tid=0x00007f31d0007800 nid=0xe267 runnable
[0x00007f31d8923000]
java.lang.Thread.State: RUNNABLE
at java.nio.ByteBuffer.wrap(ByteBuffer.java:367)
at java.nio.ByteBuffer.wrap(ByteBuffer.java:390)
at java.lang.StringCoding$StringEncoder.encode(StringCoding.java:254)
at java.lang.StringCoding.encode(StringCoding.java:289)
at java.lang.String.getBytes(String.java:954)
at opennlp.model.HashSumEventStream.next(HashSumEventStream.java:55)
at opennlp.model.TwoPassDataIndexer.computeEventCounts(TwoPassDataIndexer.java:127)
at opennlp.model.TwoPassDataIndexer.<init>(TwoPassDataIndexer.java:81)
at opennlp.model.TrainUtil.train(TrainUtil.java:173)
at opennlp.tools.namefind.NameFinderME.train(NameFinderME.java:366)
at opennlptrainer.OpenNLPTrainer.main(OpenNLPTrainer.java:53)
"VM Thread" prio=10 tid=0x00007f31d0071000 nid=0xe26c runnable
"GC task thread#0 (ParallelGC)" prio=10 tid=0x00007f31d0012800 nid=0xe268
runnable
"GC task thread#1 (ParallelGC)" prio=10 tid=0x00007f31d0014800 nid=0xe269
runnable
"GC task thread#2 (ParallelGC)" prio=10 tid=0x00007f31d0016000 nid=0xe26a
runnable
"GC task thread#3 (ParallelGC)" prio=10 tid=0x00007f31d0018000 nid=0xe26b
runnable
"VM Periodic Task Thread" prio=10 tid=0x00007f31d00a0000 nid=0xe273
waiting on condition
JNI global references: 1139
Heap
PSYoungGen total 2581440K, used 2388216K [0x00000006aaab0000,
0x000000074b590000, 0x0000000800000000)
eden space 2530304K, 94% used
[0x00000006aaab0000,0x000000073c6ee120,0x00000007451b0000)
from space 51136K, 0% used
[0x00000007451b0000,0x00000007451b0000,0x00000007483a0000)
to space 48512K, 0% used
[0x0000000748630000,0x0000000748630000,0x000000074b590000)
PSOldGen total 167168K, used 167167K [0x0000000400000000,
0x000000040a340000, 0x00000006aaab0000)
object space 167168K, 99% used
[0x0000000400000000,0x000000040a33fff0,0x000000040a340000)
PSPermGen total 21248K, used 4039K [0x00000003f5a00000,
0x00000003f6ec0000, 0x0000000400000000)
object space 21248K, 19% used
[0x00000003f5a00000,0x00000003f5df1fe8,0x00000003f6ec0000)
2013-04-26 14:39:09
Full thread dump OpenJDK 64-Bit Server VM (20.0-b12 mixed mode):
"Low Memory Detector" daemon prio=10 tid=0x00007f31d009d800 nid=0xe272
runnable [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"C2 CompilerThread1" daemon prio=10 tid=0x00007f31d009b000 nid=0xe271
waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"C2 CompilerThread0" daemon prio=10 tid=0x00007f31d0098800 nid=0xe270
waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"Signal Dispatcher" daemon prio=10 tid=0x00007f31d008a000 nid=0xe26f
waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"Finalizer" daemon prio=10 tid=0x00007f31d0078000 nid=0xe26e in
Object.wait() [0x00007f31ca3db000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x0000000400b94808> (a
java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:133)
- locked <0x0000000400b94808> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:149)
at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:177)
"Reference Handler" daemon prio=10 tid=0x00007f31d0076000 nid=0xe26d in
Object.wait() [0x00007f31ca4dc000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x0000000400b947a0> (a java.lang.ref.Reference$Lock)
at java.lang.Object.wait(Object.java:502)
at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:133)
- locked <0x0000000400b947a0> (a java.lang.ref.Reference$Lock)
"main" prio=10 tid=0x00007f31d0007800 nid=0xe267 runnable
[0x00007f31d8923000]
java.lang.Thread.State: RUNNABLE
at java.util.Arrays.copyOfRange(Arrays.java:3221)
at java.lang.String.<init>(String.java:233)
at java.lang.StringBuilder.toString(StringBuilder.java:447)
at opennlp.tools.util.featuregen.TokenFeatureGenerator.createFeatures(TokenFeatureGenerator.java:41)
at opennlp.tools.util.featuregen.WindowFeatureGenerator.createFeatures(WindowFeatureGenerator.java:95)
at opennlp.tools.util.featuregen.AggregatedFeatureGenerator.createFeatures(AggregatedFeatureGenerator.java:79)
at opennlp.tools.util.featuregen.CachedFeatureGenerator.createFeatures(CachedFeatureGenerator.java:69)
at opennlp.tools.namefind.DefaultNameContextGenerator.getContext(DefaultNameContextGenerator.java:118)
at opennlp.tools.namefind.DefaultNameContextGenerator.getContext(DefaultNameContextGenerator.java:37)
at opennlp.tools.namefind.NameFinderEventStream.generateEvents(NameFinderEventStream.java:103)
at opennlp.tools.namefind.NameFinderEventStream.createEvents(NameFinderEventStream.java:126)
at opennlp.tools.namefind.NameFinderEventStream.createEvents(NameFinderEventStream.java:37)
at opennlp.tools.util.AbstractEventStream.hasNext(AbstractEventStream.java:71)
at opennlp.model.HashSumEventStream.hasNext(HashSumEventStream.java:47)
at opennlp.model.TwoPassDataIndexer.computeEventCounts(TwoPassDataIndexer.java:126)
at opennlp.model.TwoPassDataIndexer.<init>(TwoPassDataIndexer.java:81)
at opennlp.model.TrainUtil.train(TrainUtil.java:173)
at opennlp.tools.namefind.NameFinderME.train(NameFinderME.java:366)
at opennlptrainer.OpenNLPTrainer.main(OpenNLPTrainer.java:53)
"VM Thread" prio=10 tid=0x00007f31d0071000 nid=0xe26c runnable
"GC task thread#0 (ParallelGC)" prio=10 tid=0x00007f31d0012800 nid=0xe268
runnable
"GC task thread#1 (ParallelGC)" prio=10 tid=0x00007f31d0014800 nid=0xe269
runnable
"GC task thread#2 (ParallelGC)" prio=10 tid=0x00007f31d0016000 nid=0xe26a
runnable
"GC task thread#3 (ParallelGC)" prio=10 tid=0x00007f31d0018000 nid=0xe26b
runnable
"VM Periodic Task Thread" prio=10 tid=0x00007f31d00a0000 nid=0xe273
waiting on condition
JNI global references: 1139
Heap
PSYoungGen total 2581440K, used 2267572K [0x00000006aaab0000,
0x000000074b590000, 0x0000000800000000)
eden space 2530304K, 89% used
[0x00000006aaab0000,0x000000073511d138,0x00000007451b0000)
from space 51136K, 0% used
[0x00000007451b0000,0x00000007451b0000,0x00000007483a0000)
to space 48512K, 0% used
[0x0000000748630000,0x0000000748630000,0x000000074b590000)
PSOldGen total 167168K, used 167167K [0x0000000400000000,
0x000000040a340000, 0x00000006aaab0000)
object space 167168K, 99% used
[0x0000000400000000,0x000000040a33fff0,0x000000040a340000)
PSPermGen total 21248K, used 4039K [0x00000003f5a00000,
0x00000003f6ec0000, 0x0000000400000000)
object space 21248K, 19% used
[0x00000003f5a00000,0x00000003f5df1fe8,0x00000003f6ec0000)
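
In both dumps the main thread is RUNNABLE inside
TwoPassDataIndexer.computeEventCounts, but in different frames, and PSOldGen
is reported at 99% of its currently committed 167168K. One way to tell slow
progress from time lost to garbage collection is to log pool usage and
accumulated GC time from inside the trainer. The following is only a minimal
sketch using the standard java.lang.management API; the class name HeapCheck
and its helper methods are hypothetical and not part of OpenNLP.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper, not part of OpenNLP: prints heap pool usage and GC time.
public class HeapCheck {

    // Integer percentage of a pool in use, or -1 when the pool has no defined maximum.
    static long percentUsed(long used, long max) {
        return max <= 0 ? -1 : (used * 100) / max;
    }

    // Sum of accumulated collection time in milliseconds over all collectors.
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Maximum heap the JVM will try to use; reflects -Xmx when it is set.
        long maxMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("max heap MB: " + maxMb);

        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage u = pool.getUsage();
            if (u == null) {
                continue; // pool is no longer valid
            }
            System.out.println(pool.getName() + ": "
                    + percentUsed(u.getUsed(), u.getMax()) + "% of max used");
        }
        System.out.println("total GC ms: " + totalGcMillis());
    }
}
```

If these values were logged every few minutes, a total GC time that grows
nearly as fast as wall-clock time would suggest the run is GC-bound rather
than hung, while a "max heap MB" far below the intended size would suggest
the -Xmx setting did not reach the JVM.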
On 2013-04-26 13:41, "Jörn Kottmann" <[email protected]> wrote:
>The Two Pass Data Indexer is the default; if you have a machine with enough
>memory you might want to try the One Pass Data Indexer.
>Anyway, it would be nice to get a jstack to see where it is spending its
>time; maybe there is an I/O issue?
>
>The training can take very long, but the data indexing should work.
>
>To change the indexer you can set this parameter:
>DataIndexer=OnePass
>
>HTH,
>Jörn
>
>On 04/26/2013 01:17 PM, Svetoslav Marinov wrote:
>> I prefer the API as it gives me more flexibility and fits the overall
>> architecture of our components. But here is part of my set-up:
>>
>> Cutoff 6
>> Iterations 200
>> CustomFeatureGenerator with looking at the 4 previous and 2 subsequent
>> tokens.
>>
>> So, I gave it a whole night and I saw the process was dead in the morning.
>> But I'll give it another try and will let you know.
>>
>> Thank you!
>>
>> Svetoslav
>>
>>
>> On 2013-04-26 12:42, "Jörn Kottmann" <[email protected]> wrote:
>>
>>> I always edit the opennlp script and change it to what I need.
>>>
>>> Anyway, we have a Two Pass Data Indexer which writes the features to disk
>>> to save memory during indexing. Depending on how you train, you might
>>> have a cutoff=5, which probably eliminates a lot of your features and
>>> therefore saves a lot of memory.
>>>
>>> The indexing might just need a bit of time, how long did you wait?
>>>
>>> Jörn
>>>
>>> On 04/26/2013 12:33 PM, William Colen wrote:
>>>> From command line you can specify memory using
>>>>
>>>> MAVEN_OPTS="-Xmx4048m"
>>>>
>>>> You can also set it as a JVM argument if you are using it from the API:
>>>>
>>>> java -Xmx4048m ...
>>>>
>>>>
>>>>
>>>> On Fri, Apr 26, 2013 at 4:30 AM, Svetoslav Marinov <
>>>> [email protected]> wrote:
>>>>
>>>>> I use the API. Can one specify the memory size via the command line? I
>>>>> think the default there is 1024M? With 8G of memory it failed during
>>>>> "Computing event counts...", and with 16G during indexing:
>>>>> "Computing event counts... done. 50153300 events
>>>>> Indexing..."
>>>>>
>>>>> Svetoslav
>>>>>
>>>>> On 2013-04-26 09:12, "Jörn Kottmann" <[email protected]> wrote:
>>>>>
>>>>>> On 04/26/2013 09:06 AM, Svetoslav Marinov wrote:
>>>>>>> I'm wondering what is the max size (if such exists) for training a
>>>>>>> NER model? I have a corpus of 2 600 000 sentences annotated with just
>>>>>>> one category, 310M in size. However, the training never finishes: 8G
>>>>>>> of memory resulted in a Java out-of-memory exception, and when I
>>>>>>> increased it to 16G it just died with no error message.
>>>>>> Do you use the command line interface or the API for the training?
>>>>>> At which stage of the training did you get the out of memory exception?
>>>>>> Where did it just die when you used 16G of memory (maybe do a jstack)?
>>>>>>
>>>>>> Jörn
>>>>>>
>>>>>
>>>
>
>
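
The DataIndexer=OnePass setting quoted above can be kept together with the
cutoff and iteration values from this thread in a properties-style parameters
file. What follows is a minimal sketch using java.util.Properties only. The
class IndexerParams is hypothetical, and the key names Cutoff and Iterations
are an assumption about how OpenNLP spells these parameters (only
DataIndexer=OnePass appears in the thread). How the result is handed to the
trainer depends on the OpenNLP version in use.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.util.Properties;

// Hypothetical helper, not part of OpenNLP: assembles training parameters as key=value pairs.
public class IndexerParams {

    // Builds the parameter set; "Cutoff" and "Iterations" are assumed key names.
    static Properties build(int cutoff, int iterations, String indexer) {
        Properties p = new Properties();
        p.setProperty("Cutoff", Integer.toString(cutoff));
        p.setProperty("Iterations", Integer.toString(iterations));
        p.setProperty("DataIndexer", indexer); // "OnePass" as suggested in the thread
        return p;
    }

    // Renders the parameters in java.util.Properties text format.
    static String render(Properties p) {
        StringWriter out = new StringWriter();
        try {
            p.store(out, "training parameters (sketch)");
        } catch (IOException e) {
            // A StringWriter does not fail on write; rethrow defensively.
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Values taken from the set-up described in the thread: cutoff 6, 200 iterations.
        System.out.print(render(build(6, 200, "OnePass")));
    }
}
```

Keeping the values in one place makes it easy to switch between OnePass and
TwoPass and compare memory use without touching the training code.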