I see.

In my experience, 4 GB is too small.
I set -Xmx16G every time.
When you run more iterations, the memory doesn't seem to get released, for some reason.
Anyway, I think you should first find a machine that has at least 16 GB.
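
One way to pass the heap setting is to call the trainer through java directly.
Just a sketch; the jar version and the data/model file names below are
placeholders, and the trainer options are the ones from the documentation page
you linked:

    java -Xmx16g -cp opennlp-tools-1.5.3.jar opennlp.tools.cmdline.CLI \
        TokenNameFinderTrainer -lang en -data en-ner.train \
        -model en-ner.bin -encoding UTF-8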

You will actually notice the program slowing down, which means Java is busy
collecting garbage.
When that happens, stop the program and try another parameter setting immediately;
never wait another day.
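
If you want to be sure it really is garbage collection, you can add the standard
GC logging flag to the same java command (my suggestion, not something from the
OpenNLP docs) and watch for repeated full collections that free almost nothing:

    java -verbose:gc -Xmx16g ...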

Another thing I think you should try is to search the parameter from the lowest
to the highest value by bisection (binary search).
That way you can find the boundary in about log(n) tries (where n is the number
of possible parameter settings).
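
For example, suppose you are looking for the smallest heap between 4 GB and
16 GB that still works (just an illustration; the same idea applies to other
parameters such as iterations or cutoff): try -Xmx10g first; if it fails, try
13g; if that succeeds, try 11g or 12g, and so on. With 13 candidate values that
is about log2(13) ≈ 4 runs instead of 13.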

Anyway, 4 GB of memory is not enough for an NLP task like this.


Good luck.

Gao


On 2013/10/07, at 22:42, Jeffrey Zemerick <[email protected]> wrote:

> Gao,
> 
> I have about a 950 MB file created by Hadoop with sentences in the format
> described in the NameFinder training documentation (
> http://opennlp.apache.org/documentation/manual/opennlp.html#tools.namefind.training.tool).
> I'm running the jar as described on that page and I set the number of
> iterations to 50. (I read somewhere that was a suggested amount.) After the
> first failed attempt I increased the memory to 4096 but it failed again
> (just took longer to fail). I can increase the memory further but I wanted
> to see if there was anything that I was missing.
> 
> Thanks,
> Jeff
> 
> 
> 
> On Mon, Oct 7, 2013 at 9:29 AM, melo <[email protected]> wrote:
> 
>> Jeff,
>> 
>> Would you please tell us exactly what method you are using?
>> 
>> Are you calling the .jar file, or are you writing a new class to use the
>> model?
>> 
>> Honestly speaking, I don't think you should get involved with Hadoop.
>> It is meant to handle tremendously more data than your 1 GB.
>> By tremendous, I mean terabytes, maybe petabytes.
>> 
>> There is always a way.
>> Learning Hadoop is not so hard, but why bother?
>> 
>> Gao
>> 
>> On 2013/10/07, at 22:21, Mark G <[email protected]> wrote:
>> 
>>> Also, Map Reduce will allow you to write the annotated sentences to HDFS as
>>> part files, but at some point those files will have to be merged and the
>>> model created from them. In Map Reduce you may find that all your part
>>> files end up on the same reducer node and you end up with the same problem
>>> on a random data node.
>>> Seems like this would only work if you could append one MODEL with another
>>> without recalculation.
>>> 
>>> 
>>> On Mon, Oct 7, 2013 at 8:23 AM, Jörn Kottmann <[email protected]> wrote:
>>> 
>>>> On 10/07/2013 02:05 PM, Jeffrey Zemerick wrote:
>>>> 
>>>>> Thanks. I used MapReduce to build the training input. I didn't realize
>>>>> that
>>>>> the training can also be performed on Hadoop. Can I simply combine the
>>>>> generated models at the completion of the job?
>>>>> 
>>>> 
>>>> That will not be an out-of-the-box experience; you need to modify OpenNLP
>>>> to write the training events to a file and then use a trainer which can
>>>> run on Hadoop, e.g. Mahout. We now almost have support to integrate
>>>> 3rd party ML libraries into OpenNLP.
>>>> 
>>>> Jörn
>>>> 
>> 
>> 
