Sorry, forgot to say.

CUT YOUR DATA.
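
(A sketch of one way to do that, assuming GNU coreutils; file names are
placeholders:)

  # Randomly sample 100,000 sentences from the full training file
  shuf -n 100000 train.txt > train-sample.txt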


Good luck

Gao


On 2013/10/07, at 22:42, Jeffrey Zemerick <[email protected]> wrote:

> Gao,
> 
> I have about a 950 MB file created by Hadoop with sentences in the format
> described in the NameFinder training documentation (
> http://opennlp.apache.org/documentation/manual/opennlp.html#tools.namefind.training.tool).
> I'm running the jar as described on that page and I set the number of
> iterations to 50. (I read somewhere that was a suggested amount.) After the
> first failed attempt I increased the memory to 4096 MB, but it failed again
> (it just took longer to fail). I can increase the memory further, but I
> wanted to see if there was anything I was missing.
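> 
> For reference, here is roughly the invocation (a sketch rather than my
> exact command; the jar version and file names are placeholders, assuming
> OpenNLP 1.5.x):
> 
>   # Raise the JVM heap and run the CLI trainer directly
>   java -Xmx4096m -cp opennlp-tools-1.5.3.jar opennlp.tools.cmdline.CLI \
>     TokenNameFinderTrainer -lang en -encoding UTF-8 \
>     -params params.txt -data train.txt -model en-ner.bin
> 
>   # params.txt sets the iteration count:
>   #   Algorithm=MAXENT
>   #   Iterations=50
>   #   Cutoff=5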
> 
> Thanks,
> Jeff
> 
> 
> 
> On Mon, Oct 7, 2013 at 9:29 AM, melo <[email protected]> wrote:
> 
>> Jeff,
>> 
>> Would you please tell us exactly what method you are using?
>> 
>> Are you calling the .jar file, or are you writing a new class to use the
>> model?
>> 
>> Honestly speaking, I don't think you should get involved with Hadoop.
>> It is designed to handle tremendously more data than your 1 GB.
>> By tremendous, I mean terabytes, maybe petabytes.
>> 
>> There is always a way.
>> Learning Hadoop is not so hard, but why bother?
>> 
>> Gao
>> 
>> On 2013/10/07, at 22:21, Mark G <[email protected]> wrote:
>> 
>>> Also, Map Reduce will allow you to write the annotated sentences to HDFS
>>> as part files, but at some point those files will have to be merged and
>>> the model created from them. In Map Reduce you may find that all your
>>> part files end up on the same reducer node, and you end up with the same
>>> problem on a random data node.
>>> It seems like this would only work if you could append one model to
>>> another without recalculation.
>>> 
>>> 
>>> On Mon, Oct 7, 2013 at 8:23 AM, Jörn Kottmann <[email protected]> wrote:
>>> 
>>>> On 10/07/2013 02:05 PM, Jeffrey Zemerick wrote:
>>>> 
>>>>> Thanks. I used MapReduce to build the training input. I didn't realize
>>>>> that the training can also be performed on Hadoop. Can I simply combine
>>>>> the generated models at the completion of the job?
>>>> 
>>>> That will not be an out-of-the-box experience: you need to modify
>>>> OpenNLP to write the training events to a file and then use a trainer
>>>> that can run on Hadoop, e.g. Mahout. We now almost have support for
>>>> integrating third-party ML libraries into OpenNLP.
>>>> 
>>>> Jörn
>> 
>> 
