Hi,

Sorry, what file format are you talking about? Can you point me to an example 
of the Moses file format? Is this just plain text, one sentence per line?

In general, the Moses format is the standard, to the extent that there
are any standards in MT (they're mostly informal).
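
For reference, the Moses-format corpora I'm used to are just pairs of
line-aligned plain-text files, one sentence per line, e.g. (file names
here are purely illustrative):

    corpus.de:  Das ist ein Satz .
    corpus.en:  This is a sentence .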

matt

PS. Are you on dev@joshua, or do I need to keep CC'ing you at your address?


> On Jan 16, 2017, at 5:42 PM, Joern Kottmann <[email protected]> wrote:
> 
> Hello,
> 
> we came to the conclusion that it would make sense to add direct
> format support for LetsMT and Moses files.
> 
> Here are our two issues:
> https://issues.apache.org/jira/browse/OPENNLP-938
> https://issues.apache.org/jira/browse/OPENNLP-939
> 
> Does it make sense to you if we support those formats?
> Did we miss an important format?
> 
> The training works quite well, but it will take me a bit more time to
> get the evaluation to return something useful. The OpenNLP Sentence
> Detector can only split on end-of-sentence (EOS) characters, so a
> sentence without an EOS character gets counted as a mistake by the
> evaluation.
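> 
> For context, here is roughly what the training step looks like against
> the current OpenNLP API (just a sketch; the corpus file, language code,
> and model path are placeholders):
> 
>     import java.io.File;
>     import java.io.FileOutputStream;
>     import java.nio.charset.StandardCharsets;
>     import opennlp.tools.sentdetect.*;
>     import opennlp.tools.util.*;
> 
>     public class TrainSentDetect {
>         public static void main(String[] args) throws Exception {
>             // Training data: one sentence per line, UTF-8.
>             ObjectStream<String> lines = new PlainTextByLineStream(
>                     new MarkableFileInputStreamFactory(new File("corpus.txt")),
>                     StandardCharsets.UTF_8);
>             ObjectStream<SentenceSample> samples = new SentenceSampleStream(lines);
>             // Train a maxent sentence detector and write the model out.
>             SentenceModel model = SentenceDetectorME.train("en", samples,
>                     new SentenceDetectorFactory("en", true, null, null),
>                     TrainingParameters.defaultParams());
>             model.serialize(new FileOutputStream("en-sent.bin"));
>         }
>     }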
> 
> Is there a specific language that would be good for testing on your
> end?
> 
> The tokenizer can probably be trained as well; I saw a couple of
> tokenized data sets. Maybe that makes sense for you too.
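> 
> The analogous calls for the tokenizer would be a TokenSampleStream to
> read the data and TokenizerME.train(samples, new TokenizerFactory("en",
> null, false, null), TrainingParameters.defaultParams()) to build the
> model (again just a sketch, with placeholder arguments).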
> 
> Jörn
> 
> 
> 
> On Fri, 2017-01-13 at 09:48 -0500, Matt Post wrote:
>> Hi Jörn,
>> 
>> [Sent again without the picture since Apache rejects those,
>> unfortunately...]
>> 
>> You just need monolingual text, so I suggest downloading either the
>> tokenized or untokenized versions. Unfortunately, Opus doesn't make
>> it easy to provide direct links to individual languages. But do
>> this:
>> 
>> 1. Go to http://opus.lingfil.uu.se
>> 
>> 2. Choose de → en (or some other language pair)
>> 
>> 3. In the "mono" or "raw" columns (depending on whether you want
>> tokenized or untokenized text), click the language file for the
>> dataset you want.
>> 
>> matt
>> 
>> 
>>> On Jan 12, 2017, at 6:07 AM, Joern Kottmann <[email protected]>
>>> wrote:
>>> 
>>> Do you have a pointer to an actual file? Or download package?
>>> 
>>> Jörn
>>> 
>>> On Wed, Jan 11, 2017 at 11:33 AM, Tommaso Teofili
>>> <[email protected]> wrote:
>>>> I think the parallel corpora are taken from [1], so we could start
>>>> with training sentdetect for the language packs at [2].
>>>> 
>>>> Regards,
>>>> Tommaso
>>>> 
>>>> [1] : http://opus.lingfil.uu.se/
>>>> [2] : https://cwiki.apache.org/confluence/display/JOSHUA/Language+Packs
>>>> 
>>>> On Mon, Jan 9, 2017 at 11:39 AM, Joern Kottmann
>>>> <[email protected]> wrote:
>>>> 
>>>>> Sorry for the late reply. Can you point me to a link for the
>>>>> parallel corpus?
>>>>> We might just want to add format support for it to OpenNLP.
>>>>> 
>>>>> Do you use tokenize.pl for all languages, or do you have
>>>>> language-specific heuristics?
>>>>> It would be great to have an additional, more capable rule-based
>>>>> tokenizer in OpenNLP.
>>>>> 
>>>>> The sentence splitter can be trained on a few thousand sentences
>>>>> or so; I think that will work out nicely.
>>>>> 
>>>>> Jörn
>>>>> 
>>>>> On Wed, Dec 21, 2016 at 7:24 PM, Matt Post <[email protected]>
>>>>> wrote:
>>>>> 
>>>>>> On Dec 21, 2016, at 10:36 AM, Joern Kottmann <[email protected]> wrote:
>>>>>>> I am happy to help out a bit with this; we can also see if
>>>>>>> things in OpenNLP need to be changed to make this work
>>>>>>> smoothly.
>>>>>> 
>>>>>> Great!
>>>>>> 
>>>>>> 
>>>>>>> One challenge is to train OpenNLP on all the languages you
>>>>>>> support. Do you have training data that could be used to train
>>>>>>> the tokenizer and sentence detector?
>>>>>> 
>>>>>> For the sentence splitter, I imagine you could make use of the
>>>>>> source side of our parallel corpus, which has thousands to
>>>>>> millions of sentences, one per line.
>>>>>> 
>>>>>> For tokenization (and normalization), we don't typically train
>>>>>> models but instead use a set of manually developed heuristics,
>>>>>> which may or may not be language-specific. See
>>>>>> 
>>>>>>        https://github.com/apache/incubator-joshua/blob/master/scripts/preparation/tokenize.pl
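>>>>>> 
>>>>>> The heuristics there are mostly simple regex substitutions; in
>>>>>> spirit they look something like this (illustrative Java, not the
>>>>>> actual tokenize.pl rules):
>>>>>> 
>>>>>>        // Pad punctuation with spaces so it becomes its own token.
>>>>>>        static String roughTokenize(String s) {
>>>>>>            s = s.replaceAll("([,;:!?()\"])", " $1 ");
>>>>>>            s = s.replaceAll("(\\S)\\.\\s*$", "$1 ."); // detach final period
>>>>>>            return s.replaceAll("\\s+", " ").trim();   // collapse whitespace
>>>>>>        }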
>>>>>> 
>>>>>> How much training data do you generally need for each task?
>>>>>> 
>>>>>> 
>>>>>>> Jörn
>>>>>>> 
>> 
>> 
