Hi Jörn,
Thank you for your reply.
The POS Tagger looks like a good lead.
For training data, can’t we just use the split positions suggested in 4.1?
(I.e., use GIZA++ to produce word alignments between a source sentence
c = {c1, c2, ..., cn} and its counterpart target sentence
e = {e1, e2, ..., em}.
If we can find a consecutive source sequence c_i^j that is mapped to a
consecutive target sequence e_h^k, we regard c_i^j as a splittable segment.
For a position k, if we can find a splittable segment c_k^j (j > k)
or c_i^k (k > i), k is regarded as a split position; otherwise it is
not.)
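To make the rule above concrete, here is a rough Python sketch of how I read it. It assumes the GIZA++ output has already been parsed into a set of 0-based (source index, target index) pairs, and the function names are made up for illustration. The consistency check (no target word inside the matched target span may align to a source word outside the segment) is borrowed from standard phrase extraction and is my interpretation, not something 4.1 spells out.

```python
from typing import List, Set, Tuple

# Hypothetical representation: 0-based (source index, target index) pairs.
Alignment = Set[Tuple[int, int]]


def is_splittable(i: int, j: int, alignment: Alignment) -> bool:
    """Return True if source span [i, j] maps onto one consecutive target span."""
    targets = {t for s, t in alignment if i <= s <= j}
    if not targets:
        return False
    lo, hi = min(targets), max(targets)
    # Assumption: every alignment point landing in [lo, hi] must come
    # from inside the source span, otherwise the mapping is not clean.
    return all(i <= s <= j for s, t in alignment if lo <= t <= hi)


def split_positions(n: int, alignment: Alignment) -> List[int]:
    """Positions that start or end a splittable segment of at least two words."""
    positions = set()
    for i in range(n):
        for j in range(i + 1, n):
            if is_splittable(i, j, alignment):
                positions.update((i, j))
    return sorted(positions)


# Toy example: source word 1 aligns to target words 0 and 2, while
# source words 0 and 2 both align to target word 1.
example = {(0, 1), (1, 0), (1, 2), (2, 1)}
print(split_positions(3, example))
```

Note that the whole sentence is always a splittable segment under this reading, so the first and last positions always come out as split positions, and a fully monotone alignment marks every position. Some extra filtering is probably needed before the output is usable as training data.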
Warm Regards,
Karthika.
On 2/8/18, 11:29 PM, "Joern Kottmann" <[email protected]> wrote:
Hello,
reading 4.1 carefully, I think it is more or less just what the POS
Tagger we have does. By default it uses Maxent, and it can be
configured to use the token-window features as in section 4.1. It can
also be trained to predict only two tags; in the case of 4.1 these
would be SPLIT and NO_SPLIT. With Maxent it is often not so important
which features you pick; better features often only help to gain a bit
more accuracy.
The POS Tagger can be trained and evaluated via the command line
rather quickly and can give you an answer to how well this will work
from a splitting perspective.
The training format looks like this:
... he_NO_SPLIT thought_NO_SPLIT ,_SPLIT Jeff_NO_SPLIT signs_NO_SPLIT on_NO_SPLIT
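As a small illustration of producing lines in that format, the following Python sketch tags each token of a tokenized sentence given a list of split positions. The helper name and its parameters are hypothetical, the positions are assumed to be 0-based token indices, and the tag names simply follow the example above.

```python
from typing import Iterable, List


def to_training_line(
    tokens: List[str],
    split_positions: Iterable[int],
    split_tag: str = "SPLIT",
    no_split_tag: str = "NO_SPLIT",
) -> str:
    """Join tokens into one word_TAG training line (one sentence per line)."""
    splits = set(split_positions)
    return " ".join(
        f"{token}_{split_tag if index in splits else no_split_tag}"
        for index, token in enumerate(tokens)
    )


tokens = ["he", "thought", ",", "Jeff", "signs", "on"]
print(to_training_line(tokens, [2]))
# he_NO_SPLIT thought_NO_SPLIT ,_SPLIT Jeff_NO_SPLIT signs_NO_SPLIT on_NO_SPLIT
```

The tag names are parameters because it is worth checking against the documentation how the word_TAG reader handles a tag that itself contains an underscore, such as NO_SPLIT; if that turns out to be a problem, a tag like NOSPLIT can be swapped in without touching the rest.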
Have a look at our documentation, or ask here if you need more help [1].
The trickier part is probably compiling the training data for this.
HTH,
Jörn
[1] https://opennlp.apache.org/docs/1.8.4/manual/opennlp.html#tools.postagger.tagging
On Thu, Feb 8, 2018 at 3:19 PM, karthika nair <[email protected]>
wrote:
> Hello there,
>
>
>
> We use Machine Translation as one of our components for translations. We
> call AWS Translate downstream for short sentences and it performs decently
> well. However, being a neural MT system, it fails on longer sentences. Our
> metadata assets (long synopsis, short synopsis) are typically sentences
> of length ~40 words (or more!). AWS Translate often loses context, skips
> words and garbles meaning, resulting in poor translations.
>
>
>
> We are currently looking at segmenting sentences into phrases, getting
> those individual phrases translated, and concatenating them back (i.e.,
> implementing this paper
> <http://tcci.ccf.org.cn/conference/2016/papers/72.pdf>). However, the split
> model described is ambiguous about the feature it defines (specifically
> Equation 11 in Section 4.1). Has anyone here come across this problem, or
> does anyone know of other approaches we could try for translating long
> sentences?
>
>
>
> Here’s an example of a long sentence:
>
> When returning to his old law practice proves harder than he thought, Jeff
> signs on to help his longtime nemesis Alan Connor represent Marvin
> Humphries, a Greendale Community College-trained engineer who designed a
> bridge that collapsed. To keep the school from shredding the evidence of
> his client’s shoddy education, Alan asks Jeff to steal his records so he
> can use them in court.
>
>
>
> We’d like this broken into:
>
> 1. When returning to his old law practice proves harder than he thought,
> 2. Jeff signs on to help his longtime nemesis Alan Connor represent
> Marvin Humphries,
> 3. a Greendale Community College-trained engineer who designed a bridge
> that collapsed
>
> (We’ve verified that these clauses get translated correctly.)
>
>
>
> Thank you.
>
> Warm Regards,
>
> Karthika.