Hi Jörn,

Thank you for your reply.
The POS Tagger looks like a good lead.
For training data, can’t we just use the split positions suggested in 4.1?

(I.e., use GIZA++ to produce word alignments between a source sentence
c = {c1, c2, . . . , cn} and its counterpart target sentence
e = {e1, e2, . . . , em}.
If we can find a consecutive source sequence c_i^j that is mapped to a
consecutive target sequence e_h^k, we regard c_i^j as a splittable segment.
For a position k, if we can find a splittable segment c_k^j (j > k)
or c_i^k (k > i), k is regarded as a split position; otherwise it is
not.)
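
Something like the sketch below could generate the labels. This is a rough
Java draft on my side; the class and method names are made up, and I am
reading the splittable-segment test as the usual consistency check from
phrase extraction:

import java.util.List;
import java.util.Set;

/**
 * Derives split positions for a source sentence from GIZA++ word
 * alignments, following the reading of section 4.1 above.
 */
public class SplitPositionExtractor {

    /**
     * aligned.get(s) holds the target positions aligned to source
     * position s. Returns split[k] == true if some splittable
     * segment starts or ends at position k.
     */
    public static boolean[] splitPositions(List<Set<Integer>> aligned) {
        int n = aligned.size();
        boolean[] split = new boolean[n];
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) { // j > i, a real segment
                if (isSplittable(aligned, i, j)) {
                    split[i] = true; // a segment starts at i ...
                    split[j] = true; // ... and one ends at j
                }
            }
        }
        return split;
    }

    /**
     * c_i^j is splittable if its aligned target words span a contiguous
     * block [lo, hi] that no source word outside [i, j] aligns into.
     */
    private static boolean isSplittable(List<Set<Integer>> aligned,
                                        int i, int j) {
        int lo = Integer.MAX_VALUE;
        int hi = Integer.MIN_VALUE;
        for (int s = i; s <= j; s++) {
            for (int t : aligned.get(s)) {
                lo = Math.min(lo, t);
                hi = Math.max(hi, t);
            }
        }
        if (hi < lo) {
            return false; // nothing in the segment is aligned
        }
        for (int s = 0; s < aligned.size(); s++) {
            if (s < i || s > j) {
                for (int t : aligned.get(s)) {
                    if (t >= lo && t <= hi) {
                        return false; // an outside word reaches into the block
                    }
                }
            }
        }
        return true;
    }
}

Positions marked true would then get the SPLIT tag in the training format
you describe below, and all other positions NO_SPLIT.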

Warm Regards,
Karthika.

On 2/8/18, 11:29 PM, "Joern Kottmann" <[email protected]> wrote:

    Hello,

    reading 4.1 carefully I think it is more or less just what the POS
    Tagger we have does. It uses Maxent by default and it can be
    configured to use the token-window features as in section 4.1. It
    can also be trained to only predict two tags; in the case of 4.1
    those would be SPLIT and NO_SPLIT. With Maxent it is often not so
    important which features you pick; better features often only gain
    a bit more accuracy.
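
    A minimal sketch against the 1.8.x training API could look like the
    following (file names are placeholders, and please double check the
    signatures against the Javadoc):

       import java.io.File;
       import java.nio.charset.StandardCharsets;
       import opennlp.tools.postag.*;
       import opennlp.tools.util.*;

       public class SplitModelTrainer {
           public static void main(String[] args) throws Exception {
               // read word_TAG training data, one sentence per line
               InputStreamFactory in =
                   new MarkableFileInputStreamFactory(new File("split.train"));
               ObjectStream<POSSample> samples = new WordTagSampleStream(
                   new PlainTextByLineStream(in, StandardCharsets.UTF_8));
               POSModel model = POSTaggerME.train("en", samples,
                   TrainingParameters.defaultParams(), new POSTaggerFactory());

               // tags[i] will be SPLIT or NO_SPLIT for tokens[i]
               POSTaggerME tagger = new POSTaggerME(model);
               String[] tokens = {"he", "thought", ",", "Jeff", "signs", "on"};
               String[] tags = tagger.tag(tokens);
           }
       }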

    The POS Tagger can be trained and evaluated via the command line
    rather quickly, and that should give you an idea of how well this
    works from a splitting perspective.

    The training format looks like this:

       ... he_NO_SPLIT thought_NO_SPLIT ,_SPLIT Jeff_NO_SPLIT signs_NO_SPLIT on_NO_SPLIT
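
    Once you have a file in that format, training, evaluation and tagging
    could look like this (the file names are just placeholders):

       $ opennlp POSTaggerTrainer -lang en -model en-split.bin \
             -data split.train -encoding UTF-8
       $ opennlp POSTaggerEvaluator -model en-split.bin \
             -data split.test -encoding UTF-8
       $ opennlp POSTagger en-split.bin < tokenized-sentences.txt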

    Have a look at our documentation [1], or ask here if you need more help.

    The trickier part is probably compiling the training data for this.

    HTH,
    Jörn

    [1] https://opennlp.apache.org/docs/1.8.4/manual/opennlp.html#tools.postagger.tagging

    On Thu, Feb 8, 2018 at 3:19 PM, karthika nair <[email protected]> wrote:
    > Hello there,
    >
    >
    >
    > We use Machine Translation as one of our components for translations.
    > We call AWS Translate downstream for short sentences, and it performs
    > decently well. However, being a neural MT system, it fails on longer
    > sentences. Our metadata assets (long synopsis, short synopsis) are
    > typically sentences of ~40 words (or more!). AWS Translate often loses
    > context, skips words, and garbles meaning, resulting in poor
    > translations.
    >
    >
    >
    > We are currently looking at segmenting sentences into phrases, getting
    > those individual phrases translated, and concatenating the results
    > back together (i.e., implementing this paper
    > <http://tcci.ccf.org.cn/conference/2016/papers/72.pdf>). However, the
    > split model described is ambiguous about the feature defined
    > (specifically Equation 11 in Section 4.1). Has anyone here come across
    > this problem, or does anyone know of other approaches we could try for
    > translating long sentences?
    >
    >
    >
    > Here’s an example of a long sentence:
    >
    > When returning to his old law practice proves harder than he thought,
    > Jeff signs on to help his longtime nemesis Alan Connor represent
    > Marvin Humphries, a Greendale Community College-trained engineer who
    > designed a bridge that collapsed. To keep the school from shredding
    > the evidence of his client’s shoddy education, Alan asks Jeff to steal
    > his records so he can use them in court.
    >
    >
    >
    > We’d like this broken into:
    >
    >    1. When returning to his old law practice proves harder than he
    >    thought,
    >    2. Jeff signs on to help his longtime nemesis Alan Connor represent
    >    Marvin Humphries,
    >    3. a Greendale Community College-trained engineer who designed a
    >    bridge that collapsed
    >
    > (We’ve verified that these clauses get translated correctly.)
    >
    >
    >
    > Thank you.
    >
    > Warm Regards,
    >
    > Karthika.
