Hello,

reading 4.1 carefully, I think it is more or less what our POS Tagger already does. By default it uses Maxent, and it can be configured to use the token-window features described in section 4.1. It can also be trained to predict only two tags; in the case of 4.1 these would be SPLIT and NO_SPLIT. With Maxent the exact choice of features is often not that important; better features usually gain only a bit more accuracy.
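To make the two-tag idea concrete, here is a minimal Python sketch of how clause-segmented sentences could be turned into tagger training lines. It assumes the POS Tagger training layout of one sentence per line with token_TAG pairs, and that SPLIT goes on the last token of every clause except the final one. The function name to_training_line and the regex tokenizer are made up for illustration and are not part of OpenNLP; a real setup should tokenize the same way at training and at prediction time.

```python
import re

SPLIT, NO_SPLIT = "SPLIT", "NO_SPLIT"

# Naive tokenizer (an assumption): words, optionally joined by a hyphen or
# apostrophe, and every other non-space character as its own token.
TOKEN_RE = re.compile(r"\w+(?:[-'’]\w+)*|[^\w\s]")


def to_training_line(clauses):
    """Turn one sentence, given as a list of clauses, into a token_TAG line.

    The last token of each clause except the final one is tagged SPLIT;
    every other token is tagged NO_SPLIT.
    """
    tagged = []
    for i, clause in enumerate(clauses):
        tokens = TOKEN_RE.findall(clause)
        for j, token in enumerate(tokens):
            is_boundary = j == len(tokens) - 1 and i < len(clauses) - 1
            tagged.append(f"{token}_{SPLIT if is_boundary else NO_SPLIT}")
    return " ".join(tagged)


print(to_training_line(["he thought,", "Jeff signs on"]))
# prints: he_NO_SPLIT thought_NO_SPLIT ,_SPLIT Jeff_NO_SPLIT signs_NO_SPLIT on_NO_SPLIT
```

Writing one such line per sentence to a text file gives a training file in the expected layout; the manual annotation of clause boundaries remains the expensive step.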
The POS Tagger can be trained and evaluated from the command line fairly quickly, which will tell you how well this approach works for splitting. The training format looks like this:

... he_NO_SPLIT thought_NO_SPLIT ,_SPLIT Jeff_NO_SPLIT signs_NO_SPLIT on_NO_SPLIT

Have a look at our documentation [1], or ask here if you need more help. The trickier part is probably compiling the training data for this.

HTH,
Jörn

[1] https://opennlp.apache.org/docs/1.8.4/manual/opennlp.html#tools.postagger.tagging

On Thu, Feb 8, 2018 at 3:19 PM, karthika nair <[email protected]> wrote:

> Hello there,
>
> We use Machine Translation as one of our components for translations. We
> call AWS Translate downstream for short sentences and it performs decently
> well. However, being a neural MT system, it fails on longer sentences. Our
> metadata assets – (long synopsis, short synopsis) are typically sentences
> of length ~40words (or more!). AWS Translate often loses context, skips
> words and garbles meaning, resulting in poor translations.
>
> We are currently looking at sentence segmentation into phrases and getting
> those individual phrases translated and concatenated back. (ie.
> Implementing this paper
> <http://tcci.ccf.org.cn/conference/2016/papers/72.pdf>). However, the split
> model described is ambiguous about the feature defined(Specifically
> Equation 11 in Section 4.1). Has anyone here come across this problem /
> knows of any other approaches we could try for translating long sentences?
>
> Here’s an example of a long sentence –
>
> When returning to his old law practice proves harder than he thought, Jeff
> signs on to help his longtime nemesis Alan Connor represent Marvin
> Humphries, a Greendale Community College-trained engineer who designed a
> bridge that collapsed. To keep the school from shredding the evidence of
> his client’s shoddy education, Alan asks Jeff to steal his records so he
> can use them in court..
>
> We’d like this broken into
>
> 1. When returning to his old law practice proves harder than he thought,
> 2. Jeff signs on to help his longtime nemesis Alan Connor represent
> Marvin Humphries,
> 3. a Greendale Community College-trained engineer who designed a bridge
> that collapsed
>
> (We’ve verified that these clauses get translated correctly.)
>
> Thank you.
>
> Warm Regards,
>
> Karthika.
