Re: [Apertium-stuff] Extend weighted transfer rules GSoC proposal

Aboelhamd Aly Fri, 19 Apr 2019 12:30:12 -0700

Hi Sevilay. Hi Francis,

Unfortunately, Sevilay reported that the evaluation results for the kaz-tur
and spa-eng pairs were very bad, with only 30% of the tested sentences being
good, compared to Apertium's LRLM resolution.
So we discussed what to do next, and the plan is to make use of the
breakthroughs of deep neural networks in NLP, and especially in machine
translation.
We also discussed using values of n greater than 5 in the n-gram language
model we already use, and evaluating the effect of increasing n, which
could give us more insight into what to do next and how to do it.


Since I am taking an introductory deep learning course this term in college,
I waited these past two weeks to be introduced to the applications of deep
learning in NLP and MT.
Now I have basic knowledge of Recurrent Neural Networks (RNNs) and of why
to use them instead of standard networks in NLP, and I understand their
different architectures and the math behind forward and backward
propagation.
I also know how to build a simple language model, and how to avoid the
vanishing gradient problem, which prevents the model from capturing long
dependencies, by using Gated Recurrent Units (GRUs) and Long Short-Term
Memory (LSTM) networks.

As the next step, we will work only on the language model and leave the
max entropy part for later discussion.
So, along with trying different values of n in the n-gram language model
and evaluating the results, I will either use a ready-made RNNLM or build a
new one from scratch based on what I have learnt so far. Honestly, I prefer
the latter because it will increase my experience in applying what I have
learnt.
In the last 2 weeks I implemented RNNs with GRUs and LSTMs, and also
implemented a character-based language model, as two assignments, and they
were very fun to do. So implementing an RNN word-based LM will not take
much time, though its disadvantage is that it may not be close to the
state-of-the-art models.

Using an NNLM instead of the n-gram LM has these possible advantages:
- It automatically learns syntactic and semantic features.
- It overcomes the curse of dimensionality by generalizing better.
----------------------------------------------

I tried using n=8 instead of 5 in the n-gram LM, but the scores weren't
that different, as Sevilay pointed out in our discussion.
I knew that an NNLM is better than a statistical one, and that using
machine learning instead of the maximum entropy model would give better
performance.
*But* the evaluation results were very disappointing, unexpected and
illogical, so I thought there might be a bug in the code.
After some searching, I found that I had made a very silly *mistake* in
normalizing the LM scores. Since the scores are the log base 10 of the
sentence probability (and therefore negative), a score with a higher
magnitude means a lower probability, but what I did was the inverse of
that, and that was the cause of the very bad results.
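
A minimal sketch of the intended direction of the normalization, assuming
the scores are log10 sentence probabilities as described above. The
function name and the sample scores are made up for illustration and are
not taken from the project code, whose actual normalization may differ:

```python
def normalize_log10_scores(log10_scores):
    """Map log10 sentence probabilities to weights that sum to 1."""
    if not log10_scores:
        return []
    # Scores are log10(p) <= 0: a LARGER magnitude means a SMALLER probability.
    best = max(log10_scores)
    # Shift by the best score before exponentiating to avoid underflow.
    probs = [10.0 ** (score - best) for score in log10_scores]
    total = sum(probs)
    return [p / total for p in probs]


# Hypothetical scores for three candidate translations of one sentence.
scores = [-12.4, -15.1, -13.0]
weights = normalize_log10_scores(scores)
```

Here the candidate scored -12.4 gets the largest weight and the one scored
-15.1 the smallest; normalizing by magnitude instead would reverse that
order, which is the kind of inversion described above.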

I am fixing this now and then will re-evaluate the results with Sevilay.

Regards,
Aboelhamd


On Sun, Apr 7, 2019 at 6:46 PM Aboelhamd Aly <aboelhamd.abotr...@gmail.com>
wrote:

> Thanks Sevilay for your feedback, and thanks for the resources.
>
> On Sun, 7 Apr 2019, 18:42 Sevilay Bayatlı <sevilaybaya...@gmail.com wrote:
>
>> hi Aboelhamd,
>>
>> Your proposal looks good. I found these resources, which may be of benefit.
>>
>>
>>
>> Multi-source *neural translation*
>> https://arxiv.org/abs/1601.00710
>>
>> *Neural machine translation* with extended context
>> https://arxiv.org/abs/1708.05943
>>
>> Handling homographs in *neural machine translation*
>> https://arxiv.org/abs/1708.06510
>>
>>
>>
>> Sevilay
>>
>> On Sun, Apr 7, 2019 at 7:14 PM Aboelhamd Aly <
>> aboelhamd.abotr...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I have a not-yet-solid idea for an alternative to yasmet and the max
>>> entropy models:
>>> using neural networks to give us scores for the ambiguous
>>> rules.
>>> But I haven't yet settled on a formulation of the problem, nor on the
>>> structure of the inputs and outputs, or even the goal,
>>> as I think there are many formulations that we could adopt.
>>>
>>> For example, the most straightforward structure is to give the network
>>> all the possible combinations
>>> of a sentence's translations and let it choose the best one, or give them
>>> weights.
>>> Hence, the network learns which combinations to choose for a
>>> specific pair.
>>>
>>> Another example is, instead of building one network per pair,
>>> to build one network per ambiguous pattern, as we did with the max entropy
>>> models.
>>> So we give the network the combinations for that pattern,
>>> and let it assign weights to the ambiguous rules applied to that
>>> pattern.
>>>
>>> And for each structure there are many details and questions yet to
>>> answer.
>>>
>>> So with that said, I decided to look at some papers to see what others
>>> have done before
>>> to tackle similar problems or this exact problem, and how some of
>>> them used machine learning
>>> or deep learning to solve these problems, and then try to build on them.
>>>
>>> The resolution in some papers was very specific to the pairs they were
>>> developed for, so they were not very relevant to our case:
>>> 1) Resolving Structural Transfer Ambiguity in Chinese-to-Korean Machine
>>> Translation
>>> <https://www.worldscientific.com/doi/10.1142/S0219427903000887>.(2003)
>>> 2) Arabic Machine Translation: A Developmental Perspective
>>> <http://www.ieee.ma/IJICT/IJICT-SI-Bouzoubaa-3.3/2%20-%20paper_farghaly.pdf>
>>> .(2010)
>>>
>>> Some other papers tried not to generate ambiguous rules, or to minimize
>>> the ambiguity during transfer rule inference, and didn't provide any methods
>>> to resolve the ambiguity in our case. I thought that they might provide some
>>> help, but I think they are far from our topic:
>>> 1) Learning Transfer Rules for Machine Translation with Limited Data
>>> <http://www.cs.cmu.edu/~kathrin/ThesisSummary/ThesisSummary.pdf>.(2005)
>>> 2) Inferring Shallow-Transfer Machine Translation Rules from Small
>>> Parallel Corpora <https://arxiv.org/pdf/1401.5700.pdf>.(2009)
>>>
>>> Now I am looking into some more recent papers like :
>>> 1) Rule Based Machine Translation Combined with Statistical Post Editor
>>> for Japanese to English Patent Translation
>>> <http://www.mt-archive.info/MTS-2007-Ehara.pdf>.(2007)
>>> 2) Machine translation model using inductive logic programming
>>> <https://scholar.cu.edu.eg/?q=shaalan/files/101.pdf>.(2009)
>>> 3) Machine Learning for Hybrid Machine Translation
>>> <https://www.aclweb.org/anthology/W12-3138.pdf>.(2012)
>>> 4) Study and Comparison of Rule-Based and Statistical Catalan-Spanish
>>> Machine Translation Systems
>>> <https://pdfs.semanticscholar.org/a731/0d0c15b22381c7b372e783d122a5324b005a.pdf?_ga=2.89511443.981790355.1554651923-676013054.1554651923>
>>> .(2012)
>>> 5) Latest trends in hybrid machine translation and its applications
>>> <https://www.sciencedirect.com/science/article/pii/S0885230814001077>
>>> .(2015)
>>> 6) Machine Translation: Phrase-Based, Rule-Based and Neural Approaches
>>> with Linguistic Evaluation
>>> <http://www.dfki.de/~ansr01/docs/MacketanzEtAl2017_CIT.pdf>.(2017)
>>> 7) A Multitask-Based Neural Machine Translation Model with
>>> Part-of-Speech Tags Integration for Arabic Dialects
>>> <https://www.mdpi.com/2076-3417/8/12/2502/htm>.(2018)
>>>
>>> And I hope they give me some more insights and thoughts.
>>>
>>> --------------
>>>
>>> - So do you have recommendations for other papers that address the same
>>> problem?
>>> - Also, about the proposal: I modified it a little bit and shared it
>>> through the GSoC website as a draft,
>>>  so do you have any last feedback or thoughts about it, or do I just
>>> submit it as a final proposal?
>>> - Last thing, for the coding challenge (integrating weighted transfer
>>> rules with apertium-transfer),
>>>  I think it's finished, but I didn't get any feedback or response about
>>> it, and the pull request is not merged into master yet.
>>>
>>>
>>> Thanks,
>>> Aboelhamd
>>>
>>>
>>> On Sat, Apr 6, 2019 at 5:23 AM Aboelhamd Aly <
>>> aboelhamd.abotr...@gmail.com> wrote:
>>>
>>>> Hi Sevilay, hi spectei,
>>>>
>>>> For sentence splitting, I think that we need to know neither the
>>>> syntax nor the sentence boundaries of the language.
>>>> Also, I don't see any necessity for applying it at runtime, as at
>>>> runtime we only get the score of each pattern,
>>>> where there is no need for splitting. I also have a thought about using
>>>> beam search here, as I think it has no effect,
>>>> though maybe I am wrong. We can discuss it after we close this thread.
>>>>
>>>> We will handle the whole text as one unit and will depend only on the
>>>> captured patterns,
>>>> knowing that, in chunker terms, successive patterns that don't share
>>>> a transfer rule are independent.
>>>> So by using the lexical form of the text, we match the words with
>>>> patterns, then match patterns with rules.
>>>> And hence we know which patterns are ambiguous and how many ambiguous
>>>> rules they match.
>>>>
>>>> For example if we have text with the following patterns and
>>>> corresponding rules numbers:
>>>> p1:2  p2:1  p3:6  p4:4  p5:3  p6:5  p7:1  p8:4  p9:4  p10:6  p11:8
>>>> p12:5  p13:5  p14:1  p15:3  p16:2
>>>>
>>>> If such a text were handled by our old method, generating all the
>>>> possible combinations (multiplying the rule counts),
>>>> we would have 82944000 possible combinations, which is not practical
>>>> at all to score, and takes heavy computation and memory.
>>>> And if it is handled by our new method, applying all the ambiguous
>>>> rules of one pattern while fixing the other patterns at the LRLM rule
>>>> (adding the rule counts), we will have just 60 combinations, and not
>>>> all of them different, giving a drastically low number of combinations,
>>>> which may not be so representative.
>>>>
>>>> But if we apply the splitting idea, we will have something in the
>>>> middle that will hopefully avoid the disadvantages of both methods
>>>> and benefit from the advantages of both, too.
>>>> Let's proceed from the start of the text to the end of it, while
>>>> maintaining some threshold of say 24000 combinations.
>>>> p1 => 2
>>>> p1  p2 => 2
>>>> p1  p2  p3 => 12
>>>> p1  p2  p3  p4 => 48
>>>> p1  p2  p3  p4  p5 => 144
>>>> p1  p2  p3  p4  p5  p6 => 720
>>>> p1  p2  p3  p4  p5  p6  p7 => 720
>>>> p1  p2  p3  p4  p5  p6  p7  p8 => 2880
>>>> p1  p2  p3  p4  p5  p6  p7  p8  p9 => 11520
>>>>
>>>> We stop here, because taking the next pattern (p10, which would give
>>>> 69120 combinations) would exceed the threshold.
>>>> Having our first split, we can now continue our work on it as
>>>> usual,
>>>> but with more (yet not overwhelming) combinations, which would capture
>>>> more semantics.
>>>> After that, we take the next split and so on.
>>>>
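
The counts and the first split in the example above can be reproduced with
a short sketch. It assumes a greedy left-to-right split that closes the
current split as soon as adding the next pattern would push the product of
rule counts over the threshold; the function name is hypothetical and this
is not the project's implementation:

```python
from math import prod  # Python 3.8+


def split_by_threshold(rule_counts, threshold):
    """Greedily group consecutive patterns so that each group's number of
    rule combinations (product of its rule counts) stays within threshold."""
    splits, current, combos = [], [], 1
    for count in rule_counts:
        # Close the current split if this pattern would exceed the threshold.
        if current and combos * count > threshold:
            splits.append(current)
            current, combos = [], 1
        current.append(count)
        combos *= count
    if current:
        splits.append(current)
    return splits


# Rule counts for p1..p16 from the example above.
rule_counts = [2, 1, 6, 4, 3, 5, 1, 4, 4, 6, 8, 5, 5, 1, 3, 2]

all_combinations = prod(rule_counts)   # old method: 82944000
one_at_a_time = sum(rule_counts)       # new method: 60
splits = split_by_threshold(rule_counts, 24000)
split_sizes = [prod(s) for s in splits]  # [11520, 7200]
```

With the threshold of 24000, this yields two splits, p1..p9 with 11520
combinations and p10..p16 with 7200. In this sketch a single pattern whose
own rule count exceeds the threshold would still form a split by itself.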
>>>> -----------
>>>>
>>>> I agree with you that testing the current method with more than one
>>>> pair to know its accuracy is the priority,
>>>> and we are currently working on it.
>>>>
>>>> -----------
>>>>
>>>> As for an alternative to yasmet, I agree with spectei. Unfortunately, for
>>>> now I don't have a solid idea to discuss.
>>>> But in the next few days, I will try to come up with one or more ideas
>>>> to discuss.
>>>>
>>>>
>>>> On Fri, Apr 5, 2019 at 11:23 PM Francis Tyers <fty...@prompsit.com>
>>>> wrote:
>>>>
>>>>> El 2019-04-05 20:57, Sevilay Bayatlı escribió:
>>>>> > On Fri, 5 Apr 2019, 22:41 Francis Tyers, <fty...@prompsit.com>
>>>>> wrote:
>>>>> >
>>>>> >> El 2019-04-05 19:07, Sevilay Bayatlı escribió:
>>>>> >>> Hi Aboelhamd,
>>>>> >>>
>>>>> >>> There are some points about your proposal:
>>>>> >>>
>>>>> >>> First, I do not think "splitting sentence" is a good idea. Each
>>>>> >>> language has a different syntax, so how could you know when you
>>>>> >>> should split the sentence?
>>>>> >>
>>>>> >> Apertium works on the concept of a stream of words, so in the
>>>>> >> runtime
>>>>> >> we can't really rely on robust sentence segmentation.
>>>>> >>
>>>>> >> We can often use it, e.g. for training, but if sentence boundary
>>>>> >> detection
>>>>> >> were to be included, it would need to be trained, as Sevilay hints
>>>>> >> at.
>>>>> >>
>>>>> >> Also, I'm not sure how much we would gain from that.
>>>>> >>
>>>>> >>> Second, "substitute yasmet with other method": I think the result
>>>>> >>> will not be any better if you substitute it with a statistical method.
>>>>> >>>
>>>>> >>
>>>>> >> Substituting yasmet with a more up to date machine-learning method
>>>>> >> might be a worthwhile thing to do. What suggestions do you have?
>>>>> >>
>>>>> >> I think we first have to try the exact method with more than 3
>>>>> >> language pairs and then decide whether to substitute it or not,
>>>>> >> because what is the point of a new method if it doesn't achieve any
>>>>> >> gain? Then we can compare the results of the two methods and choose
>>>>> >> the best one. What do you think?
>>>>> >
>>>>>
>>>>> Yes, testing it with more language pairs is also a priority.
>>>>>
>>>>> Fran
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Apertium-stuff mailing list
>>>>> Apertium-stuff@lists.sourceforge.net
>>>>> https://lists.sourceforge.net/lists/listinfo/apertium-stuff
>>>>>
