I recommend reading the last mail in full-screen.



On Sun, Aug 11, 2019 at 1:50 PM Aboelhamd Aly <aboelhamd.abotr...@gmail.com>
wrote:

> Hi Francis, Sevilay, and all apertium mentors and contributors,
>
> I hope all of you are well and sound.
>
> This is an update on the progress of the scikit-learn SVM model:
>
> After some long research on using an SVM model instead of maximum
> entropy, a misunderstanding about the dataset we generate unfortunately
> slowed my research and took it further off course than it should have
> gone. The problem was with the weights (fractional counts) for each
> sample: I couldn't find a clear answer to my question "how do I include
> these weights in an SVM, and how does that really work?", and my search
> led me to papers proposing new SVM models that solve this problem, but
> the papers were either off-topic or not clear enough to re-implement
> their idea.
> Fortunately, I finally found the solution right under my nose in the
> scikit-learn library: several classification models accept
> weighted samples in training and testing, among them
> "NaiveBayes", "SVM", "DecisionTree", "RandomForest", and "AdaBoost".
>
>     1. Yasmet data format
>          0 $ 0.335668 # debate_0:0 monográfico_1:0 importante_2:0 #
> debate_0:1 monográfico_1:1 importante_2:1 # debate_0:2 monográfico_1:2
> importante_2:2 #
>          1 $ 0.329582 # debate_0:0 monográfico_1:0 importante_2:0 #
> debate_0:1 monográfico_1:1 importante_2:1 # debate_0:2 monográfico_1:2
> importante_2:2 #
>          2 $ 0.33475 # debate_0:0 monográfico_1:0 importante_2:0 #
> debate_0:1 monográfico_1:1 importante_2:1 # debate_0:2 monográfico_1:2
> importante_2:2 #
>
>         Sklearn data format
>         0  0.335668  debate  monográfico  importante
>         1  0.329582  debate  monográfico  importante
>         2  0.33475  debate  monográfico  importante
>
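> The conversion above can be sketched as follows (a minimal parser I
> wrote for illustration; the function name is mine, and the field layout,
> `class $ weight # feats ...`, is inferred from the yasmet lines above):

```python
def yasmet_to_sklearn(line):
    """Turn one yasmet line ('class $ weight # feats ...') into a
    (class, weight, words) row for sklearn.  Each '#' block repeats
    the same words, so reading the first block is enough;
    'debate_0:0' -> 'debate'."""
    head, feats = line.split("#", 1)
    cls, _, weight = head.split()
    first_block = feats.split("#", 1)[0]
    words = [tok.split("_", 1)[0] for tok in first_block.split()]
    return int(cls), float(weight), words
```

> For the first yasmet line above this yields
> `(0, 0.335668, ['debate', 'monográfico', 'importante'])`.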
>
>     2. The dataset is generated for one ambiguity. For example, the
> data sample above is generated from the ambiguous pattern "NOM ADJ ADJ",
> with the rules :
>         class 0 :  rule-15 (NOM ADJ ADJ)
>         class 1 :  rule-14 (NOM ADJ) ,  rule-32 (ADJ)
>         class 2 :  rule-1 (NOM) ,  rule-32 (ADJ) ,  rule-32 (ADJ)
>
>
>     3. The sklearn dataset is loaded into a pandas DataFrame, where the
> rule/class is the *target*, the fractional count is the *sample weight*,
> and the pattern words are the *features*.
>
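> A minimal sketch of that loading step and a weighted fit (toy values;
> SVC's fit really does take a sample_weight array, which is how the
> fractional counts enter training):

```python
import pandas as pd
from sklearn.svm import SVC

# Toy frame mirroring the layout above: rule/class is the target,
# the fractional count is the per-sample weight, and the (already
# integer-encoded) pattern words are the features.
df = pd.DataFrame({
    "target": [0, 1, 2, 0, 1, 2],
    "weight": [0.34, 0.33, 0.33, 0.34, 0.33, 0.33],
    "f0": [0, 0, 0, 1, 1, 1],
    "f1": [1, 1, 1, 2, 2, 2],
    "f2": [2, 2, 2, 3, 3, 3],
})

X = df[["f0", "f1", "f2"]]
y = df["target"]
clf = SVC(kernel="rbf").fit(X, y, sample_weight=df["weight"])
```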
>
>     4. Features are encoded with a simple encoder called *OrdinalEncoder*.
> It just encodes words as numbers from *0* to *n-1*, where *n* is the
> number of unique words
>         in the training dataset. The encoder works with each feature
> separately; that is, the first feature is encoded from 0 to (*n1* - 1),
> where *n1* is the number of unique
>         words in that feature, and similarly the second feature is encoded
> from 0 to (*n2* - 1), where *n2* is the number of unique words in that
> feature, and so on.
>         For example, if we have 3 features, each has only 2 unique words
> as follows :
>               debate monográfico importante
>               competitividad político económico
>
>        The encoding will be as follows :
>              debate=0  monográfico=1  importante=2
>              competitividad=0  político=1  económico=2
>
>       And this is a drawback: if we have a test sample "*competitividad*
> *económico* *importante*", the encoder will raise an error because it
> didn't see *económico*
>       before in the *second* feature, even though it's an *adjective*. So
> a solution may be to use another encoder, or simply to modify the use of
> the *OrdinalEncoder*:
>       if we combine the features instead of keeping them separate, the
> words would be encoded as follows :
>             debate=0  monográfico=2  importante=4
>             competitividad=1  político=3  económico=5
>
>       This is just encoding a string as a number, but there are other
> encoders, such as *OneHotEncoder*, or still other encoders not included
> in sklearn, which
>       could capture useful semantics in their representation.
>
> *      Do you recommend any specific encoder?*
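> One concrete option for the unseen-word problem, without changing
> encoders: recent scikit-learn versions let *OrdinalEncoder* map unknown
> categories to a sentinel value instead of raising an error (a sketch
> with the toy words above):

```python
from sklearn.preprocessing import OrdinalEncoder

# Toy training patterns: one row per pattern, one column per word slot.
X_train = [["debate", "monográfico", "importante"],
           ["competitividad", "político", "económico"]]

enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
enc.fit(X_train)

# "económico" was never seen in the second column; instead of raising
# an error it is mapped to the sentinel -1.
X_test = [["competitividad", "económico", "importante"]]
print(enc.transform(X_test))
```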
>
>
>     5. The SVM model's accuracy is very similar to that of the max
> entropy model, which in turn is similar to a *random* model. The accuracy
> is approximately
>         (1/*n*)*100 %, where *n* is the number of classes. For example,
> if a dataset has 2 classes the accuracy equals 50%, and if a dataset has
> 5 classes the accuracy
>         equals 20%, and so on. Why this behavior? I think it may be
> because the language model scores are very close to each other. What is
> the solution?
>         I really don't know whether a language model other than the
> n-gram model could improve the results, *or* whether a better encoder
> could improve them.
>
>         I have an idea here, though I don't know whether it's right: use
> WER, PER, BLEU, or a combination of them, instead of the language model
> score
>         for classification, so that it becomes our new sample weight.
>
>         *Do you have any thoughts here?*
>
>
>     6. Some datasets are very large, with hundreds of thousands
> or even millions of records, which makes training them with the sklearn
> SVM model impractical.
>         For example, I tried to train a dataset with *800k* records and
> it took about *30* hours on my PC; we have *7* datasets with more than
> *1* million records, and the largest
>         dataset has more than *14* million records, which would take
> about *21* days to train (by simple interpolation; in practice it would
> take multiples of that, since SVM training scales worse than linearly).
>
>         I had some options to choose from :
>           - Work with a *C/C++* SVM library instead of the *python*
> sklearn library. But since I have already written scripts and a program
> to integrate sklearn with our module,
>              re-implementing all of that is not the easiest solution.
>
>           - Run the training on Google *Colab* with *GPU* acceleration.
> I tried this, and it was about *30%* faster, but that was not enough,
> because Colab has
>             a running limit of *8* consecutive hours, so it would still
> not be enough for the *800k*-record dataset, which would now run in
> about *20* hours.
>
>           - Set a maximum threshold on the number of records. By
> experimenting, I found that a *200k* limit is a good choice: it's not
> too large
>             nor too small, and training takes about *1.5* hours.
>
>            I chose this last solution, so that I could test all the
> scripts, the program, and the sklearn integration with our module, and
> because only about 15
>            of the 230 datasets are affected by the threshold. And now it
> all works well.
>
>         *What do you recommend here, should I continue with the threshold
> solution, or are there other solutions?*
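> One more option worth measuring (I have not benchmarked it on our
> data): kernelised SVC training scales roughly quadratically or worse in
> the number of samples, while linear models such as *LinearSVC* or
> *SGDClassifier* (with hinge loss, approximately a linear SVM) scale
> roughly linearly, and they also accept sample weights:

```python
from sklearn.linear_model import SGDClassifier

# Hinge loss makes SGDClassifier an (approximate) linear SVM; unlike a
# kernel SVC, its training time grows roughly linearly with the data.
clf = SGDClassifier(loss="hinge", random_state=0)
X = [[0, 1, 2], [5, 6, 7], [0, 1, 3], [5, 6, 8]]
y = [0, 1, 0, 1]
clf.fit(X, y, sample_weight=[0.34, 0.33, 0.34, 0.33])
```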
>
>
>    7. After resolving the encoder and dataset-size issues, we can then
> tune the SVM hyper-parameters, such as the penalty rate (the *C*
> parameter) and the *kernel*, to achieve
>        better accuracy.
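> That tuning step could be sketched like this (toy data; recent
> scikit-learn versions forward fit parameters such as sample_weight to
> the estimator and slice them per cross-validation fold):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Search over the penalty rate C and the kernel mentioned above.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=3)

# Toy, cleanly separable data so the sketch runs end to end.
X = [[0], [1], [2], [5], [6], [7]] * 3
y = [0, 0, 0, 1, 1, 1] * 3
search.fit(X, y, sample_weight=[1.0] * len(y))
print(search.best_params_)
```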
>
>
>
> Since I was sick for more than 2 weeks, I will add 2-3 weeks of work
> after the end of GSoC. So this is the list of tasks to be done before 15
> September :
>
>    1. Finishing the main task of the GSoC idea, which is extending the
> module to interchunk and postchunk. Two days ago I began re-reading the
> apertium2
>        documentation to get familiar again with the transfer rule files.
> But there is still a problem with ambiguous transfer rules in the .t2x
> and .t3x files:
>        I don't know whether there are enough ambiguous rules (if any) to
> test the module once the implementation is finished.
>
>        *So could you provide me with updates on that ?*
>
>
>    2. Writing documentation for the usage of all programs and scripts.
>
>
>    3. Refactoring the module and integrating it into the apertium
> pipeline, since I didn't build on the refactored version you asked me to
> integrate before GSoC.
>        (Though I think this will be hard to finish before September 15.)
>
>
> *   Do you have any thoughts on what to do next?*
>
_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
