I recommend reading the last mail in full-screen.
On Sun, Aug 11, 2019 at 1:50 PM Aboelhamd Aly <aboelhamd.abotr...@gmail.com> wrote:
> Hi Francis, Sevilay, and all Apertium mentors and contributors,
>
> I hope all of you are well and sound.
>
> This is an update on the progress of the sklearn SVM model:
>
> After some long research on using an SVM model instead of maximum entropy,
> it turned out there was a misunderstanding of the dataset we generate,
> which unfortunately slowed my research and took it further off course than
> it should have gone. The problem was the weights (fractional counts) of
> the samples: I couldn't find a clear answer to the question "how do we
> include these weights in an SVM, and how does that really work?", and my
> search led me to some papers proposing new SVM models that solve this
> problem, but the papers were either off-topic or not clear enough to
> re-implement their ideas. Fortunately, I finally found the solution right
> under my feet in the scikit-learn library: several of its classifiers
> accept weighted samples in training and testing, among them "NaiveBayes",
> "SVM", "DecisionTree", "RandomForest" and "AdaBoost".
>
> 1. Yasmet data format:
>
>    0 $ 0.335668 # debate_0:0 monográfico_1:0 importante_2:0 # debate_0:1 monográfico_1:1 importante_2:1 # debate_0:2 monográfico_1:2 importante_2:2 #
>    1 $ 0.329582 # debate_0:0 monográfico_1:0 importante_2:0 # debate_0:1 monográfico_1:1 importante_2:1 # debate_0:2 monográfico_1:2 importante_2:2 #
>    2 $ 0.33475  # debate_0:0 monográfico_1:0 importante_2:0 # debate_0:1 monográfico_1:1 importante_2:1 # debate_0:2 monográfico_1:2 importante_2:2 #
>
>    Sklearn data format:
>
>    0 0.335668 debate monográfico importante
>    1 0.329582 debate monográfico importante
>    2 0.33475  debate monográfico importante
>
> 2. Each dataset is generated for one ambiguity. For example, the samples
>    above were generated from the ambiguous pattern "NOM ADJ ADJ", with the
>    rules:
>
>    class 0 : rule-15 (NOM ADJ ADJ)
>    class 1 : rule-14 (NOM ADJ) , rule-32 (ADJ)
>    class 2 : rule-1 (NOM) , rule-32 (ADJ) , rule-32 (ADJ)
>
> 3. The sklearn dataset is loaded into a pandas data-frame, where the
>    rule/class is the *target*, the fractional count is the *sample weight*
>    and the pattern words are the *features*.
>
> 4. Features are encoded with a simple encoder called *OrdinalEncoder*. It
>    just encodes words as integers from 0 to n-1, where n is the number of
>    unique words in the training dataset. The encoder works on each feature
>    (column) separately: the first feature is encoded from 0 to (n1 - 1),
>    where n1 is the number of unique words in that feature, the second
>    feature from 0 to (n2 - 1), where n2 is the number of unique words in
>    that feature, and so on.
>    For example, if we have 3 features, each with only 2 unique words:
>
>    debate monográfico importante
>    competitividad político económico
>
>    the per-feature encoding will be:
>
>    feature 1: competitividad=0 debate=1
>    feature 2: monográfico=0 político=1
>    feature 3: económico=0 importante=1
>
>    And this is a drawback: if we get a test sample "competitividad
>    económico importante", the encoder will raise an error, because it has
>    never seen "económico" in the *second* feature before, even though it
>    is an *adjective*. So a solution may be to use another encoder, or just
>    to modify the use of the *OrdinalEncoder*: if we combine the features
>    instead of keeping them separated, the words would be encoded as
>    follows:
>
>    debate=0 monográfico=2 importante=4
>    competitividad=1 político=3 económico=5
>
>    This is still just encoding each string as a number, but there are
>    other encoders such as *OneHotEncoder*, or encoders not included in
>    sklearn, which could capture useful semantics in their representation.
>
>    *Do you recommend any specific encoder?*
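Just to make point 4 concrete, here is a minimal sketch of the two encodings
(per-feature OrdinalEncoder vs. one vocabulary shared by all columns), using
the made-up rows from the example above. The variable names are only for
illustration, and the handle_unknown option exists only in newer scikit-learn
releases, so treat this as a sketch rather than the code in the actual
scripts:

    # Sketch: per-feature vs. shared-vocabulary ordinal encoding.
    # The rows below are the hypothetical example from the mail, not real data.
    import numpy as np
    from sklearn.preprocessing import OrdinalEncoder

    X_train = np.array([["debate", "monográfico", "importante"],
                        ["competitividad", "político", "económico"]])
    X_test = np.array([["competitividad", "económico", "importante"]])

    # 1) Default behaviour: one category list per feature (column).
    per_feature = OrdinalEncoder()
    per_feature.fit(X_train)
    # per_feature.transform(X_test) would raise ValueError here, because
    # "económico" was never seen in the second column, even though it
    # occurs in the third one.

    # 2) One workaround: give every column the same category list, so each
    #    word gets a single code no matter where it appears.
    vocab = sorted(set(X_train.ravel()))
    combined = OrdinalEncoder(categories=[vocab] * X_train.shape[1])
    combined.fit(X_train)
    print(combined.transform(X_test))   # every known word now has one code

    # 3) Another option (newer scikit-learn only): keep per-column encoding
    #    but map unseen words to -1 instead of raising an error.
    tolerant = OrdinalEncoder(handle_unknown="use_encoded_value",
                              unknown_value=-1)
    tolerant.fit(X_train)
    print(tolerant.transform(X_test))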
>
> 5. The SVM model's accuracy is pretty similar to that of the max entropy
>    model, which in turn is similar to a *random* model. The accuracy is
>    approximately (1/n)*100 %, where n is the number of classes. For
>    example, if a dataset has 2 classes the accuracy is about 50%, if it
>    has 5 classes the accuracy is about 20%, and so on. Why this behavior?
>    I think maybe because of the language model scores, as they are very
>    close to each other. What is the solution? I really don't know whether
>    a language model other than the n-gram one could improve the results,
>    or maybe a better encoder could.
>
>    I have an idea here, but I don't know if it's right or not: to use WER,
>    PER, BLEU, or a combination of them, instead of the language model
>    score, so that it becomes our new sample weight for classification.
>
>    *Do you have any thoughts here?*
>
> 6. Some datasets are very large, with hundreds of thousands or even
>    millions of records, which makes training them with the sklearn SVM
>    model impractical. I tried to train a dataset with 800k records and it
>    took about 30 hours on my PC; we have 7 datasets with more than 1
>    million records, and the largest dataset has more than 14 million
>    records, which would take about 21 days to train (by simple
>    interpolation; in reality it would take a multiple of that).
>
>    I had some options to choose from:
>
>    - Work with a C/C++ SVM library instead of the Python sklearn library.
>      But since I have already written scripts and a program to integrate
>      sklearn with our module, re-implementing all of that is not the
>      easier solution.
>
>    - Run the training on Google Colab with GPU acceleration. I tried this;
>      it was about 30% faster, but that was not enough, because Colab has a
>      limit of 8 consecutive hours of runtime, which still would not cover
>      the 800k-record dataset, now at about 20 hours.
>
>    - Set a maximum threshold on the number of records. By experimenting, I
>      found that a 200k limit is a good choice: it's neither too large nor
>      too small, and takes about 1.5 hours.
>
>    I chose the last solution, so that I could test all the scripts, the
>    program and the sklearn integration with our module, and because only
>    about 15 datasets out of 230 are affected by the threshold. And now it
>    all works well.
>
>    *What do you recommend here: should I continue with the threshold
>    solution, or are there other solutions?*
>
> 7. After the encoder and dataset-size issues are settled, we can then tune
>    the SVM hyper-parameters, such as the penalty rate (the *C* parameter)
>    and the *kernel*, to achieve better accuracy.
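On points 6 and 7, here is a rough sketch of where the 200k cap and the
C/kernel tuning would plug in, with the fractional counts kept as sample
weights throughout. The file name "dataset.txt", the column layout, the
sampling step and the grid values are all placeholders I made up to match
the description above, not the project's real scripts:

    # Sketch: cap a dataset at 200k rows, then try a few values of the SVM
    # penalty C and kernel, training and scoring with sample weights.
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import OrdinalEncoder
    from sklearn.svm import SVC

    MAX_RECORDS = 200_000          # size threshold from point 6

    # Assumed layout: class, fractional count, then the pattern words.
    df = pd.read_csv("dataset.txt", sep=" ", header=None)
    df.columns = ["target", "weight"] + [f"word{i}" for i in range(df.shape[1] - 2)]
    if len(df) > MAX_RECORDS:      # keep training time manageable
        df = df.sample(n=MAX_RECORDS, random_state=0)

    # Shared vocabulary across the word columns (see the encoder sketch above).
    words = df.drop(columns=["target", "weight"])
    vocab = sorted(set(words.values.ravel()))
    X = OrdinalEncoder(categories=[vocab] * words.shape[1]).fit_transform(words)
    y = df["target"].to_numpy()
    w = df["weight"].to_numpy()

    X_tr, X_te, y_tr, y_te, w_tr, w_te = train_test_split(
        X, y, w, test_size=0.2, random_state=0)

    # Point 7: penalty C and kernel are the obvious knobs; arbitrary grid.
    for C in (0.1, 1, 10):
        for kernel in ("linear", "rbf"):
            clf = SVC(C=C, kernel=kernel)
            clf.fit(X_tr, y_tr, sample_weight=w_tr)          # weighted training
            acc = clf.score(X_te, y_te, sample_weight=w_te)  # weighted accuracy
            print(C, kernel, round(acc, 4))

If the kernel SVC stays too slow even at 200k records, LinearSVC or
SGDClassifier from the same library also accept sample_weight and scale much
better with dataset size, so they might be worth a try before moving to a
C/C++ library.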
>
> As I was sick for more than 2 weeks, I will add 2-3 weeks of work after
> the end of GSoC. So this is the list of tasks to be done before 15
> September:
>
> 1. Finishing the main task of the GSoC idea, which is extending the module
>    to interchunk and postchunk. Two days ago I already began re-reading
>    the apertium2 documentation to get familiar again with the transfer
>    rule files. But there is still a problem with ambiguous transfer rules
>    in .t2x and .t3x files: I don't know whether there are enough ambiguous
>    rules (if any) to test the module with after finishing the
>    implementation.
>
>    *So could you provide me with updates on that?*
>
> 2. Writing documentation for the usage of all programs and scripts.
>
> 3. Refactoring and integrating the module into the Apertium pipeline, as I
>    didn't build on the refactored version you asked me to integrate before
>    GSoC. (Though I think this will be hard to finish before September 15.)
>
> *Do you have any thoughts on what to do next?*
>
_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff