An event is dropped when the cutoff is so high that all of its features are removed. With the default cutoff of 5, a bag-of-words feature is kept only if it occurs at least 5 times across the whole training set; in your 12 lines the most frequent words ("non", "va") occur only 3 times, so every feature, and therefore every event, is dropped. I recommend training with more data or decreasing the cutoff value to zero.
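To make the dropping logic concrete, here is a small illustrative sketch (not the actual OpenNLP source) of the behaviour described above, run against the twelve training lines from the question: a feature survives only if it reaches the cutoff count over the whole data set, and an event whose features all fall below the cutoff is dropped.

```java
import java.util.*;

// Sketch of the data-indexer cutoff behaviour (illustrative, not OpenNLP code).
public class CutoffDemo {

    // Returns the events that survive indexing with the given cutoff.
    public static List<String[]> survivingEvents(List<String[]> events, int cutoff) {
        // Count how often each bag-of-words feature occurs across all events.
        Map<String, Integer> counts = new HashMap<>();
        for (String[] features : events)
            for (String f : features)
                counts.merge(f, 1, Integer::sum);
        // An event is kept only if at least one of its features meets the cutoff.
        List<String[]> survivors = new ArrayList<>();
        for (String[] features : events) {
            for (String f : features) {
                if (counts.get(f) >= cutoff) {
                    survivors.add(features);
                    break;
                }
            }
        }
        return survivors;
    }

    public static void main(String[] args) {
        // The twelve training lines from the question, category column removed.
        List<String[]> events = new ArrayList<>();
        for (String line : new String[] {
                "ok", "tutto bene", "decisamente non male", "fantastica scelta",
                "non pensavo di poter essere così contento",
                "certamente un'ottimo risultato", "non va affatto bene",
                "per nulla", "niente affatto divertente", "va malissimo",
                "va decisamente male", "sono molto triste" }) {
            events.add(line.split(" "));
        }
        // No word occurs 5 times, so with cutoff 5 every event is dropped.
        System.out.println(survivingEvents(events, 5).size()); // prints 0
        // With cutoff 1 all twelve events survive.
        System.out.println(survivingEvents(events, 1).size()); // prints 12
    }
}
```

This also answers the "how much data is enough" question for this error: at least one feature per event has to reach the cutoff count, which is about word repetition, not just the number of lines.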
Jörn

On Tue, Jul 11, 2017 at 3:44 PM, Alessandro Depase <[email protected]> wrote:
> Hi all,
> I'm trying to perform my first (newbie) document categorization using
> the Italian language. I'm using a very simple file with this content:
>
> Ok ok
> Ok tutto bene
> Ok decisamente non male
> Ok fantastica scelta
> Ok non pensavo di poter essere così contento
> Ok certamente un'ottimo risultato
> no non va affatto bene
> no per nulla
> no niente affatto divertente
> no va malissimo
> no va decisamente male
> no sono molto triste
>
> (no lines before or after the quoted ones - and, yes, I know that in
> Italian "un'ottimo" is an error, but it was part of my list :) )
>
> and I got this output:
>
> $ ./opennlp.bat DoccatTrainer -model it-doccat.bin -lang it -data
> "C:\Users\adepase\MPSProjects\MrJEditor\languages\MrJEditor\sandbox\sourcegen\MrJEditor\sandbox\Train1.train"
> -encoding UTF-8
> Indexing events using cutoff of 5
>
> Computing event counts... done. 12 events
> Indexing... Dropped event Ok:[bow=ok]
> Dropped event Ok:[bow=tutto, bow=bene]
> Dropped event Ok:[bow=decisamente, bow=non, bow=male]
> Dropped event Ok:[bow=fantastica, bow=scelta]
> Dropped event Ok:[bow=non, bow=pensavo, bow=di, bow=poter, bow=essere, bow=così, bow=contento]
> Dropped event Ok:[bow=certamente, bow=un'ottimo, bow=risultato]
> Dropped event no:[bow=non, bow=va, bow=affatto, bow=bene]
> Dropped event no:[bow=per, bow=nulla]
> Dropped event no:[bow=niente, bow=affatto, bow=divertente]
> Dropped event no:[bow=va, bow=malissimo]
> Dropped event no:[bow=va, bow=decisamente, bow=male]
> Dropped event no:[bow=sono, bow=molto, bow=triste]
> done.
> Sorting and merging events...
>
> ERROR: Not enough training data
> The provided training data is not sufficient to create enough events to
> train a model.
> To resolve this error use more training data, if this doesn't help there
> might be some fundamental problem with the training data itself.
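To retrain with a lower cutoff from the command line, you can pass a training-parameters file via the trainer's -params option. A minimal sketch, assuming the key names documented in the OpenNLP manual (Algorithm, Iterations, Cutoff) and an arbitrary file name such as params.txt:

```
Algorithm=MAXENT
Iterations=100
Cutoff=0
```

Then rerun the DoccatTrainer command above with -params params.txt added; with Cutoff=0 none of the twelve events should be dropped.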
>
> I already found a couple of other similar issues on the Internet, just
> saying that there are not enough lines (but I have 6 lines for each
> category and a cutoff of 5) or that without at least 100 lines the
> categorization quality is not sufficient (ok, but that's just a quality
> matter: it should work, with bad results, but it should work). The reason
> for the insufficient data is that all the lines are dropped. Some people
> seem to succeed with even 10 lines.
> But why? What did I miss? I cannot find useful documentation...
>
> Please note that my question is about *why* the lines are dropped, about
> the reason, the logic behind dropping them.
> I tried to understand the code (I stopped when it required too much time
> without downloading and debugging it), and this is what I understood:
> *the AbstractDataIndexer throws the exception in the method sortAndMerge
> because it "thinks" there isn't enough data*, but it uses the
> *List eventsToCompare*, which is the result of a previous computation in
> the same class, in the *method index(ObjectStream<Event> events,
> Map<String, Integer> predicateIndex)*.
> There the code builds an int[] from each line in a way I cannot
> completely understand (my question, at the very end, is: what is the
> logic behind the compilation of this array?). If the array has more than
> one element, then ok, we have elements to compare (and sortAndMerge will
> not throw this exception); otherwise the line is dropped. So: what is the
> logic behind dropping the line?
> The documentation just talks about the cutoff value, but I supplied more
> lines than required by the cutoff.
> So, to complete the question: is there a way to quantify the minimum
> number of lines or words (or whatever) needed? Why are the examples
> available online working with 10 lines, while my example is not?
> I don't mind the quality here; I completely understand that it will not
> produce a meaningful result in a real case, but why did I get an
> exception while others did not?
>
> In the meanwhile I tried with roughly 15 lines and it returned no
> exception. The quality of the categorization was very low, as expected
> (it almost always returned "ok", even for sentences from the training
> set - is that related to the fact that the corresponding lines were
> dropped and the training happened only on the few others?). With 29
> lines it started to give meaningful answers; nonetheless, the questions
> remain.
>
> Thank you in advance for your support.
> Kind regards,
> Alessandro
