Hi all,
I'm trying to perform my first (newbie) document categorization in
Italian.
I'm using a very simple training file with this content:
Ok ok
Ok tutto bene
Ok decisamente non male
Ok fantastica scelta
Ok non pensavo di poter essere così contento
Ok certamente un'ottimo risultato
no non va affatto bene
no per nulla
no niente affatto divertente
no va malissimo
no va decisamente male
no sono molto triste
(no lines before or after the quoted ones - and, yes, I know that in
Italian "un'ottimo" is a spelling error, but it was part of my list :) )
and I got this output:
$ ./opennlp.bat DoccatTrainer -model it-doccat.bin -lang it -data
"C:\Users\adepase\MPSProjects\MrJEditor\languages\MrJEditor\sandbox\sourcegen\MrJEditor\sandbox\Train1.train"
-encoding UTF-8
Indexing events using cutoff of 5
Computing event counts... done. 12 events
Indexing... Dropped event Ok:[bow=ok]
Dropped event Ok:[bow=tutto, bow=bene]
Dropped event Ok:[bow=decisamente, bow=non, bow=male]
Dropped event Ok:[bow=fantastica, bow=scelta]
Dropped event Ok:[bow=non, bow=pensavo, bow=di, bow=poter, bow=essere,
bow=così, bow=contento]
Dropped event Ok:[bow=certamente, bow=un'ottimo, bow=risultato]
Dropped event no:[bow=non, bow=va, bow=affatto, bow=bene]
Dropped event no:[bow=per, bow=nulla]
Dropped event no:[bow=niente, bow=affatto, bow=divertente]
Dropped event no:[bow=va, bow=malissimo]
Dropped event no:[bow=va, bow=decisamente, bow=male]
Dropped event no:[bow=sono, bow=molto, bow=triste]
done.
Sorting and merging events...
ERROR: Not enough training data
The provided training data is not sufficient to create enough events to
train a model.
To resolve this error use more training data, if this doesn't help there
might
be some fundamental problem with the training data itself.
I already found a couple of similar issues on the Internet. They just say
that there are not enough lines (but I have 6 lines per category and a
cutoff of 5), or that without at least 100 lines the categorization quality
is not sufficient (fine, but that is just a quality matter: it should still
work, with bad results). The actual reason for the insufficient data is
that all the lines are dropped. Other people seem to succeed with as few
as 10 lines.
But why? What did I miss? I cannot find useful documentation.
Please note that my question is about *why* the lines are dropped - about
the reason, the logic behind dropping them.
I tried to understand the code (I stopped when it required too much time
without downloading and debugging it), and this is what I understood:
* the AbstractDataIndexer throws the exception in the method
*sortAndMerge* because it "thinks" there isn't enough data, but it uses
the *List eventsToCompare*, which is the result of a previous computation
in the same class, in the *method index(ObjectStream<Event> events,
Map<String, Integer> predicateIndex)*
* there the code builds an int[] from each line in a way I could not
completely understand (my question, in the end, is: what is the logic
behind the construction of this array?). If the array has at least one
element, then fine, we have elements to compare (and sortAndMerge will
not throw this exception); otherwise the line is dropped. So: what is
the logic behind dropping the line?
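To make the question concrete, this is my best guess at the logic I was
trying to read - a sketch under my own assumptions (names and structure
are mine, not the actual OpenNLP source): count each predicate across all
events, keep only predicates seen at least cutoff times, and drop any
event whose features are all filtered out.

```java
import java.util.*;

public class IndexSketch {

    // My guess (NOT the real OpenNLP source) at how index() builds the
    // per-event int[] of predicate ids while applying the cutoff.
    static List<int[]> index(List<String[]> events, int cutoff) {
        // 1. count how many times each predicate occurs across all events
        Map<String, Integer> counts = new HashMap<>();
        for (String[] event : events)
            for (String predicate : event)
                counts.merge(predicate, 1, Integer::sum);

        // 2. keep only predicates that reach the cutoff, assigning ids
        Map<String, Integer> predicateIndex = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet())
            if (e.getValue() >= cutoff)
                predicateIndex.put(e.getKey(), predicateIndex.size());

        // 3. translate each event into the ids of its surviving predicates;
        //    an event with no surviving predicate is dropped entirely
        List<int[]> indexed = new ArrayList<>();
        for (String[] event : events) {
            int[] ids = Arrays.stream(event)
                              .filter(predicateIndex::containsKey)
                              .mapToInt(predicateIndex::get)
                              .toArray();
            if (ids.length > 0)
                indexed.add(ids);   // kept
            // else: this would be the "Dropped event ..." case in the log
        }
        return indexed;
    }

    public static void main(String[] args) {
        List<String[]> events = List.of(
                new String[] {"ok"},
                new String[] {"tutto", "bene"},
                new String[] {"va", "malissimo"});
        System.out.println("kept with cutoff 5: " + index(events, 5).size());
        System.out.println("kept with cutoff 1: " + index(events, 1).size());
    }
}
```

If this guess is right, it would explain my output: with a cutoff of 5
every predicate in my tiny file is removed, and therefore every event is
dropped. But I could not confirm this from the code or the docs.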
The documentation only talks about the cutoff value, but I provided more
lines than the cutoff requires.
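One hypothesis I cannot confirm from the documentation: maybe the cutoff
is applied per *word* (per predicate), not per line. A quick count over my
12 lines (a hypothetical check in plain Java, nothing OpenNLP-specific)
shows that no single word reaches the cutoff of 5:

```java
import java.util.*;

public class CutoffCount {

    // My 12 training lines; the first token is the category label,
    // the remaining tokens are the bag-of-words features.
    static final String[] LINES = {
        "Ok ok", "Ok tutto bene", "Ok decisamente non male",
        "Ok fantastica scelta",
        "Ok non pensavo di poter essere così contento",
        "Ok certamente un'ottimo risultato",
        "no non va affatto bene", "no per nulla",
        "no niente affatto divertente", "no va malissimo",
        "no va decisamente male", "no sono molto triste"
    };

    // Highest number of occurrences of any feature word across all lines.
    static int maxFrequency() {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : LINES) {
            String[] tokens = line.toLowerCase().split("\\s+");
            for (int i = 1; i < tokens.length; i++)   // skip the label
                counts.merge(tokens[i], 1, Integer::sum);
        }
        return counts.values().stream().max(Integer::compare).orElse(0);
    }

    public static void main(String[] args) {
        // prints: most frequent word appears 3 times (cutoff is 5)
        System.out.println("most frequent word appears "
                + maxFrequency() + " times (cutoff is 5)");
    }
}
```

Even the most frequent words ("non", "va") appear only 3 times, so if the
cutoff counts word occurrences rather than lines, nothing in my file would
survive it. Is that the intended semantics?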
So, to complete the question: is there a way to quantify the minimum
number of lines, words, or whatever is needed? Why do the examples
available online work with 10 lines while mine does not? I don't mind the
quality here - I completely understand that it will not produce a
meaningful result in a real case - but why did I get an Exception while
others did not?
In the meantime I tried with roughly 15 lines and it raised no exception.
The quality of the categorization was very low, as expected (it almost
always returned "ok", even for sentences from the training set - is that
related to the fact that the corresponding lines were dropped and the
training happened only on the few others?). With 29 lines it started to
give meaningful answers; nonetheless, the questions remain.
Thank you in advance for your support
Kind Regards
Alessandro