An event is dropped when the cutoff is so high that all of its features are removed. With the default cutoff of 5, a bag-of-words feature is kept only if it occurs at least 5 times across the whole training set; in your 12 lines the most frequent words ("non", "va") occur only 3 times, so every feature, and therefore every event, is dropped. I recommend training with more data or decreasing the cutoff value to zero.
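To make the dropping logic concrete, here is a small illustrative sketch (not the actual OpenNLP source) of the behaviour described above, run against the twelve training lines from the question: a feature survives only if it reaches the cutoff count over the whole data set, and an event whose features all fall below the cutoff is dropped.

```java
import java.util.*;

// Sketch of the data-indexer cutoff behaviour (illustrative, not OpenNLP code).
public class CutoffDemo {

    // Returns the events that survive indexing with the given cutoff.
    public static List<String[]> survivingEvents(List<String[]> events, int cutoff) {
        // Count how often each bag-of-words feature occurs across all events.
        Map<String, Integer> counts = new HashMap<>();
        for (String[] features : events)
            for (String f : features)
                counts.merge(f, 1, Integer::sum);
        // An event is kept only if at least one of its features meets the cutoff.
        List<String[]> survivors = new ArrayList<>();
        for (String[] features : events) {
            for (String f : features) {
                if (counts.get(f) >= cutoff) {
                    survivors.add(features);
                    break;
                }
            }
        }
        return survivors;
    }

    public static void main(String[] args) {
        // The twelve training lines from the question, category column removed.
        List<String[]> events = new ArrayList<>();
        for (String line : new String[] {
                "ok", "tutto bene", "decisamente non male", "fantastica scelta",
                "non pensavo di poter essere così contento",
                "certamente un'ottimo risultato", "non va affatto bene",
                "per nulla", "niente affatto divertente", "va malissimo",
                "va decisamente male", "sono molto triste" }) {
            events.add(line.split(" "));
        }
        // No word occurs 5 times, so with cutoff 5 every event is dropped.
        System.out.println(survivingEvents(events, 5).size()); // prints 0
        // With cutoff 1 all twelve events survive.
        System.out.println(survivingEvents(events, 1).size()); // prints 12
    }
}
```

This also answers the "how much data is enough" question for this error: at least one feature per event has to reach the cutoff count, which is about word repetition, not just the number of lines.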
Jörn

On Tue, Jul 11, 2017 at 3:44 PM, Alessandro Depase <[email protected]> wrote:
> Hi all,
> I'm trying to perform my first (newbie) document categorization using
> the Italian language. I'm using a very simple file with this content:
>
> Ok ok
> Ok tutto bene
> Ok decisamente non male
> Ok fantastica scelta
> Ok non pensavo di poter essere così contento
> Ok certamente un'ottimo risultato
> no non va affatto bene
> no per nulla
> no niente affatto divertente
> no va malissimo
> no va decisamente male
> no sono molto triste
>
> (no lines before or after the quoted ones - and, yes, I know that in
> Italian "un'ottimo" is an error, but it was part of my list :) )
>
> and I got this output:
>
> $ ./opennlp.bat DoccatTrainer -model it-doccat.bin -lang it -data
> "C:\Users\adepase\MPSProjects\MrJEditor\languages\MrJEditor\sandbox\sourcegen\MrJEditor\sandbox\Train1.train"
> -encoding UTF-8
> Indexing events using cutoff of 5
>
> Computing event counts... done. 12 events
> Indexing... Dropped event Ok:[bow=ok]
> Dropped event Ok:[bow=tutto, bow=bene]
> Dropped event Ok:[bow=decisamente, bow=non, bow=male]
> Dropped event Ok:[bow=fantastica, bow=scelta]
> Dropped event Ok:[bow=non, bow=pensavo, bow=di, bow=poter, bow=essere, bow=così, bow=contento]
> Dropped event Ok:[bow=certamente, bow=un'ottimo, bow=risultato]
> Dropped event no:[bow=non, bow=va, bow=affatto, bow=bene]
> Dropped event no:[bow=per, bow=nulla]
> Dropped event no:[bow=niente, bow=affatto, bow=divertente]
> Dropped event no:[bow=va, bow=malissimo]
> Dropped event no:[bow=va, bow=decisamente, bow=male]
> Dropped event no:[bow=sono, bow=molto, bow=triste]
> done.
> Sorting and merging events...
>
> ERROR: Not enough training data
> The provided training data is not sufficient to create enough events to
> train a model.
> To resolve this error use more training data, if this doesn't help there
> might be some fundamental problem with the training data itself.
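To retrain with a lower cutoff from the command line, you can pass a training-parameters file via the trainer's -params option. A minimal sketch, assuming the key names documented in the OpenNLP manual (Algorithm, Iterations, Cutoff) and an arbitrary file name such as params.txt:

```
Algorithm=MAXENT
Iterations=100
Cutoff=0
```

Then rerun the DoccatTrainer command above with -params params.txt added; with Cutoff=0 none of the twelve events should be dropped.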
>
> I already found a couple of other similar issues on the Internet, just
> saying that there are not enough lines (but I have 6 lines for each
> category and a cutoff of 5) or that without at least 100 lines the
> categorization quality is not sufficient (ok, but that's just a quality
> matter: it should work, with bad results, but it should work). The reason
> for the insufficient data is that all the lines are dropped. Some people
> seem to succeed with even 10 lines.
> But why? What did I miss? I cannot find useful documentation...
>
> Please note that my question is about *why* the lines are dropped, about
> the reason, the logic behind dropping them.
> I tried to understand the code (I stopped when it required too much time
> without downloading and debugging it), and this is what I understood:
> *the AbstractDataIndexer throws the exception in the method sortAndMerge
> because it "thinks" there isn't enough data*, but it uses the
> *List eventsToCompare*, which is the result of a previous computation in
> the same class, in the *method index(ObjectStream<Event> events,
> Map<String, Integer> predicateIndex)*.
> There the code builds an int[] from each line in a way I cannot
> completely understand (my question, at the very end, is: what is the
> logic behind the compilation of this array?). If the array has more than
> one element, then ok, we have elements to compare (and sortAndMerge will
> not throw this exception); otherwise the line is dropped. So: what is the
> logic behind dropping the line?
> The documentation just talks about the cutoff value, but I supplied more
> lines than required by the cutoff.
> So, to complete the question: is there a way to quantify the minimum
> number of lines or words (or whatever) needed? Why are the examples
> available online working with 10 lines, while my example is not?
> I don't mind the quality here; I completely understand that it will not
> produce a meaningful result in a real case, but why did I get an
> exception while others did not?
>
> In the meanwhile I tried with roughly 15 lines and it returned no
> exception. The quality of the categorization was very low, as expected
> (it almost always returned "ok", even for sentences from the training
> set - is that related to the fact that the corresponding lines were
> dropped and the training happened only on the few others?). With 29
> lines it started to give meaningful answers; nonetheless, the questions
> remain.
>
> Thank you in advance for your support.
> Kind regards,
> Alessandro
