Thank you for the further detail.
Your answer does clarify the messages and the terminology, but I still
don't completely understand.
I had already tried a cutoff of 1 after Jörn's answer, and it worked the
same.
Moreover, I tried other cutoff values (and also the bigger file in
attachment).
Here are some extracts from the different outputs (with my notes inline, to
be read incrementally; search for ********):
Cutoff=0 or Cutoff=1 (no lines dropped, everything clear to me)
Indexing events using cutoff of 1
Computing event counts... done. 48 events
Indexing... done.
Sorting and merging events... done. Reduced 48 events to 48.
Done indexing.
Incorporating indexed data for training...
done.
Number of Event Tokens: 48
Number of Outcomes: 3
Number of Predicates: 94
------------------------------------------------------------------------------
Indexing events using cutoff of 2
Computing event counts... done. 48 events
Indexing... Dropped event Ok:[bow=ok]
Dropped event Ok:[bow=certamente, bow=un, bow=ottimo, bow=risultato]
******** WHY????
Dropped event Ok:[bow=positiva]
Dropped event Ok:[bow=splendido]
Dropped event Ok:[bow=affascinante]
Dropped event no:[bow=per, bow=nulla] ******** Hmm (stop words? Is that
why there is a language parameter?)
Dropped event no:[bow=negativa]
Dropped event no:[bow=terribile]
Dropped event no:[bow=orribile]
Dropped event no:[bow=orrendo]
Dropped event no:[bow=servizio, bow=inaccettabile] ******** ??? Here I
don't understand: there are no plausible stop words, so the previous
hypothesis doesn't apply
Dropped event no:[bow=sconsiglio, bow=a, bow=tutti] ******** WHY? I must
have misunderstood something
Dropped event Insomma:[bow=insomma]
done.
Sorting and merging events... done. Reduced 35 events to 33.
Done indexing.
Incorporating indexed data for training...
done.
Number of Event Tokens: 33
Number of Outcomes: 3
Number of Predicates: 24
------------------------------------------------------------------------------
Indexing events using cutoff of 3
Computing event counts... done. 48 events
Indexing... Dropped event Ok:[bow=ok]
Dropped event Ok:[bow=tutto, bow=bene]
Dropped event Ok:[bow=mi, bow=sono, bow=molto, bow=divertito] ********
WHY? This wasn't in the set of dropped lines in the previous run
Dropped event Ok:[bow=fantastica, bow=scelta]
Dropped event Ok:[bow=certamente, bow=un, bow=ottimo, bow=risultato]
Dropped event Ok:[bow=fantastica, bow=soluzione]
Dropped event Ok:[bow=positiva]
Dropped event Ok:[bow=ritornare, bow=con, bow=amici, bow=e, bow=parenti]
******** WHY? This wasn't in the set of dropped lines in the previous run
Dropped event Ok:[bow=splendido]
Dropped event Ok:[bow=affascinante]
Dropped event no:[bow=per, bow=nulla]
Dropped event no:[bow=negativa]
Dropped event no:[bow=terribile]
Dropped event no:[bow=orribile]
Dropped event no:[bow=orrendo]
Dropped event no:[bow=sono, bow=molto, bow=triste]
Dropped event no:[bow=servizio, bow=inaccettabile]
Dropped event no:[bow=sconsiglio, bow=a, bow=tutti]
Dropped event Insomma:[bow=ma, bow=anche, bow=meglio]
Dropped event Insomma:[bow=nè, bow=caldo, bow=nè, bow=freddo]
Dropped event Insomma:[bow=mi, bow=lascia, bow=assolutamente,
bow=indifferente]
Dropped event Insomma:[bow=insomma]
done.
Sorting and merging events... done. Reduced 26 events to 23.
Done indexing.
Incorporating indexed data for training...
done.
Number of Event Tokens: 23
Number of Outcomes: 3
Number of Predicates: 11
------------------------------------------------------------------------------
Indexing events using cutoff of 4
Computing event counts... done. 48 events
Indexing... Dropped event Ok:[bow=ok]
Dropped event Ok:[bow=tutto, bow=bene]
Dropped event Ok:[bow=mi, bow=sono, bow=molto, bow=divertito]
Dropped event Ok:[bow=fantastica, bow=scelta]
Dropped event Ok:[bow=certamente, bow=un, bow=ottimo, bow=risultato]
Dropped event Ok:[bow=fantastica, bow=soluzione]
Dropped event Ok:[bow=positiva]
Dropped event Ok:[bow=ritornare, bow=con, bow=amici, bow=e, bow=parenti]
Dropped event Ok:[bow=esperienza, bow=sicuramente, bow=entusiasmante]
Dropped event Ok:[bow=splendido]
Dropped event Ok:[bow=affascinante]
Dropped event no:[bow=per, bow=nulla]
Dropped event no:[bow=negativa]
Dropped event no:[bow=terribile]
Dropped event no:[bow=orribile]
Dropped event no:[bow=orrendo]
Dropped event no:[bow=va, bow=malissimo] ******** WHY???? This is a real
surprise: it wasn't dropped in any previous run. In the earlier cases I
thought I had just missed some detail, but this drop, compared with the
previous runs, makes me think I didn't understand anything
Dropped event no:[bow=sono, bow=molto, bow=triste]
Dropped event no:[bow=come, bow=fare, bow=ad, bow=essere, bow=contenti]
Dropped event no:[bow=servizio, bow=inaccettabile]
Dropped event no:[bow=sconsiglio, bow=a, bow=tutti]
Dropped event Insomma:[bow=così, bow=così] ******** WHY??? Same surprise
Dropped event Insomma:[bow=tutto, bow=sommato, bow=poteva, bow=essere,
bow=peggio] ******** WHY?
Dropped event Insomma:[bow=ma, bow=anche, bow=meglio]
Dropped event Insomma:[bow=nè, bow=caldo, bow=nè, bow=freddo]
Dropped event Insomma:[bow=mi, bow=lascia, bow=assolutamente,
bow=indifferente]
Dropped event Insomma:[bow=insomma]
done.
Sorting and merging events... done. Reduced 21 events to 14.
Done indexing.
Incorporating indexed data for training...
done.
Number of Event Tokens: 14
Number of Outcomes: 3
Number of Predicates: 6
------------------------------------------------------------------------------
Indexing events using cutoff of 5
Computing event counts... done. 48 events
Indexing... Dropped event Ok:[bow=ok]
Dropped event Ok:[bow=tutto, bow=bene]
Dropped event Ok:[bow=mi, bow=sono, bow=molto, bow=divertito]
Dropped event Ok:[bow=fantastica, bow=scelta]
Dropped event Ok:[bow=certamente, bow=un, bow=ottimo, bow=risultato]
Dropped event Ok:[bow=fantastica, bow=soluzione]
Dropped event Ok:[bow=positiva]
Dropped event Ok:[bow=niente, bow=affatto, bow=male] ******** WHY???
Another surprise
Dropped event Ok:[bow=ritornare, bow=con, bow=amici, bow=e, bow=parenti]
Dropped event Ok:[bow=esperienza, bow=sicuramente, bow=entusiasmante]
Dropped event Ok:[bow=splendido]
Dropped event Ok:[bow=affascinante]
Dropped event no:[bow=per, bow=nulla]
Dropped event no:[bow=negativa]
Dropped event no:[bow=terribile]
Dropped event no:[bow=orribile]
Dropped event no:[bow=orrendo]
Dropped event no:[bow=niente, bow=affatto, bow=divertente]
Dropped event no:[bow=va, bow=malissimo]
Dropped event no:[bow=va, bow=decisamente, bow=male]
Dropped event no:[bow=sono, bow=molto, bow=triste]
Dropped event no:[bow=come, bow=fare, bow=ad, bow=essere, bow=contenti]
Dropped event no:[bow=servizio, bow=inaccettabile]
Dropped event no:[bow=male, bow=davvero]
Dropped event no:[bow=sconsiglio, bow=a, bow=tutti]
Dropped event Insomma:[bow=così, bow=così]
Dropped event Insomma:[bow=tutto, bow=sommato, bow=poteva, bow=essere,
bow=peggio]
Dropped event Insomma:[bow=ma, bow=anche, bow=meglio]
Dropped event Insomma:[bow=nè, bow=caldo, bow=nè, bow=freddo]
Dropped event Insomma:[bow=mi, bow=lascia, bow=assolutamente,
bow=indifferente]
Dropped event Insomma:[bow=insomma]
done.
Sorting and merging events... done. Reduced 17 events to 6.
Done indexing.
Incorporating indexed data for training...
done.
Number of Event Tokens: 6
Number of Outcomes: 3
Number of Predicates: 2
As you can see, it's not a matter of making it work (from a "user" point of
view it now seems to work), but of understanding it (also so I can give end
users hints on how to build the training set).
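For what it's worth, here is how I now read the behavior Daniel and Jörn described, as a small Python sketch. This is only a simulation of the documented logic, not OpenNLP's actual code, and it assumes every occurrence of a word in the set counts toward the feature total: count each bag-of-words feature across the whole training set, remove features that occur fewer than cutoff times, and drop any event left with no surviving features.

```python
from collections import Counter

# The original 12-line training set from my first message.
EVENTS = [
    ("Ok", "ok"),
    ("Ok", "tutto bene"),
    ("Ok", "decisamente non male"),
    ("Ok", "fantastica scelta"),
    ("Ok", "non pensavo di poter essere così contento"),
    ("Ok", "certamente un'ottimo risultato"),
    ("no", "non va affatto bene"),
    ("no", "per nulla"),
    ("no", "niente affatto divertente"),
    ("no", "va malissimo"),
    ("no", "va decisamente male"),
    ("no", "sono molto triste"),
]

def simulate(events, cutoff):
    """Simulate the documented indexing logic: remove features seen
    fewer than `cutoff` times, then drop events with no features left."""
    counts = Counter()
    for _, text in events:
        counts.update(text.split())  # bag-of-words features
    kept, dropped = [], []
    for outcome, text in events:
        surviving = [w for w in text.split() if counts[w] >= cutoff]
        (kept if surviving else dropped).append((outcome, text))
    return kept, dropped

for cutoff in (1, 2, 5):
    kept, dropped = simulate(EVENTS, cutoff)
    print(f"cutoff={cutoff}: kept {len(kept)}, dropped {len(dropped)}")
# cutoff=1 keeps all 12 events; cutoff=5 drops all 12, reproducing the
# "Not enough training data" situation.
```

If this reading is right, the "surprise" drops are simply events whose last surviving feature fell below the new cutoff, so an event kept at cutoff 2 can disappear at cutoff 3. (Whether OpenNLP counts a word repeated within one line once or twice is an assumption here; the bow=nè, bow=caldo, bow=nè, bow=freddo line suggests duplicates are kept.)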
Thank you again
Alessandro
2017-07-13 22:02 GMT+02:00 Daniel Russ <[email protected]>:
> Hello Alessandro,
> Jörn is correct, you don’t have enough data. But let’s force it to work.
>
> Every line in your training file is a Document and also a training EVENT.
> An event is an Outcome and a set of features [also known as the context].
>
> so the first line of your data is
>
> Ok ok <- this is the event with outcome=Ok and context ok. The
> DoccatTrainer makes it [bow=ok] because it is a Bag of Words model.
> Ok tutto bene <- the next event, with outcome Ok and context = [tutto, bene]
> …
>
> When everything is finished, the features ok, tutto, bene and the others
> each occur once in your training set. I think only “non” occurs more than
> once (it occurs 3 times).
> OpenNLP counts the number of times each feature occurs; if the count is
> less than the cutoff, it removes the feature. If an event has no features
> left, OpenNLP drops the event.
> All of your events are dropped.
>
> I ran it with
>
> opennlp DoccatTrainer -params it-params.txt -lang it -model modelIt.bin
> -data Train1.train
> where it-params.txt is a file containing
> Cutoff=0
>
> Indexing events using cutoff of 0
>
> Computing event counts... done. 12 events
> Indexing... done.
> Sorting and merging events... done. Reduced 12 events to 12.
> Done indexing.
> Incorporating indexed data for training...
> done.
> Number of Event Tokens: 12
> Number of Outcomes: 2
> Number of Predicates: 27
> ...done.
> Computing model parameters ...
> Performing 100 iterations.
> 1: ... loglikelihood=-8.317766166719343 0.5
> 2: ... loglikelihood=-7.093654793495284 1.0
> 3: ... loglikelihood=-6.256360369219492 1.0
> =======================================
> 99: ... loglikelihood=-0.6058512089422219 1.0
> 100: ... loglikelihood=-0.6002320355445091 1.0
> Writing document categorizer model ... done (0.048s)
>
> Wrote document categorizer model to
> path: /Users/druss/modelIt.bin
>
> Did I answer your question?
> Daniel
>
> > On Jul 11, 2017, at 11:12 AM, Alessandro Depase <
> [email protected]> wrote:
> >
> > Thank you.
> > So, if I understand correctly, the cutoff is related to features and not
> > to lines, as I had wrongly understood from some Internet examples (and
> > from not reading the documentation carefully either... my bad)
> >
> > Reading the documentation, I find that if no feature generator has been
> > specified, "Bag of words" is used.
> >
> > It's not so clear that this means "tested on every single training line"
> > and not "on every category" (after your answer it is indeed clearer).
> > Roughly speaking, this means that a training line with more words than
> > the cutoff (more than 5 words with the default) is kept, and one with
> > fewer words than the cutoff is dropped; is that correct?
> >
> > If this is correct, then using the defaults it should suffice to lower
> > the cutoff to 1, not to zero (a line with no words is meaningless anyway,
> > and I think the file was already validated as well formatted).
> >
> > I'll make some tests in this direction, thank you so much
> > Alessandro
> >
> >
> >
> > 2017-07-11 16:40 GMT+02:00 Joern Kottmann <[email protected]>:
> >
> >> An event is dropped when the cutoff is so high that all features are
> >> removed from that event.
> >> I recommend training with more data or decreasing the cutoff value to
> >> zero.
> >>
> >> Jörn
> >>
> >> On Tue, Jul 11, 2017 at 3:44 PM, Alessandro Depase
> >> <[email protected]> wrote:
> >>> Hi all,
> >>> I'm trying to perform my first (newbie) document categorization using
> >>> italian language.
> >>> I'm using a very simple file with this content:
> >>>
> >>> Ok ok
> >>> Ok tutto bene
> >>> Ok decisamente non male
> >>> Ok fantastica scelta
> >>> Ok non pensavo di poter essere così contento
> >>> Ok certamente un'ottimo risultato
> >>> no non va affatto bene
> >>> no per nulla
> >>> no niente affatto divertente
> >>> no va malissimo
> >>> no va decisamente male
> >>> no sono molto triste
> >>>
> >>> (no lines before or after the quoted ones - and, yes, I know that in
> >>> Italian "un'ottimo" is an error, but it was part of my list :) ) and I
> >>> got this output:
> >>>
> >>> $ ./opennlp.bat DoccatTrainer -model it-doccat.bin -lang it -data
> >>> "C:\Users\adepase\MPSProjects\MrJEditor\languages\MrJEditor\
> >> sandbox\sourcegen\MrJEditor\sandbox\Train1.train"
> >>> -encoding UTF-8
> >>> Indexing events using cutoff of 5
> >>>
> >>> Computing event counts... done. 12 events
> >>> Indexing... Dropped event Ok:[bow=ok]
> >>>
> >>> Dropped event Ok:[bow=tutto, bow=bene]
> >>> Dropped event Ok:[bow=decisamente, bow=non, bow=male]
> >>> Dropped event Ok:[bow=fantastica, bow=scelta]
> >>> Dropped event Ok:[bow=non, bow=pensavo, bow=di, bow=poter, bow=essere,
> >>> bow=così, bow=contento]
> >>> Dropped event Ok:[bow=certamente, bow=un'ottimo, bow=risultato]
> >>> Dropped event no:[bow=non, bow=va, bow=affatto, bow=bene]
> >>> Dropped event no:[bow=per, bow=nulla]
> >>> Dropped event no:[bow=niente, bow=affatto, bow=divertente]
> >>> Dropped event no:[bow=va, bow=malissimo]
> >>> Dropped event no:[bow=va, bow=decisamente, bow=male]
> >>> Dropped event no:[bow=sono, bow=molto, bow=triste]
> >>> done.
> >>> Sorting and merging events...
> >>>
> >>> ERROR: Not enough training data
> >>> The provided training data is not sufficient to create enough events to
> >>> train a model.
> >>> To resolve this error use more training data, if this doesn't help
> there
> >>> might
> >>> be some fundamental problem with the training data itself.
> >>>
> >>> I have already found a couple of similar issues on the Internet, which
> >>> just say that there are not enough lines (but I have 6 lines for each
> >>> category and a cutoff of 5), or that without at least 100 lines the
> >>> categorization quality is not sufficient (OK, but that is just a
> >>> quality matter: it should still work, even with bad results). The
> >>> reason for insufficient data is that all the lines are dropped. Some
> >>> people seem to succeed with as few as 10 lines.
> >>> But why? What did I miss? I cannot find useful documentation...
> >>>
> >>> Please note that my question is about *why* the lines are dropped:
> >>> about the reason, the logic behind dropping them.
> >>> I tried to understand the code (I stopped when it required too much
> >>> time without downloading and debugging it), and this is what I
> >>> understood:
> >>> *the AbstractDataIndexer throws the exception in the method
> >>> _sortAndMerge_ because it "thinks" there isn't enough data*, but it
> >>> uses the *List eventsToCompare*, which is the result of a previous
> >>> computation in the same class, in the *method
> >>> index(ObjectStream<Event> events, Map<String, Integer> predicateIndex)*.
> >>> There the code builds an int[] from each line in a way I could not
> >>> completely understand (my question, at the very end, is: what is the
> >>> logic behind the compilation of this array?). If the array has more
> >>> than one element, then OK, we have elements to compare (and
> >>> sortAndMerge will not throw this exception); otherwise the line is
> >>> dropped. So: what is the logic behind dropping the line?
> >>> The documentation just talks about the cutoff value, but I provided
> >>> more lines than required by the cutoff.
> >>> So, to complete the question: is there a way to quantify the minimum
> >>> number of lines, words, or whatever is needed? Why do the examples
> >>> available online work with 10 lines while mine does not? I don't mind
> >>> the quality here; I completely understand that it will not produce a
> >>> meaningful result in a real case, but why did I get an exception while
> >>> others did not?
> >>>
> >>> In the meantime I tried with roughly 15 lines and it threw no
> >>> exception. The quality of the categorization was very low, as expected
> >>> (it almost always returned "ok", even for sentences from the training
> >>> set; is this related to the fact that the corresponding lines were
> >>> dropped and the training happened only on a few others?). With 29
> >>> lines it began to give meaningful answers; nonetheless, the questions
> >>> remain.
> >>>
> >>> Thank you in advance for your support
> >>> Kind Regards
> >>> Alessandro
> >>
>
>