On 17 Sep 2013, at 09:19, Jörn Kottmann wrote:

> On 09/17/2013 08:55 AM, Giorgio Valoti wrote:
>> Hi all,
>> this is my first post to the list. I’ve tried to gather some information
>> from the documentation and by googling around, but I haven’t found a
>> satisfying answer to the following questions. Please tell me where to RTFM
>> if any of these questions belong in a FAQ or are off-topic.
>>
>> It seems there’s no way to incrementally train the POS tagger, nor to
>> parallelize this task. Is this correct?
>>
>> If the only way to train the POS tagger is in one single shot, how can I
>> estimate the memory requirements for the JVM? In other words, given, say,
>> a 1GB training corpus, is there a way to estimate how much RAM would be
>> needed?
>>
>> Finally, I have tried to use the `-ngram` switch:
>>
>>> opennlp POSTaggerTrainer.conllx -type maxent -ngram 3 ... <other options
>>> as usual: -lang -model -data -encoding>
>>
>> but I get this error:
>>
>>> Building ngram dictionary ...
>>> IO error while building NGram Dictionary: Stream not marked
>>> java.io.IOException: Stream not marked
>>>         at java.io.BufferedReader.reset(BufferedReader.java:485)
>>>         at opennlp.tools.util.PlainTextByLineStream.reset(PlainTextByLineStream.java:79)
>>>         at opennlp.tools.util.FilterObjectStream.reset(FilterObjectStream.java:43)
>>>         at opennlp.tools.util.FilterObjectStream.reset(FilterObjectStream.java:43)
>>>         at opennlp.tools.cmdline.postag.POSTaggerTrainerTool.run(POSTaggerTrainerTool.java:80)
>>>         at opennlp.tools.cmdline.CLI.main(CLI.java:222)
>>
>> But I can’t find out what I’m doing wrong.
>
> It looks like it tries to reset the stream, but that doesn’t seem to work.
> Do you use 1.5.3?
> Please open a JIRA issue for this so we can fix it.
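For anyone who hits the same trace: java.io.BufferedReader.reset() throws an
IOException with exactly the message "Stream not marked" whenever reset() is
called without a prior mark() (or after the mark's read-ahead limit has been
invalidated), so the ngram-dictionary pass appears to be rewinding a
PlainTextByLineStream that was never marked. A minimal, OpenNLP-independent
sketch of the JDK contract:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.StringReader;

    public class MarkResetDemo {
        public static void main(String[] args) throws IOException {
            BufferedReader in =
                new BufferedReader(new StringReader("first\nsecond\n"));

            // reset() without a prior mark() fails with the same
            // message as the stack trace above.
            try {
                in.reset();
            } catch (IOException e) {
                System.out.println(e.getMessage()); // Stream not marked
            }

            // The supported pattern: mark a position with a read-ahead
            // limit, read, then rewind to it with reset().
            in.mark(1024);
            System.out.println(in.readLine()); // first
            in.reset();
            System.out.println(in.readLine()); // first again
        }
    }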
Yes, it’s 1.5.3. I’ll open it ASAP.

> Usually the pos tagger is trained without this ngram option; it is some old
> left-over experiment in the code which didn’t turn out to improve things.

Ah ok, that’s good to know. In fact, I got pretty good results without it,
but I was wondering whether -ngram could deliver even more precise results.

> If you have that much data you probably want to use the Two Pass Data
> Indexer; I am not sure if the pos tagger is doing that by default.
> A higher cutoff might help to reduce the required memory, and otherwise
> just try to give the process a couple of gigabytes of RAM.
>
> On which data set are you training? A gigabyte of training data is quite
> a lot ...

<http://www.corpusitaliano.it/en/index.html>

The whole corpus is well over 9GB. It’s not my plan to analyze the whole
thing, of course! Do you think it would be realistic to use the evaluation
tool to decide on a reasonable size for the corpus? I’m not an expert, but I
guess there’s no point in analyzing that much data if you can achieve good
enough accuracy with a much smaller sample, right?

Ciao
--
Giorgio Valoti
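A practical note on the memory question: the command was run through the
opennlp.tools.cmdline.CLI main class (visible in the stack trace above), so
you can invoke it directly and hand the JVM an explicit heap limit; raising
the cutoff then shrinks how many rare features the data indexer has to keep.
The jar name, file names, and the 4g figure below are placeholders, not a
verified recipe:

    # Assumed paths; adjust the classpath and heap to your installation.
    java -Xmx4g -cp opennlp-tools-1.5.3.jar opennlp.tools.cmdline.CLI \
        POSTaggerTrainer.conllx -type maxent \
        -lang it -model it-pos.bin -data it-train.conllx -encoding UTF-8

    # If your build exposes the -params option, the cutoff goes in a
    # training-parameters file, e.g. train.params containing:
    #   Algorithm=MAXENT
    #   Iterations=100
    #   Cutoff=10

There is no exact formula tying corpus size to heap size: the footprint
depends on the number of distinct events and features left after the cutoff,
not on the raw file size, so the realistic approach is to start with a few
gigabytes of heap and raise the cutoff if indexing still runs out of memory.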
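On picking a corpus size empirically: a simple learning curve does what you
describe. Train on growing slices, evaluate each model against the same
held-out file, and stop adding data once accuracy flattens. A sketch, with
made-up file names and slice sizes (note that head -n can cut a CoNLL-X
sentence in half at the slice boundary, so treat the counts as approximate
or split on blank lines instead):

    # heldout.conllx stays fixed across all runs.
    for n in 100000 200000 400000 800000; do
        head -n $n it-train.conllx > slice.conllx
        opennlp POSTaggerTrainer.conllx -type maxent -lang it \
            -model it-pos-$n.bin -data slice.conllx -encoding UTF-8
        opennlp POSTaggerEvaluator.conllx -model it-pos-$n.bin \
            -data heldout.conllx -encoding UTF-8
    done

If the evaluator in your build does not accept the .conllx format suffix,
convert the held-out file to the default word_tag one-sentence-per-line
format first and drop the suffix.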
