Hello,

I just added the <SPLIT> tag because files separated only by whitespace could not be processed by the command line tool. It found just 1 feature, and the training ended with an exception like "Unable to create model due to" in the first iteration, and all the likelihoods are 1.0. I simply replaced all whitespace with the split tag, as described in the documentation.

Andreas

On 13.03.2013 20:13, Jörn Kottmann wrote:
> The tokenizer's defaults are for text which is mostly whitespace separated,
> did you lose all the white spaces in the text you want to process?
>
> Jörn
>
> On 03/13/2013 04:31 PM, Andreas Niekler wrote:
>> Hello,
>>
>> I give you some examples below this comment. But I already noticed in
>> the code that the standard tokenizerTrainer call uses the standard
>> alphanumeric pattern, which won't work for typical German examples. Most
>> of the data will be separated because of the improper pattern in the
>> standard Factory.java class. My belief is that the de-token.bin model
>> was trained with a proper pattern within another implementation of the
>> training procedure.
>>
>> Here are some training lines:
>>
>> Senden<SPLIT>Pfleiderer<SPLIT>verkaufen<SPLIT>Düsseldorf<SPLIT>(<SPLIT>aktiencheck.de<SPLIT>AG<SPLIT>)<SPLIT>-<SPLIT>Der<SPLIT>Analyst<SPLIT>vom<SPLIT>Bankhaus<SPLIT>Lampe<SPLIT>,<SPLIT>Marc<SPLIT>Gabriel<SPLIT>,<SPLIT>stuft<SPLIT>die<SPLIT>Pfleiderer-Aktie<SPLIT>(<SPLIT>ISIN<SPLIT>DE0006764749<SPLIT>/<SPLIT>WKN<SPLIT>676474<SPLIT>)<SPLIT>von<SPLIT>"<SPLIT>halten<SPLIT>"<SPLIT>auf<SPLIT>"<SPLIT>verkaufen<SPLIT>"<SPLIT>herab<SPLIT>.
>>
>> Der<SPLIT>vollständige<SPLIT>Zwischenbericht<SPLIT>wird<SPLIT>am<SPLIT>8<SPLIT>.<SPLIT>November<SPLIT>2010<SPLIT>um<SPLIT>12.00<SPLIT>Uhr<SPLIT>veröffentlicht<SPLIT>.
>>
>> Besonders<SPLIT>in<SPLIT>ländlichen<SPLIT>Gegenden<SPLIT>sind<SPLIT>Telegrafenmaste<SPLIT>auch<SPLIT>heute<SPLIT>noch<SPLIT>weit<SPLIT>verbreitet<SPLIT>-<SPLIT>größtenteils<SPLIT>für<SPLIT>die<SPLIT>Festnetztelefonie<SPLIT>.
>>
>> Newsticker<SPLIT>RSS-Feed<SPLIT>Morgenweb<SPLIT>Sarah<SPLIT>Palin<SPLIT>als<SPLIT>Reality-Star<SPLIT>im<SPLIT>US-Fernsehen<SPLIT>auf<SPLIT>Sendung<SPLIT>15.11.10<SPLIT>4:58<SPLIT>:<SPLIT>Washington<SPLIT>(<SPLIT>dpa<SPLIT>)<SPLIT>-<SPLIT>Sarah<SPLIT>Palin<SPLIT>hat<SPLIT>jetzt<SPLIT>eine<SPLIT>eigene<SPLIT>Show<SPLIT>.
>>
>> Fotos<SPLIT>Terrorwarnung<SPLIT>-<SPLIT>Was<SPLIT>man<SPLIT>jetzt<SPLIT>beachten<SPLIT>sollte<SPLIT>Die<SPLIT>Sicherheitslage<SPLIT>spitzt<SPLIT>sich<SPLIT>zu<SPLIT>.
>>
>> Newsticker<SPLIT>RSS-Feed<SPLIT>Morgenweb<SPLIT>Tausende<SPLIT>Siedler<SPLIT>protestieren<SPLIT>gegen<SPLIT>neuen<SPLIT>Baustopp<SPLIT>21.11.10<SPLIT>11:51<SPLIT>:<SPLIT>Jerusalem<SPLIT>(<SPLIT>dpa<SPLIT>)<SPLIT>-<SPLIT>Die<SPLIT>israelischen<SPLIT>Siedler<SPLIT>haben<SPLIT>ihre<SPLIT>Proteste<SPLIT>gegen<SPLIT>einen<SPLIT>erwarteten<SPLIT>neuen<SPLIT>Baustopp<SPLIT>im<SPLIT>Westjordanland<SPLIT>verschärft<SPLIT>.
>>
>> Jetzt<SPLIT>einloggen<SPLIT>SchwarzKater<SPLIT>(<SPLIT>vor<SPLIT>4<SPLIT>Stunden<SPLIT>)<SPLIT>WTF<SPLIT>?
>>
>> Das<SPLIT>Bankhaus<SPLIT>hat<SPLIT>das<SPLIT>Kursziel<SPLIT>für<SPLIT>die<SPLIT>Salzgitter-Aktien<SPLIT>von<SPLIT>69,00<SPLIT>auf<SPLIT>58,00<SPLIT>Euro<SPLIT>gesenkt<SPLIT>,<SPLIT>aber<SPLIT>die<SPLIT>Einstufung<SPLIT>auf<SPLIT>´<SPLIT>Overweight<SPLIT>´<SPLIT>belassen<SPLIT>.
>>
>> Bundeskanzlerin<SPLIT>Angela<SPLIT>Merkel<SPLIT>(<SPLIT>CDU<SPLIT>)<SPLIT>ist<SPLIT>am<SPLIT>Dienstag<SPLIT>zum<SPLIT>Gipfel<SPLIT>der<SPLIT>Organisation<SPLIT>für<SPLIT>Sicherheit<SPLIT>und<SPLIT>Zusammenarbeit<SPLIT>in<SPLIT>Europa<SPLIT>(<SPLIT>OSZE<SPLIT>)<SPLIT>in<SPLIT>Kasachstan<SPLIT>eingetroffen<SPLIT>.
>>
>> Mann<SPLIT>totgeprügelt<SPLIT>:<SPLIT>Haftstrafen<SPLIT>im<SPLIT>«<SPLIT>20-Cent-Prozess<SPLIT>»<SPLIT>Die<SPLIT>beiden<SPLIT>Schläger<SPLIT>jugendlichen<SPLIT>Schläger<SPLIT>sind<SPLIT>wegen<SPLIT>Körperverletzung<SPLIT>mit<SPLIT>Todesfolge<SPLIT>zu<SPLIT>Haftstrafen<SPLIT>verurteilt<SPLIT>worden<SPLIT>.
>>
>> Börsen-Ticker<SPLIT>RSS<SPLIT>›<SPLIT>News<SPLIT>AKTIEN<SPLIT>SCHWEIZ/Vorbörse<SPLIT>:<SPLIT>Leicht<SPLIT>höhere<SPLIT>Eröffnung<SPLIT>erwartet<SPLIT>-<SPLIT>Positive<SPLIT>US-Vorgaben<SPLIT>06.12.2010<SPLIT>08:45<SPLIT>Zürich<SPLIT>(<SPLIT>awp<SPLIT>)<SPLIT>-<SPLIT>Der<SPLIT>Schweizer<SPLIT>Aktienmarkt<SPLIT>dürfte<SPLIT>die<SPLIT>Sitzung<SPLIT>vom<SPLIT>Montag<SPLIT>mit<SPLIT>moderaten<SPLIT>Gewinnen<SPLIT>eröffnen<SPLIT>.
>>
>> Werbung<SPLIT>'<SPLIT>)<SPLIT>;<SPLIT>AIG<SPLIT>hatte<SPLIT>sich<SPLIT>auf<SPLIT>dem<SPLIT>US-Häusermarkt<SPLIT>verspekuliert<SPLIT>.
>>
>> Außerordentliche<SPLIT>Hauptversammlung<SPLIT>genehmigt<SPLIT>Aktiensplit<SPLIT>und<SPLIT>Vorratsbeschlüsse<SPLIT>für<SPLIT>Kapitalmaßnahmen<SPLIT>Ad-hoc-Mitteilung<SPLIT>übermittelt<SPLIT>durch<SPLIT>euro<SPLIT>adhoc<SPLIT>mit<SPLIT>dem<SPLIT>Ziel<SPLIT>einer<SPLIT>europaweiten<SPLIT>Verbreitung<SPLIT>.
>>
>> Börsen-Ticker<SPLIT>RSS<SPLIT>›<SPLIT>News<SPLIT>Führungscrew<SPLIT>übernimmt<SPLIT>bei<SPLIT>Modekette<SPLIT>Schild<SPLIT>Stefan<SPLIT>Portmann<SPLIT>und<SPLIT>Thomas<SPLIT>Herbert<SPLIT>heissen<SPLIT>die<SPLIT>neuen<SPLIT>starken<SPLIT>Männer<SPLIT>bei<SPLIT>Schild<SPLIT>.
>>
>> IrfanView<SPLIT>Lizenz<SPLIT>:<SPLIT>Freeware<SPLIT>—<SPLIT>Hersteller-Website<SPLIT>IrfanView<SPLIT>ist<SPLIT>ein<SPLIT>für<SPLIT>private<SPLIT>Zwecke<SPLIT>kostenloses<SPLIT>Bildbetrachtungs-<SPLIT>und<SPLIT>Bildbearbeitungsprogramm<SPLIT>,<SPLIT>das<SPLIT>für<SPLIT>kleinere<SPLIT>Belange<SPLIT>durchaus<SPLIT>ausreicht<SPLIT>.
>>
>> Dragonica<SPLIT>So<SPLIT>testet<SPLIT>4Players<SPLIT>Bitte<SPLIT>einloggen<SPLIT>,<SPLIT>um<SPLIT>Spiel<SPLIT>in<SPLIT>die<SPLIT>Watchlist<SPLIT>aufzunehmen<SPLIT>.
>>
>> Rating-Update<SPLIT>:<SPLIT>London<SPLIT>(<SPLIT>aktiencheck.de<SPLIT>AG<SPLIT>)<SPLIT>-<SPLIT>Robert<SPLIT>T<SPLIT>.<SPLIT>Cornell<SPLIT>,<SPLIT>Scott<SPLIT>L<SPLIT>.<SPLIT>Gaffner<SPLIT>und<SPLIT>Darren<SPLIT>Yip<SPLIT>,<SPLIT>Analysten<SPLIT>von<SPLIT>Barclays<SPLIT>Capital<SPLIT>,<SPLIT>stufen<SPLIT>die<SPLIT>Aktie<SPLIT>von<SPLIT>ITT<SPLIT>Industries<SPLIT>(<SPLIT>ISIN<SPLIT>US4509111021<SPLIT>/<SPLIT>WKN<SPLIT>860023<SPLIT>)<SPLIT>weiterhin<SPLIT>mit<SPLIT>dem<SPLIT>Rating<SPLIT>"<SPLIT>equal-weight<SPLIT>"<SPLIT>ein<SPLIT>.
>>
>> Sollten<SPLIT>der<SPLIT>Branche<SPLIT>durch<SPLIT>"<SPLIT>populäre<SPLIT>Preiskürzungen<SPLIT>"<SPLIT>der<SPLIT>Regulierungsbehörden<SPLIT>weiter<SPLIT>Milliarden<SPLIT>entzogen<SPLIT>werden<SPLIT>,<SPLIT>sei<SPLIT>kaum<SPLIT>vorstellbar<SPLIT>,<SPLIT>wie<SPLIT>ein<SPLIT>flächendeckender<SPLIT>Breitbandausbau<SPLIT>noch<SPLIT>finanziert<SPLIT>werden<SPLIT>könnte<SPLIT>,<SPLIT>kritisierte<SPLIT>Obermann<SPLIT>.
>>
>> Grichting/Von<SPLIT>Bergen<SPLIT>bewährten<SPLIT>sich<SPLIT>abstimmen<SPLIT>Online/Print<SPLIT>Täglich<SPLIT>stellt<SPLIT>der<SPLIT>BLICK<SPLIT>eine<SPLIT>Frage<SPLIT>des<SPLIT>Tages<SPLIT>.
>>
>> Best<SPLIT>.<SPLIT>For<SPLIT>additional<SPLIT>information<SPLIT>,<SPLIT>please<SPLIT>visit<SPLIT>www.asih.bm<SPLIT>.<SPLIT>0<SPLIT>Bewertungen<SPLIT>dieses<SPLIT>Artikels<SPLIT>:<SPLIT>Kommentare<SPLIT>zu<SPLIT>diesem<SPLIT>Artikel<SPLIT>Geben<SPLIT>Sie<SPLIT>jetzt<SPLIT>einen<SPLIT>Kommentar<SPLIT>zu<SPLIT>diesem<SPLIT>Artikel<SPLIT>ab<SPLIT>.
>>
>>
>> On 13.03.2013 15:52, Jörn Kottmann wrote:
>>> Hello,
>>>
>>> can you tell us a bit more about your training data? Did you manually
>>> annotate these 300k sentences?
>>> Is it possible to post 10 lines or so here?
>>>
>>> Jörn
>>>
>>> On 03/12/2013 03:22 PM, Andreas Niekler wrote:
>>>> Dear List,
>>>>
>>>> I created a tokenizer model with 300k German sentences from a very
>>>> clean corpus.
>>>> I see some words that are very strangely separated by a tokenizer
>>>> using this model, like:
>>>>
>>>> stehenge - blieben
>>>> fre - undlicher
>>>>
>>>> and so on. I can't find those in my training data and wonder why OpenNLP
>>>> splits those words without any evidence in the training data and without
>>>> any whitespace in my text files. I trained the model with 500
>>>> iterations, cutoff 5 and alphanumeric optimisation.
>>>>
>>>> Can anyone suggest how I can prevent this?
>>>>
>>>> Thank you
>>>>
>>>> Andreas
>
--
Andreas Niekler, Dipl. Ing. (FH)
NLP Group | Department of Computer Science
University of Leipzig
Johannisgasse 26 | 04103 Leipzig
mail: [email protected]
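
For reference, the OpenNLP tokenizer training format keeps ordinary whitespace between tokens and uses <SPLIT> only where two tokens touch without whitespace in the running text. Below is a minimal Python sketch of how whitespace-tokenized sentences might be converted into that form instead of replacing every space. The function name `to_training_line` and the two punctuation sets are assumptions made for illustration, not part of OpenNLP; ambiguous characters such as `"` and `-` are deliberately left whitespace-separated and would need a more careful detokenizer. The input is a shortened version of one of the training sentences quoted above.

```python
SPLIT = "<SPLIT>"

# Hypothetical heuristic sets (assumptions, not an OpenNLP API):
# tokens that normally stick to the preceding or following token.
ATTACH_TO_PREVIOUS = {".", ",", ";", ":", "!", "?", ")", "»"}
ATTACH_TO_NEXT = {"(", "«"}


def to_training_line(tokens):
    """Join tokens with a space, or with <SPLIT> where the running
    text would have no whitespace between the two tokens."""
    parts = []
    for i, token in enumerate(tokens):
        if i > 0:
            glued = (token in ATTACH_TO_PREVIOUS
                     or tokens[i - 1] in ATTACH_TO_NEXT)
            parts.append(SPLIT if glued else " ")
        parts.append(token)
    return "".join(parts)


# Shortened example sentence, already tokenized on whitespace.
tokens = "Bundeskanzlerin Angela Merkel ( CDU ) ist am Dienstag eingetroffen .".split()
print(to_training_line(tokens))
```

With lines prepared this way, the whitespace the tokenizer defaults rely on is still present in the training data, and <SPLIT> marks only the boundaries next to punctuation.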
