Re: TokenizerTrainer

Jörn Kottmann Wed, 13 Mar 2013 12:14:29 -0700

The tokenizer's defaults are for text that is mostly whitespace-separated.
Did you lose all the whitespace in the text you want to process?


Jörn

On 03/13/2013 04:31 PM, Andreas Niekler wrote:

Hello,

I give you some examples below this comment. But I already noticed in
the code that the standard TokenizerTrainer call uses the standard
alphanumeric pattern, which won't work for typical German examples. Most
of the data will be separated because of the improper pattern in the
standard Factory.java class. My belief is that the de-token.bin model
was trained with a proper pattern within another implementation of the
training procedure.
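
To illustrate the pattern issue, here is a minimal sketch using only java.util.regex. It assumes the default alphanumeric pattern is the ASCII-only expression `^[A-Za-z0-9]+$`; the Unicode-aware alternative and the class and method names are hypothetical and not part of OpenNLP. Umlauts are written as Unicode escapes so the file compiles regardless of source encoding.

```java
import java.util.regex.Pattern;

final class AlphanumericPatternCheck {
    // Assumption: the default pattern covers ASCII letters and digits only.
    static final Pattern ASCII_ALNUM = Pattern.compile("^[A-Za-z0-9]+$");

    // Hypothetical alternative: any Unicode letter or digit.
    static final Pattern UNICODE_ALNUM = Pattern.compile("^[\\p{L}\\p{N}]+$");

    // True if the whole token consists of characters the pattern accepts.
    static boolean isAlnum(Pattern pattern, String token) {
        return pattern.matcher(token).matches();
    }

    public static void main(String[] args) {
        String[] tokens = {
            "Analyst",
            "D\u00fcsseldorf",        // Düsseldorf
            "gr\u00f6\u00dftenteils", // größtenteils
            "DE0006764749"
        };
        for (String token : tokens) {
            System.out.println(token
                + " ascii=" + isAlnum(ASCII_ALNUM, token)
                + " unicode=" + isAlnum(UNICODE_ALNUM, token));
        }
    }
}
```

Under that assumption, tokens containing umlauts or ß are not recognised as alphanumeric by the ASCII pattern, so they would not benefit from the alphanumeric optimisation, while the Unicode-aware pattern accepts them.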

Here are some training lines:

Senden<SPLIT>Pfleiderer<SPLIT>verkaufen<SPLIT>Düsseldorf<SPLIT>(<SPLIT>aktiencheck.de<SPLIT>AG<SPLIT>)<SPLIT>-<SPLIT>Der<SPLIT>Analyst<SPLIT>vom<SPLIT>Bankhaus<SPLIT>Lampe<SPLIT>,<SPLIT>Marc<SPLIT>Gabriel<SPLIT>,<SPLIT>stuft<SPLIT>die<SPLIT>Pfleiderer-Aktie<SPLIT>(<SPLIT>ISIN<SPLIT>DE0006764749<SPLIT>/<SPLIT>WKN<SPLIT>676474<SPLIT>)<SPLIT>von<SPLIT>"<SPLIT>halten<SPLIT>"<SPLIT>auf<SPLIT>"<SPLIT>verkaufen<SPLIT>"<SPLIT>herab<SPLIT>.
Der<SPLIT>vollständige<SPLIT>Zwischenbericht<SPLIT>wird<SPLIT>am<SPLIT>8<SPLIT>.<SPLIT>November<SPLIT>2010<SPLIT>um<SPLIT>12.00<SPLIT>Uhr<SPLIT>veröffentlicht<SPLIT>.
Besonders<SPLIT>in<SPLIT>ländlichen<SPLIT>Gegenden<SPLIT>sind<SPLIT>Telegrafenmaste<SPLIT>auch<SPLIT>heute<SPLIT>noch<SPLIT>weit<SPLIT>verbreitet<SPLIT>-<SPLIT>größtenteils<SPLIT>für<SPLIT>die<SPLIT>Festnetztelefonie<SPLIT>.
Newsticker<SPLIT>RSS-Feed<SPLIT>Morgenweb<SPLIT>Sarah<SPLIT>Palin<SPLIT>als<SPLIT>Reality-Star<SPLIT>im<SPLIT>US-Fernsehen<SPLIT>auf<SPLIT>Sendung<SPLIT>15.11.10<SPLIT>4:58<SPLIT>:<SPLIT>Washington<SPLIT>(<SPLIT>dpa<SPLIT>)<SPLIT>-<SPLIT>Sarah<SPLIT>Palin<SPLIT>hat<SPLIT>jetzt<SPLIT>eine<SPLIT>eigene<SPLIT>Show<SPLIT>.
Fotos<SPLIT>Terrorwarnung<SPLIT>-<SPLIT>Was<SPLIT>man<SPLIT>jetzt<SPLIT>beachten<SPLIT>sollte<SPLIT>Die<SPLIT>Sicherheitslage<SPLIT>spitzt<SPLIT>sich<SPLIT>zu<SPLIT>.
Newsticker<SPLIT>RSS-Feed<SPLIT>Morgenweb<SPLIT>Tausende<SPLIT>Siedler<SPLIT>protestieren<SPLIT>gegen<SPLIT>neuen<SPLIT>Baustopp<SPLIT>21.11.10<SPLIT>11:51<SPLIT>:<SPLIT>Jerusalem<SPLIT>(<SPLIT>dpa<SPLIT>)<SPLIT>-<SPLIT>Die<SPLIT>israelischen<SPLIT>Siedler<SPLIT>haben<SPLIT>ihre<SPLIT>Proteste<SPLIT>gegen<SPLIT>einen<SPLIT>erwarteten<SPLIT>neuen<SPLIT>Baustopp<SPLIT>im<SPLIT>Westjordanland<SPLIT>verschärft<SPLIT>.
Jetzt<SPLIT>einloggen<SPLIT>SchwarzKater<SPLIT>(<SPLIT>vor<SPLIT>4<SPLIT>Stunden<SPLIT>)<SPLIT>WTF<SPLIT>?
Das<SPLIT>Bankhaus<SPLIT>hat<SPLIT>das<SPLIT>Kursziel<SPLIT>für<SPLIT>die<SPLIT>Salzgitter-Aktien<SPLIT>von<SPLIT>69,00<SPLIT>auf<SPLIT>58,00<SPLIT>Euro<SPLIT>gesenkt<SPLIT>,<SPLIT>aber<SPLIT>die<SPLIT>Einstufung<SPLIT>auf<SPLIT>´<SPLIT>Overweight<SPLIT>´<SPLIT>belassen<SPLIT>.
Bundeskanzlerin<SPLIT>Angela<SPLIT>Merkel<SPLIT>(<SPLIT>CDU<SPLIT>)<SPLIT>ist<SPLIT>am<SPLIT>Dienstag<SPLIT>zum<SPLIT>Gipfel<SPLIT>der<SPLIT>Organisation<SPLIT>für<SPLIT>Sicherheit<SPLIT>und<SPLIT>Zusammenarbeit<SPLIT>in<SPLIT>Europa<SPLIT>(<SPLIT>OSZE<SPLIT>)<SPLIT>in<SPLIT>Kasachstan<SPLIT>eingetroffen<SPLIT>.
Mann<SPLIT>totgeprügelt<SPLIT>:<SPLIT>Haftstrafen<SPLIT>im<SPLIT>«<SPLIT>20-Cent-Prozess<SPLIT>»<SPLIT>Die<SPLIT>beiden<SPLIT>Schläger<SPLIT>jugendlichen<SPLIT>Schläger<SPLIT>sind<SPLIT>wegen<SPLIT>Körperverletzung<SPLIT>mit<SPLIT>Todesfolge<SPLIT>zu<SPLIT>Haftstrafen<SPLIT>verurteilt<SPLIT>worden<SPLIT>.
Börsen-Ticker<SPLIT>RSS<SPLIT>›<SPLIT>News<SPLIT>AKTIEN<SPLIT>SCHWEIZ/Vorbörse<SPLIT>:<SPLIT>Leicht<SPLIT>höhere<SPLIT>Eröffnung<SPLIT>erwartet<SPLIT>-<SPLIT>Positive<SPLIT>US-Vorgaben<SPLIT>06.12.2010<SPLIT>08:45<SPLIT>Zürich<SPLIT>(<SPLIT>awp<SPLIT>)<SPLIT>-<SPLIT>Der<SPLIT>Schweizer<SPLIT>Aktienmarkt<SPLIT>dürfte<SPLIT>die<SPLIT>Sitzung<SPLIT>vom<SPLIT>Montag<SPLIT>mit<SPLIT>moderaten<SPLIT>Gewinnen<SPLIT>eröffnen<SPLIT>.
Werbung<SPLIT>'<SPLIT>)<SPLIT>;<SPLIT>AIG<SPLIT>hatte<SPLIT>sich<SPLIT>auf<SPLIT>dem<SPLIT>US-Häusermarkt<SPLIT>verspekuliert<SPLIT>.
Außerordentliche<SPLIT>Hauptversammlung<SPLIT>genehmigt<SPLIT>Aktiensplit<SPLIT>und<SPLIT>Vorratsbeschlüsse<SPLIT>für<SPLIT>Kapitalmaßnahmen<SPLIT>Ad-hoc-Mitteilung<SPLIT>übermittelt<SPLIT>durch<SPLIT>euro<SPLIT>adhoc<SPLIT>mit<SPLIT>dem<SPLIT>Ziel<SPLIT>einer<SPLIT>europaweiten<SPLIT>Verbreitung<SPLIT>.
Börsen-Ticker<SPLIT>RSS<SPLIT>›<SPLIT>News<SPLIT>Führungscrew<SPLIT>übernimmt<SPLIT>bei<SPLIT>Modekette<SPLIT>Schild<SPLIT>Stefan<SPLIT>Portmann<SPLIT>und<SPLIT>Thomas<SPLIT>Herbert<SPLIT>heissen<SPLIT>die<SPLIT>neuen<SPLIT>starken<SPLIT>Männer<SPLIT>bei<SPLIT>Schild<SPLIT>.
IrfanView<SPLIT>Lizenz<SPLIT>:<SPLIT>Freeware<SPLIT>—<SPLIT>Hersteller-Website<SPLIT>IrfanView<SPLIT>ist<SPLIT>ein<SPLIT>für<SPLIT>private<SPLIT>Zwecke<SPLIT>kostenloses<SPLIT>Bildbetrachtungs-<SPLIT>und<SPLIT>Bildbearbeitungsprogramm<SPLIT>,<SPLIT>das<SPLIT>für<SPLIT>kleinere<SPLIT>Belange<SPLIT>durchaus<SPLIT>ausreicht<SPLIT>.
Dragonica<SPLIT>So<SPLIT>testet<SPLIT>4Players<SPLIT>Bitte<SPLIT>einloggen<SPLIT>,<SPLIT>um<SPLIT>Spiel<SPLIT>in<SPLIT>die<SPLIT>Watchlist<SPLIT>aufzunehmen<SPLIT>.
Rating-Update<SPLIT>:<SPLIT>London<SPLIT>(<SPLIT>aktiencheck.de<SPLIT>AG<SPLIT>)<SPLIT>-<SPLIT>Robert<SPLIT>T<SPLIT>.<SPLIT>Cornell<SPLIT>,<SPLIT>Scott<SPLIT>L<SPLIT>.<SPLIT>Gaffner<SPLIT>und<SPLIT>Darren<SPLIT>Yip<SPLIT>,<SPLIT>Analysten<SPLIT>von<SPLIT>Barclays<SPLIT>Capital<SPLIT>,<SPLIT>stufen<SPLIT>die<SPLIT>Aktie<SPLIT>von<SPLIT>ITT<SPLIT>Industries<SPLIT>(<SPLIT>ISIN<SPLIT>US4509111021<SPLIT>/<SPLIT>WKN<SPLIT>860023<SPLIT>)<SPLIT>weiterhin<SPLIT>mit<SPLIT>dem<SPLIT>Rating<SPLIT>"<SPLIT>equal-weight<SPLIT>"<SPLIT>ein<SPLIT>.
Sollten<SPLIT>der<SPLIT>Branche<SPLIT>durch<SPLIT>"<SPLIT>populäre<SPLIT>Preiskürzungen<SPLIT>"<SPLIT>der<SPLIT>Regulierungsbehörden<SPLIT>weiter<SPLIT>Milliarden<SPLIT>entzogen<SPLIT>werden<SPLIT>,<SPLIT>sei<SPLIT>kaum<SPLIT>vorstellbar<SPLIT>,<SPLIT>wie<SPLIT>ein<SPLIT>flächendeckender<SPLIT>Breitbandausbau<SPLIT>noch<SPLIT>finanziert<SPLIT>werden<SPLIT>könnte<SPLIT>,<SPLIT>kritisierte<SPLIT>Obermann<SPLIT>.
Grichting/Von<SPLIT>Bergen<SPLIT>bewährten<SPLIT>sich<SPLIT>abstimmen<SPLIT>Online/Print<SPLIT>Täglich<SPLIT>stellt<SPLIT>der<SPLIT>BLICK<SPLIT>eine<SPLIT>Frage<SPLIT>des<SPLIT>Tages<SPLIT>.
Best<SPLIT>.<SPLIT>For<SPLIT>additional<SPLIT>information<SPLIT>,<SPLIT>please<SPLIT>visit<SPLIT>www.asih.bm<SPLIT>.<SPLIT>0<SPLIT>Bewertungen<SPLIT>dieses<SPLIT>Artikels<SPLIT>:<SPLIT>Kommentare<SPLIT>zu<SPLIT>diesem<SPLIT>Artikel<SPLIT>Geben<SPLIT>Sie<SPLIT>jetzt<SPLIT>einen<SPLIT>Kommentar<SPLIT>zu<SPLIT>diesem<SPLIT>Artikel<SPLIT>ab<SPLIT>.
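
For comparison, the OpenNLP tokenizer training format separates tokens either by whitespace or by a <SPLIT> tag, with <SPLIT> reserved for boundaries that have no whitespace in the original text. The lines above use <SPLIT> at every boundary. The following is only a heuristic sketch of how such lines could be rewritten so that <SPLIT> stays next to common punctuation and a space is used elsewhere; the class name, method name, and punctuation sets are assumptions, and quotes and dashes are left space-separated because their attachment cannot be recovered from the data.

```java
import java.util.Set;
import java.util.regex.Pattern;

final class SplitMarkerRewriter {
    static final String SPLIT = "<SPLIT>";

    // Assumption: these tokens normally attach to the preceding token.
    static final Set<String> ATTACH_LEFT = Set.of(".", ",", ")", ":", ";", "?", "!");

    // Assumption: these tokens normally attach to the following token.
    static final Set<String> ATTACH_RIGHT = Set.of("(");

    // Rejoins tokens with a space, keeping <SPLIT> only around punctuation.
    static String restoreWhitespace(String line) {
        String[] tokens = line.split(Pattern.quote(SPLIT));
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) {
                boolean glued = ATTACH_LEFT.contains(tokens[i])
                        || ATTACH_RIGHT.contains(tokens[i - 1]);
                out.append(glued ? SPLIT : " ");
            }
            out.append(tokens[i]);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String line = "Jetzt<SPLIT>einloggen<SPLIT>SchwarzKater<SPLIT>(<SPLIT>vor"
                + "<SPLIT>4<SPLIT>Stunden<SPLIT>)<SPLIT>WTF<SPLIT>?";
        System.out.println(restoreWhitespace(line));
    }
}
```

A rewrite like this is lossy guesswork; where the original untokenized text is still available, aligning the tokens against it would be more reliable.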

On 13.03.2013 15:52, Jörn Kottmann wrote:

Hello,

Can you tell us a bit more about your training data? Did you manually
annotate these 300k sentences?
Is it possible to post 10 lines or so here?

Jörn

On 03/12/2013 03:22 PM, Andreas Niekler wrote:

Dear List,

I created a tokenizer model with 300k German sentences from a very clean
corpus. I see some words that are very strangely separated by a tokenizer
using this model, like:

stehenge - blieben
fre - undlicher

and so on. I can't find those in my training data and wonder why OpenNLP
splits those words without any evidence in the training data and without
any whitespace in my text files. I trained the model with 500
iterations, cutoff 5, and alphanumeric optimisation.

Can anyone suggest how I can prevent this?

Thank you

Andreas
