On 23 September 2013 08:17, Per Tunedal <per.tune...@operamail.com> wrote:
> Hi,
> what should the text-files look like before starting the tagger
> training? One sentence a line? Something else?
> Is a text formatted like below OK:
> Antingen genom att gå in under rätt rubrik ovan och lägga till ditt
> bidrag eller lägg ditt bidrag i bufferten om du inte vet var eller hur
> det ska stå.
> I Önskelistan lägger du förslag på sånt du tycker borde vara med.
> Or should e.g. the punctuation marks be separated like:
> I Önskelistan lägger du förslag på sånt du tycker borde vara med .

No, you don't need to do that. You don't really need to have the text
split into sentences either, but it makes life a little easier if
there are problems.
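
If you do want one sentence per line, something like the following
would be enough. This is only a rough sketch: the function name and the
regular expression are my own assumptions, not part of any Apertium
tool, and the split is naive (it will break on abbreviations).

```python
import re

def one_sentence_per_line(text):
    # Collapse hard line wraps so each sentence is a single string.
    flat = " ".join(text.split())
    # Naive split: break after '.', '!' or '?' followed by whitespace.
    # This is an assumption for illustration; abbreviations will confuse it.
    return [s for s in re.split(r"(?<=[.!?])\s+", flat) if s]

sample = (
    "Antingen genom att gå in under rätt rubrik ovan och lägga till ditt\n"
    "bidrag eller lägg ditt bidrag i bufferten om du inte vet var eller hur\n"
    "det ska stå.\n"
    "I Önskelistan lägger du förslag på sånt du tycker borde vara med.\n"
)

for sentence in one_sentence_per_line(sample):
    print(sentence)
```

Punctuation stays attached to the words; there is no need to separate it.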

Some of the older language pairs have makefiles for tagger training.
At a minimum, you will need to adapt the variables for language, and
make sure that lt-proc is called with the same set of switches as the
primary mode (if you're training for Swedish in sv-da, the mode will
be the one that starts <mode name="sv-da" install="yes">).
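
To check which switches lt-proc gets in that mode, you can read them out
of the modes file. A minimal sketch, assuming the usual
mode/pipeline/program/file layout; the XML excerpt below is a made-up
example (the switches and file names in your pair may differ), and the
function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt of a modes file; check your own pair for the real one.
SAMPLE_MODES = """<modes>
  <mode name="sv-da" install="yes">
    <pipeline>
      <program name="lt-proc -w">
        <file name="sv-da.automorf.bin"/>
      </program>
      <program name="apertium-tagger -g $2">
        <file name="sv-da.prob"/>
      </program>
    </pipeline>
  </mode>
</modes>"""

def lt_proc_command(modes_xml, mode_name):
    """Return the lt-proc command (with switches and files) for a mode, or None."""
    root = ET.fromstring(modes_xml)
    for mode in root.iter("mode"):
        if mode.get("name") != mode_name:
            continue
        for program in mode.iter("program"):
            parts = program.get("name", "").split()
            # The first lt-proc step in the pipeline is the analyser.
            if parts and parts[0] == "lt-proc":
                files = [f.get("name") for f in program.findall("file")]
                return parts + files
    return None

print(lt_proc_command(SAMPLE_MODES, "sv-da"))
```

Whatever switches this reports are the ones the lt-proc call in the
training makefile should use as well.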

The tagset specification is where you have the most scope to control
the tagger. I wrote a linter tool because of the problems you were
reporting; I'd recommend that you run it before training.

<Sefam> Are any of the mentors around?
<jimregan> yes, they're the ones trolling you

