On 02/19/2014 01:25 PM, William Colen wrote:
> Is the SequenceValidator the only thing we need to change? If a corpus uses
> BILOU, the formatters need to convert it to IOB2?

The format parsing code creates Span objects. The name finder and chunker take these Span objects and
then perform IOB2 coding on them (start, cont, other).
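
As a rough illustration of the encoding step, here is a minimal sketch. It assumes a hypothetical SimpleSpan stand-in (start inclusive, end exclusive) instead of the real Span class, untyped tags, and non-overlapping spans; it is not the actual name finder code.

```java
import java.util.Arrays;

class Iob2EncodeSketch {

    /** Hypothetical stand-in for Span: start is inclusive, end is exclusive. */
    static final class SimpleSpan {
        final int start;
        final int end;

        SimpleSpan(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Produces one tag per token: "start", "cont" or "other". */
    static String[] encode(SimpleSpan[] names, int length) {
        String[] tags = new String[length];
        Arrays.fill(tags, "other");
        for (SimpleSpan name : names) {
            // The first token of a span is always tagged "start" ...
            tags[name.start] = "start";
            // ... and every further token of the same span is tagged "cont".
            for (int i = name.start + 1; i < name.end; i++) {
                tags[i] = "cont";
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        // Tokens: "Mr. John Smith visited Berlin", names at [1,3) and [4,5).
        SimpleSpan[] names = { new SimpleSpan(1, 3), new SimpleSpan(4, 5) };
        // Prints: other start cont other start
        System.out.println(String.join(" ", encode(names, 5)));
    }
}
```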

The coding is done in two places: during training the Span objects are encoded into tag sequences, and during
tagging the tag sequences are decoded into Span objects again.
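
The decoding direction could look roughly like the sketch below. It is only an illustration: spans are returned as hypothetical int[]{start, end} pairs (end exclusive) rather than Span objects, and a "cont" with no preceding "start" is simply ignored, which is an assumption about how invalid input should be treated.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class Iob2DecodeSketch {

    /** Turns a start/cont/other tag sequence into {start, end} pairs, end exclusive. */
    static List<int[]> decode(List<String> tags) {
        List<int[]> spans = new ArrayList<>();
        int start = -1; // index where the currently open span began, -1 if none
        for (int i = 0; i < tags.size(); i++) {
            String tag = tags.get(i);
            if (tag.equals("cont") && start != -1) {
                continue; // the open span simply grows by one token
            }
            if (start != -1) {
                // Any tag other than "cont" closes the open span.
                spans.add(new int[] { start, i });
                start = -1;
            }
            if (tag.equals("start")) {
                start = i;
            }
        }
        if (start != -1) {
            // A span that runs to the end of the sentence is closed here.
            spans.add(new int[] { start, tags.size() });
        }
        return spans;
    }

    public static void main(String[] args) {
        List<String> tags = Arrays.asList("other", "start", "cont", "other", "start");
        // Prints [1,3) and then [4,5)
        for (int[] span : decode(tags)) {
            System.out.println("[" + span[0] + "," + span[1] + ")");
        }
    }
}
```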

An interface like this could work for the name finder (didn't check the chunker yet):
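public interface SequenceCodec {
  // Decodes a tag sequence back into Span objects (used during tagging).
  Span[] decode(List<String> c);
  // Encodes the Spans as a tag sequence of the given length (used during training).
  String[] encode(Span[] names, int length);
  // Creates the validator that matches this codec's tag scheme.
  SequenceValidator createSequenceValidator();
}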

The SequenceValidator of course depends on the codec in use, so it could be created by a factory
method on the codec.
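
To illustrate why the validator has to come from the codec, here is a small sketch with two factory methods. The TagValidator interface is a hypothetical, simplified stand-in for SequenceValidator, and the BILOU tag names (begin, in, last, unit, other) are assumptions made only for this example.

```java
class ValidatorSketch {

    /** Hypothetical, simplified validator: may `outcome` follow the outcomes chosen so far? */
    interface TagValidator {
        boolean validNext(String[] previousOutcomes, String outcome);
    }

    /** IOB2 rule: "cont" is only allowed directly after "start" or "cont". */
    static TagValidator createIob2Validator() {
        return (prev, outcome) -> {
            if (!outcome.equals("cont")) {
                return true;
            }
            if (prev.length == 0) {
                return false;
            }
            String last = prev[prev.length - 1];
            return last.equals("start") || last.equals("cont");
        };
    }

    /** BILOU rule: inside a name only "in"/"last" may follow, outside a name they may not. */
    static TagValidator createBilouValidator() {
        return (prev, outcome) -> {
            boolean insideName = prev.length > 0
                && (prev[prev.length - 1].equals("begin") || prev[prev.length - 1].equals("in"));
            boolean continuesName = outcome.equals("in") || outcome.equals("last");
            return insideName == continuesName;
        };
    }

    public static void main(String[] args) {
        TagValidator iob2 = createIob2Validator();
        TagValidator bilou = createBilouValidator();
        System.out.println(iob2.validNext(new String[] { "other" }, "cont"));   // false
        System.out.println(iob2.validNext(new String[] { "start" }, "other"));  // true
        System.out.println(bilou.validNext(new String[] { "begin" }, "other")); // false
        System.out.println(bilou.validNext(new String[] { "begin" }, "last"));  // true
    }
}
```

The same tag history is valid under one scheme and invalid under the other, so a single shared validator cannot serve both codecs. Note that this step-by-step check cannot reject a BILOU sequence that ends inside an unfinished name.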

Some machine learners, e.g. Mallet CRF, don't support our sequence validation. I am not yet sure how we
should handle that case.

Jörn
