Hello,

I'm not sure I understand what you are trying to do.

The doccat component can assign a category to a text (or a piece of text),
so that will probably work well if you want to assign a category to an entire
CV or a paragraph in it.

If you want to identify skills mentioned inside a CV you might want to use
the name finder instead (have a look at its documentation).

Anyway, the training format for the doccat component is one document per line.
The tokens in a line are separated by whitespace, and the first token in each
line is the category (this is explained in more detail in the documentation,
with a sample).

It looks like this:
category1 token_a token_b token_c
category2 token_c token_x
....
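
As a rough illustration of that format, here is a minimal Python sketch (not
part of OpenNLP). The helper names to_doccat_line and check_training_lines are
made up for this example, and splitting on whitespace is only a naive stand-in
for real tokenization. It builds one training line from a category and a piece
of text, and flags lines that do not have a category followed by at least one
token.

```python
def to_doccat_line(category, text):
    """Build one training line: the category first, then the tokens."""
    # Assumption: the category has to be a single token, so any whitespace
    # inside it is replaced with underscores.
    label = "_".join(category.split())
    # Assumption: naive whitespace tokenization; punctuation stays attached.
    tokens = text.split()
    if not label or not tokens:
        raise ValueError("need a non-empty category and at least one token")
    return " ".join([label] + tokens)


def check_training_lines(lines):
    """Return (line_number, problem) pairs for lines that do not fit the format."""
    problems = []
    for number, line in enumerate(lines, start=1):
        parts = line.split()
        if not parts:
            problems.append((number, "empty line"))
        elif len(parts) < 2:
            problems.append((number, "category without any tokens"))
    return problems


if __name__ == "__main__":
    print(to_doccat_line("category1", "token_a token_b token_c"))
    print(check_training_lines(["category2 token_c token_x", "category3"]))
```

A check like this only looks at the shape of each line. It says nothing about
whether the categories or the amount of training data are suitable.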

To do some testing you should have at least a hundred lines in your training file.

HTH,
Jörn

On 05/29/2013 10:56 AM, Florin Langa wrote:
Hello everyone!

I have a question... maybe it is a silly question, but I don't know how to
manage it. I need to build a classifier for CVs. In order to do this I
assume that I need to build a model file containing a set of skills. I have
a list of skills but I don't know how to build the input file. Here is a
sample of my input file:

Tiles and clinkers, setting experience Tile layer .
Silk screen printing Lead typesetter, printing shop .
CTI, computer telephony Alarm operator .
GifBuilder animation program Specialist book writer .
Gardening, study circle leadership Sports centre manager .
........
etc.

The first part, up to the next capitalized word, is the skill name and the
second part is the job name.
Ex: "Gardening, study circle leadership" is the skill name and "Sports centre
manager" is the job name.

In order to create the actual model file I use the following command:

opennlp DoccatTrainer -encoding UTF-8 -lang en -data /tmp/jobs.txt -model /tmp/en-language-jobs.bin

Now, my question is whether the input file I am providing to the above command
has the right format.

Also, please note that I was able to create the model file, but when I run the
command

opennlp Doccat /tmp/en-language-jobs.bin < /tmp/programmer.txt

the results are 100% irrelevant.

Best regards,
Florin

