The regular expression name finder (opennlp.tools.namefind.RegexNameFinder) 
treats a sentence as a group of tokens separated by single spaces. The 
expressions use Java's Pattern syntax (java.util.regex.Pattern), and the tokens 
are separated by a space in your regular expression. So the regular expression 
would be something like "[Cc]omputer [aA]rchitecture", which accepts either an 
upper-case or a lower-case first letter in each word.
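
To illustrate the idea (this is only a sketch using plain java.util.regex, not 
the actual RegexNameFinder source; the class name TokenRegexSketch and its 
find method are made up for this example), the tokens can be joined with 
single spaces, matched against the pattern, and the character offsets mapped 
back to token indexes:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; it is not part of OpenNLP.
public class TokenRegexSketch {

    /**
     * Returns {startToken, endTokenExclusive} pairs for every regex match
     * that begins and ends exactly on a token boundary.
     */
    static List<int[]> find(String[] tokens, Pattern pattern) {
        // Join the tokens with single spaces and remember where each
        // token starts and ends in the joined sentence.
        StringBuilder sentence = new StringBuilder();
        int[] starts = new int[tokens.length];
        int[] ends = new int[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) {
                sentence.append(' ');
            }
            starts[i] = sentence.length();
            sentence.append(tokens[i]);
            ends[i] = sentence.length();
        }

        List<int[]> spans = new ArrayList<>();
        Matcher matcher = pattern.matcher(sentence);
        while (matcher.find()) {
            int startToken = indexOf(starts, matcher.start());
            int endToken = indexOf(ends, matcher.end());
            // Keep only matches aligned with whole tokens.
            if (startToken >= 0 && endToken >= 0) {
                spans.add(new int[] {startToken, endToken + 1});
            }
        }
        return spans;
    }

    // Linear search for a character offset; returns -1 when absent.
    private static int indexOf(int[] offsets, int value) {
        for (int i = 0; i < offsets.length; i++) {
            if (offsets[i] == value) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String[] tokens = {"I", "studied", "computer", "Architecture", "."};
        Pattern pattern = Pattern.compile("[Cc]omputer [aA]rchitecture");
        for (int[] span : find(tokens, pattern)) {
            System.out.println("tokens " + span[0] + ".." + span[1]);
        }
    }
}
```

With the sample tokens above, the single match covers the tokens "computer" 
and "Architecture". In OpenNLP itself you would pass your tokenized sentence 
to the name finder rather than use a helper like this one.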

The DictionaryNameFinder makes a similar attempt to handle multiple tokens.
-----Original Message-----
From: Florin Langa [mailto:[email protected]] 
Sent: Wednesday, May 29, 2013 11:56 AM
To: users; [email protected]
Subject: Re: Training model files question

Hello Jorn,

First of all, thank you for your answer. Now I have another question for 
you: what if my category1 contains multiple words?
For example, let's say that one category is "Computer architecture". As I 
understand it, only the first token (in this case "Computer") is considered. 
How can I create a category containing multiple tokens?
In the meantime I will follow your advice and have a look at the name 
finder as well.

Thank you!

Best regards,
Florin


2013/5/29 Jörn Kottmann <[email protected]>

> Hello,
>
> not sure I understand what you are trying to do.
>
> The doccat component can assign a category to a text (or a piece of 
> text), so that will probably work well if you want to assign a 
> category to an entire CV or a paragraph in it.
>
> If you want to identify skills mentioned inside a CV you might want to 
> use the name finder instead (have a look at its documentation).
>
> Anyway, the training format for the doccat component is one document 
> per line, with all tokens separated by whitespace; the first 
> token in a line is the category (explained in more detail in the 
> documentation, with a sample).
>
> like this:
> category1 token_a token_b token_c
> category2 token_c token_x
> ....
>
> To do some testing you should have at least a hundred lines in 
> your training file.
>
> HTH,
> Jörn
>
>
> On 05/29/2013 10:56 AM, Florin Langa wrote:
>
>> Hello everyone!
>>
>> I have a question...maybe it is a silly question but I don't know how to 
>> manage it. I need to build a classifier for CVs. In order to do this I 
>> assume that I need to build a model file containing a set of skills. 
>> I have a list of skills but I don't know how to build the input file. 
>> Here is a sample of my input file:
>>
>> Tiles and clinkers, setting experience Tile layer .
>> Silk screen printing Lead typesetter, printing shop .
>> CTI, computer telephony Alarm operator .
>> GifBuilder animation program Specialist book writer .
>> Gardening, study circle leadership Sports centre manager .
>> ........
>> etc.
>>
>> The first part, until the next capital letter is the skill name and 
>> the second part is the job name.
>> Ex: Gardening, study circle leadership - skill name, Sports centre 
>> manager
>> - job name.
>>
>> In order to create the actual training file I use the following command:
>>
>> opennlp DoccatTrainer -encoding UTF-8 -lang en -data /tmp/jobs.txt 
>> -model /tmp/en-language-jobs.bin
>>
>> Now, my question is if the input file I am providing to the above 
>> command has the right format.
>>
>> Also, please note that I was able to create the training file but 
>> when running the command
>>
>> opennlp Doccat /tmp/en-language-jobs.bin < /tmp/programmer.txt
>>
>> the results are 100% irrelevant.
>>
>> Best regards,
>> Florin
>>
>>
>
