Hi Will,
Retraining the relation extractor should be fairly easy. The
instructions I am about to give you apply if you are using cTAKES 3.0.
However, if you are planning to use the trunk version, my instructions
may no longer be accurate. Relation extraction has undergone some
changes recently in connection with the cTAKES-190 issue, and I don't
fully understand these most recent changes yet (but I am working on it).
1. Run PreprocessAndWriteXmi in the eval package, specifying the
location of the text of the notes, the location of the gold standard
relation annotations, and the output directory. This class will run all the
preprocessing that is required for relation extraction and add gold standard
relation annotations to the CAS. The resulting CASes will be saved to
disk as XMI files.
2. Run RelationExtractorEvaluation, passing it the location of the XMI
files obtained in the previous step and the --grid-search option. This
class will use the annotations in the XMI files to find the optimal
training parameters using grid search and n-fold cross-validation. After
the execution completes, record the best set of parameters found by the
grid search. If you don't have a lot of time, this step can be skipped
(you can just use the default SVM parameters).
3. Update the model parameters in the main() method of
RelationExtractorTrain (pipelines package) to the values found by the
grid search. Run RelationExtractorTrain, specifying the location of the
XMI files. This class will (a) create the model that is needed to
deploy the relation module, and (b) create the descriptor files
which ensure that the relation AEs can be used as part of a
UIMA pipeline.
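In case it helps to picture what the grid search in step 2 does, here is
a rough sketch in Python. It is not the cTAKES code: the parameter names
(cost, gamma), the fold count, and the toy_score function are made up
for illustration, and in practice the scorer would train on one set of
XMI files and evaluate on the held-out set.

```python
from itertools import product


def k_fold_indices(n_items, n_folds):
    """Yield (train_idx, test_idx) pairs for n-fold cross-validation."""
    folds = [list(range(i, n_items, n_folds)) for i in range(n_folds)]
    for held_out in range(n_folds):
        test_idx = folds[held_out]
        train_idx = [i for f, fold in enumerate(folds)
                     if f != held_out for i in fold]
        yield train_idx, test_idx


def grid_search(items, param_grid, train_and_score, n_folds=2):
    """Try every parameter combination; return (best_params, best_score)."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = []
        for train_idx, test_idx in k_fold_indices(len(items), n_folds):
            train = [items[i] for i in train_idx]
            test = [items[i] for i in test_idx]
            scores.append(train_and_score(train, test, params))
        avg = sum(scores) / len(scores)  # mean score across the folds
        if avg > best_score:
            best_params, best_score = params, avg
    return best_params, best_score


def toy_score(train, test, params):
    """Hypothetical stand-in for 'train a model, return F1 on test'."""
    return -abs(params["cost"] - 1.0) - abs(params["gamma"] - 0.1)


# Hypothetical grid; the real values depend on the classifier in use.
grid = {"cost": [0.1, 1.0, 10.0], "gamma": [0.01, 0.1]}
best, score = grid_search(list(range(10)), grid, toy_score)
print(best, score)
```

The parameters that come out as "best" here are the kind of values you
would copy into RelationExtractorTrain in step 3.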
If you are planning to annotate your data, it might be easier to use
Knowtator since we already have a gold standard reader for Knowtator. If
you want to use a different annotation tool, you just have to make sure
you add the manual annotations to the gold view of the XMI files. The
relation extractor reads the gold standard annotations from the gold view.
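Since you mentioned brat: below is a minimal sketch, assuming the usual
brat standoff (.ann) format, of pulling entities and binary relations
out of an annotation file as a first step toward writing them into the
gold view. The function and class names are hypothetical, the labels in
the sample are only placeholders, and the mapping onto the cTAKES type
system is not shown.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    id: str
    type: str
    start: int
    end: int
    text: str


class Relation(NamedTuple):
    id: str
    type: str
    arg1: str  # id of the first argument entity, e.g. "T1"
    arg2: str  # id of the second argument entity, e.g. "T2"


def parse_brat_ann(ann_text):
    """Parse text-bound (T) and relation (R) lines of a brat .ann file."""
    entities, relations = {}, []
    for line in ann_text.splitlines():
        if line.startswith("T"):
            ann_id, info, text = line.split("\t", 2)
            if ";" in info:
                # Discontinuous spans are skipped in this sketch.
                continue
            etype, start, end = info.split()
            entities[ann_id] = Entity(ann_id, etype, int(start), int(end), text)
        elif line.startswith("R"):
            ann_id, info = line.split("\t")[:2]
            rtype, arg1, arg2 = info.split()
            relations.append(Relation(ann_id, rtype,
                                      arg1.split(":", 1)[1],
                                      arg2.split(":", 1)[1]))
    return entities, relations


# Made-up annotations for the note text "pain in left knee".
sample = (
    "T1\tSignSymptom 0 4\tpain\n"
    "T2\tAnatomicalSite 8 17\tleft knee\n"
    "R1\tlocation_of Arg1:T1 Arg2:T2\n"
)
entities, relations = parse_brat_ann(sample)
for rel in relations:
    print(rel.type, entities[rel.arg1].text, entities[rel.arg2].text)
```

The character offsets in the T lines are what you would use to create
the corresponding annotations in the gold view.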
Hope this helps,
Dima
On 08/29/2013 06:07 PM, William Karl Thompson wrote:
Hello all,
I'm interested in training the relation extractor on some annotated notes from
Northwestern clinical data, and I understand that cleartk is currently being
used for this purpose in the cTAKES project. Could someone provide some
pointers on how to go about using cleartk to train models that can then be
invoked by a cTAKES module? Again, my focus for now is on the relation
extractor. In case it's relevant, I'm intending to use the brat rapid
annotation tool (http://brat.nlplab.org/) to generate a gold standard corpus.
Cheers,
Will
--
Dmitriy Dligach, PhD
Research Fellow
Children's Hospital Informatics Program
Boston Children's Hospital and Harvard Medical School
(617) 919-3596