Tomasz, IIRC, the code in SubjectCleartkAnalysisEngine.java should show which feature extractors are used. I believe there is an enum of preset feature sets, but I do not recall exactly which one performed best on the test set, so it is probably best to check the source code.
I think adding the plain-sentence examples in Jira would be a great help, since we can use them for unit testing at a minimum. Currently, there is no easy way to append training data, so one has to create a new training set with the examples included. The code used for training is also in the project; it should be in the **/eval/* namespaces. I believe the gold standard was created in XML (either Knowtator or Anafora). Hope that helps.
--Pei

On Thu, Jul 23, 2015 at 10:33 AM, Tomasz Oliwa <[email protected]> wrote:
> What format (features, labels) is best suited for some more training
> examples?
>
> The SubjectCleartkAnalysisEngine class loads a
> /org/apache/ctakes/assertion/models/subject/model.jar, which contains a
> liblinear cleartk model.
>
> The model has 3 features, label 12 3.
>
> But what exactly are the features, and how are they derived?
>
> What does the target class look like? Is it really differentiating between
> "patient", "brother", "sister" etc., or is it a binary decision model between
> "patient" and "family_history" (the latter is what it looks like to me)?
>
> This is not documented.
>
> Tomasz
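
For anyone who wants to look inside the model.jar mentioned above, here is a minimal sketch that lists the entries of a jar using only the standard java.util.jar API. It assumes a local copy of model.jar has been extracted from the cTAKES resources. The class name ModelJarLister and the method listEntries are hypothetical and are not part of cTAKES or ClearTK, and the sketch makes no claim about which files the jar actually contains.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

// Hypothetical helper for illustration only; not part of cTAKES or ClearTK.
class ModelJarLister {

    // Returns the names of all entries in the given jar, in the order the archive reports them.
    static List<String> listEntries(Path jarPath) throws IOException {
        List<String> names = new ArrayList<>();
        try (JarFile jar = new JarFile(jarPath.toFile())) {
            Enumeration<JarEntry> entries = jar.entries();
            while (entries.hasMoreElements()) {
                names.add(entries.nextElement().getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is assumed to be the path to a local copy of model.jar.
        for (String name : listEntries(Paths.get(args[0]))) {
            System.out.println(name);
        }
    }
}
```

Running it with the path to the extracted jar as the only argument prints one entry name per line, which may help locate the files that describe the model's features and outcome labels.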
