[
https://issues.apache.org/jira/browse/CTAKES-155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
James Joseph Masanz updated CTAKES-155:
---------------------------------------
Fix Version/s: future enhancement (was: 3.2.3)
> SimpleSegmentWithTagsAnnotator assumes all section names are 5 characters
> -------------------------------------------------------------------------
>
> Key: CTAKES-155
> URL: https://issues.apache.org/jira/browse/CTAKES-155
> Project: cTAKES
> Issue Type: Bug
> Components: ctakes-core
> Affects Versions: 3.0-incubating
> Reporter: Steven Bethard
> Fix For: future enhancement
>
>
> The code in SimpleSegmentWithTagsAnnotator is a bit hard to follow, but I
> believe it assumes all section names are 5 characters long here:
> {code:java}
> fileReader.read(sectIdArr, 0, 5);
> {code}
> As a result, when the section name is longer than that, some part of the
> section heading (e.g. for a 6 letter section name, the final "]") is left in
> the text of the next section. This results, for example, in the dependency
> parser choking:
> {code:java}
> Caused by: java.lang.NullPointerException
>     at clear.pos.PosEnLib.isNoun(PosEnLib.java:56)
>     at clear.morph.MorphEnAnalyzer.getException(MorphEnAnalyzer.java:273)
>     at clear.morph.MorphEnAnalyzer.getLemma(MorphEnAnalyzer.java:247)
> {code}
> I would fix this but:
> (1) There are no tests for SimpleSegmentWithTagsAnnotator and its
> documentation actually says "Creates a single segment annotation that spans
> the entire document", which is just untrue, so I'm not really sure what this
> annotator is intended to do.
> (2) Even if I make some assumptions about what it's intended to do, the code
> is written in an extremely brittle fashion, and I'm afraid to make changes to
> that. For what it's worth, here's what I think the annotator should really
> look like:
> {code:java}
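> import java.util.regex.Matcher;
> import java.util.regex.Pattern;
> import org.apache.ctakes.typesystem.type.textspan.Segment;
> import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
> import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
> import org.apache.uima.jcas.JCas;
>
> public class SegmentsFromBracketedSectionTagsAnnotator extends JCasAnnotator_ImplBase {
>   // group(1) = start tag, group(2) = start id, group(3) = end tag, group(4) = end id
>   private static final Pattern SECTION_PATTERN = Pattern.compile(
>       "(\\[start section id=\"?(.*?)\"?\\]).*?(\\[end section id=\"?(.*?)\"?\\])",
>       Pattern.DOTALL);
>
>   @Override
>   public void process(JCas jCas) throws AnalysisEngineProcessException {
>     Matcher matcher = SECTION_PATTERN.matcher(jCas.getDocumentText());
>     while (matcher.find()) {
>       // The segment covers only the text between the start and end tags.
>       Segment segment = new Segment(jCas);
>       segment.setBegin(matcher.start() + matcher.group(1).length());
>       segment.setEnd(matcher.end() - matcher.group(3).length());
>       segment.setId(matcher.group(2));
>       segment.addToIndexes();
>     }
>   }
> }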
> {code}
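The offset arithmetic in the proposed annotator can be tried without UIMA or cTAKES on the classpath. The following is a minimal sketch that uses only java.util.regex and the same pattern as above. The class name SectionTagDemo, the Section holder, the sample document text, and the fixedWidthBodyStart helper are all hypothetical names introduced for illustration. In particular, fixedWidthBodyStart is a simplified stand-in for the fixed five-character read described in the issue, not the actual SimpleSegmentWithTagsAnnotator logic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SectionTagDemo {

    /** Hypothetical holder: a section id plus the offsets of its body text. */
    static final class Section {
        final String id;
        final int begin;
        final int end;

        Section(String id, int begin, int end) {
            this.id = id;
            this.begin = begin;
            this.end = end;
        }
    }

    // Same pattern as in the proposed annotator:
    // group(1) = start tag, group(2) = start id, group(3) = end tag.
    static final Pattern SECTION_PATTERN = Pattern.compile(
        "(\\[start section id=\"?(.*?)\"?\\]).*?(\\[end section id=\"?(.*?)\"?\\])",
        Pattern.DOTALL);

    /** Finds every bracketed section and records the span between its tags. */
    static List<Section> findSections(String text) {
        List<Section> sections = new ArrayList<>();
        Matcher m = SECTION_PATTERN.matcher(text);
        while (m.find()) {
            int begin = m.start() + m.group(1).length();
            int end = m.end() - m.group(3).length();
            sections.add(new Section(m.group(2), begin, end));
        }
        return sections;
    }

    /**
     * Simplified assumption of a fixed-width header read: the id is taken
     * to be exactly 5 characters, followed by one closing character.
     */
    static int fixedWidthBodyStart(String text) {
        String prefix = "[start section id=";
        int idStart = text.indexOf(prefix) + prefix.length();
        return idStart + 5 + 1;
    }

    public static void main(String[] args) {
        // Hypothetical document with a 6-character section id.
        String text = "[start section id=ABCDEF]Chest pain.[end section id=ABCDEF]";
        for (Section s : findSections(text)) {
            System.out.println(s.id + ": " + text.substring(s.begin, s.end));
        }
        System.out.println("fixed-width body: " + text.substring(fixedWidthBodyStart(text)));
    }
}
```

With the six-character id in the sample, the regex-based span contains only the section body, while the fixed-width offset leaves the closing "]" at the start of the body, which is the symptom described in the issue. Note that the pattern does not require the end tag's id to match the start tag's id; whether it should is a design decision left open here.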