[ https://issues.apache.org/jira/browse/CTAKES-63?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13469355#comment-13469355 ]
Sean commented on CTAKES-63:
----------------------------
Pei raised the following in an email that was forwarded to me. I'm including
it here for documentation purposes (my response follows):
> After some debugging, I found that this happens when the token contains a
> dash (-) and a special character such as the right bracket (]).
> I believe all of the characters in the QueryParser string token should be
> escaped to avoid issues such as a token ending with ']'.
>
> Before we add and test the proposed fix (adding an escape() call, as shown
> below), I also noticed another potential issue: we search for and replace
> all dashes with spaces. I just want to confirm that this was done
> intentionally and works because the dashes have already been removed in
> the index. Otherwise, we'll need to replace the dash with a '?' instead
> of a space, or use a PhraseQuery instead of a TermQuery. It would be great
> if someone familiar with this bit of code could confirm.
>
> LuceneDictionaryImpl.java (dictionary-lookup) [~Line 106]
>
> if (str.indexOf('-') == -1) {
>     q = new TermQuery(new Term(iv_lookupFieldName, str));
>     topDoc = iv_searcher.search(q, iv_maxHits);
> }
> else { // needed the KeywordAnalyzer for situations where the hyphen was included in the f-word
>     QueryParser query = new QueryParser(Version.LUCENE_30, iv_lookupFieldName, new KeywordAnalyzer());
>     try {
>         // topDoc = iv_searcher.search(query.parse(str.replace('-', ' ')), iv_maxHits);
>         // proposed fix
>         String escaped = QueryParser.escape(str.replace('-', ' '));
>         topDoc = iv_searcher.search(query.parse(escaped), iv_maxHits);
>     } catch (ParseException e) {
>         // TODO Auto-generated catch block
>         e.printStackTrace();
>     }
> }
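To make the effect of the proposed fix concrete, here is a minimal sketch in plain Java of what escaping does to such a token. The escape() method below is a hypothetical stand-in for QueryParser.escape(), and its list of special characters is an assumption modeled on the Lucene query syntax, not copied from the Lucene source. The token "x-ray]" is also made up for illustration.

```java
class EscapeSketch {
    // Assumed set of query-syntax special characters (an approximation,
    // not taken from the Lucene source).
    private static final String SPECIALS = "\\+-!():^[]\"{}~*?|&";

    // Hypothetical stand-in for QueryParser.escape(String): prefixes each
    // special character with a backslash and leaves everything else as-is.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 8);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (SPECIALS.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String token = "x-ray]"; // made-up token with a dash and a trailing bracket

        // Order used in the proposed fix: replace the dash first, then escape.
        System.out.println(escape(token.replace('-', ' '))); // prints: x ray\]

        // Escaping without the replacement keeps the dash, now escaped.
        System.out.println(escape(token));                   // prints: x\-ray\]
    }
}
```

Note that in the first case the space introduced by the replacement is not escaped, so a query parser that splits clauses on whitespace would still see two terms. That is the behavior the question about dashes in the index is asking about.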
I was the author of the code in question. Prior versions of cTAKES used
dictionary resources that required this workaround for situations where a
hyphen was contained in the first term (f-word) being looked up. Part of the
issue was that hyphenated terms would be handled as single tokens; however,
the problem had more to do with how the Lucene dictionary was built than with
the content of the dictionary.
After some experimentation, I discovered that how the field was indexed
affected what could be queried within the string. Indexing the field as
follows gave better results:
document.add(new Field("first_word", s[0].trim(), Field.Store.YES, Field.Index.ANALYZED));
> exception formed by malformed email address
> -------------------------------------------
>
> Key: CTAKES-63
> URL: https://issues.apache.org/jira/browse/CTAKES-63
> Project: cTAKES
> Issue Type: Bug
> Components: ctakes-dictionary-lookup
> Affects Versions: 2.6-incubating
> Environment: windows
> Reporter: Chen Lin
> Priority: Critical
> Labels: Stability
>
> 2012-09-21 12:48:36,789 INFO
> edu.mayo.bmi.uima.lookup.ae.UmlsDictionaryLookupAnnotator - process(JCas)
> org.apache.lucene.queryParser.ParseException: Cannot parse 'mailto:abcoman@t
> nec.org]': Lexical error at line 1, column 26. Encountered: <EOF> after : ""
>     at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:192)
>     at edu.mayo.bmi.dictionary.lucene.LuceneDictionaryImpl.getEntries(LuceneDictionaryImpl.java:106)
>     at edu.mayo.bmi.dictionary.DictionaryEngine.metaLookup(DictionaryEngine.java:181)