[ https://issues.apache.org/jira/browse/LUCENE-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Grant Ingersoll reassigned LUCENE-2001:
---------------------------------------

    Assignee: Grant Ingersoll

> wordnet parsing bug
> -------------------
>
>                 Key: LUCENE-2001
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2001
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/*
>    Affects Versions: 2.9
>            Reporter: Robert Muir
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 2.9.1, 3.0
>
>         Attachments: LUCENE-2001.patch, LUCENE-2001_branch.patch,
> LUCENE-2001_branch.patch
>
>
> A user reported that wordnet parses the prolog file incorrectly.
> Also need to check the wordnet parser in the memory contrib for this problem.
> If this is a false alarm, I'm not worried, because the test will be the first
> unit test the wordnet package has ever had.
> {noformat}
> For example, looking up the synsets for the word "king", we get:
> java SynLookup wnindex king
> baron
> magnate
> mogul
> power
> queen
> rex
> scrofula
> struma
> tycoon
> Here, "scrofula" and "struma" are extraneous. This happens because the line
> parser code in Syns2Index.java interprets the two consecutive single quotes
> in the entry s(114144247,3,'king''s evil',n,1,1) in the wn_s.pl file as the
> termination of the string and truncates the word to "king". This entry
> belongs to the synset of the words "scrofula" and "struma", and thus they get
> inserted in the synset of "king". *There are 1382 such entries in wn_s.pl*,
> and more in other WordNet Prolog database files, where such use of two
> consecutive single quotes appears.
> We have resolved this by adding a statement in the line parsing portion of
> Syns2Index.java, as follows:
> // parse line
> line = line.substring(2);
> * line = line.replaceAll("\'\'", "`"); // added statement*
> int comma = line.indexOf(',');
> String num = line.substring(0, comma); ... ... etc.
> In short, we replace "''" by "`" (a back-quote).
> Then on recreating the index, we get:
> java SynLookup zwnindex king
> baron
> magnate
> mogul
> power
> queen
> rex
> tycoon
> {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
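
The workaround quoted above rewrites the data, so the indexed word becomes "king`s evil" rather than "king's evil". In Prolog, two consecutive single quotes inside a quoted atom stand for one literal quote, so another option is to unescape them while splitting the line. The following is only a sketch of that idea under stated assumptions: the class PrologLineSketch and the method parseFields are hypothetical names, and this is neither code from Syns2Index.java nor the attached patch.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Lucene: splits the fields of one wn_s.pl fact. */
final class PrologLineSketch {

    /**
     * Splits a line such as s(114144247,3,'king''s evil',n,1,1). into its fields.
     * Inside a quoted atom, '' is treated as one literal single quote.
     */
    static List<String> parseFields(String line) {
        int open = line.indexOf('(');
        int close = line.lastIndexOf(')');
        if (open < 0 || close < open) {
            throw new IllegalArgumentException("not a Prolog fact: " + line);
        }
        String body = line.substring(open + 1, close);

        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < body.length(); i++) {
            char c = body.charAt(i);
            if (inQuotes) {
                if (c != '\'') {
                    current.append(c);
                } else if (i + 1 < body.length() && body.charAt(i + 1) == '\'') {
                    current.append('\''); // doubled quote: keep one, skip the other
                    i++;
                } else {
                    inQuotes = false;     // single quote: end of the quoted atom
                }
            } else if (c == '\'') {
                inQuotes = true;          // start of a quoted atom
            } else if (c == ',') {
                fields.add(current.toString()); // a comma outside quotes ends a field
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    public static void main(String[] args) {
        List<String> f = parseFields("s(114144247,3,'king''s evil',n,1,1).");
        System.out.println(f.get(0) + " -> " + f.get(2)); // 114144247 -> king's evil
    }
}
```

With this approach the entry stays attached to its own synset number (114144247) and the word keeps its apostrophe, so "scrofula" and "struma" would no longer be filed under "king". Commas inside a quoted word are also kept, since the split only happens outside quotes.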