On 1/4/2013 10:32 AM, Adithya .R wrote:
Hi All,
I am trying to parse a string into sentences using the sentence detector.

The data is in English, UTF-8 format, and has many abbreviations (medical
text).

I need the sentence detector to accept a list of abbreviations. I am using
the Dictionary Class like this:

Dictionary abbrDict = new Dictionary();

         try {
             //abbrDict = new Dictionary( FileInputStream(new File(pathToAbbr)));
             abbrString = readFile(pathToAbbr).replaceAll("(\\t|\\r?\\n)+", " ");
             for (String abbr : abbrString.split(" ")) {
                 StringList abbrList = new StringList(abbr);
                 System.out.println( abbrList.getToken(0) );
                 abbrDict.put(abbrList);

             }
         } catch (Exception ex) {
             ex.printStackTrace();
         }

         System.out.println( abbrDict.size() + " is the size of dict "  + abbrDict.toString() );

_______________________________________________________________________________

The output of the last line looks like this:
9 is the size of dict [[L.M.P.], [D.O.A.], [L.S.A.], [R.S.T.], [A.G.A.],
[R.F.P.], [R.S.P.], [S.L.P.], [R.F.A.]]

My question is: is this the right way to do it? If so, why does the
sentence detector still not split sentences correctly around these
abbreviations?

Any help would be appreciated.

Adi

Adi,

Which sentence detector are you trying to use?

I've been able to train the sentence detector model with many sentences, and it has managed to figure out how to handle abbreviations such as Inc., etc., and others.

James
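
For illustration only, here is a minimal plain-Java sketch of what an abbreviation list is meant to achieve: a period that ends a known abbreviation should not be treated as a sentence boundary. This is a simple rule-based splitter, not the statistical sentence detector discussed above, and it does not use the Dictionary or StringList classes. The class and method names (AbbreviationAwareSplitter, parseAbbreviations, split) are hypothetical, and the sample text is made up.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of any library: a rule-based illustration only.
final class AbbreviationAwareSplitter {

    // Parse a whitespace-separated abbreviation list (spaces, tabs, newlines),
    // skipping the empty tokens that a plain split(" ") can produce.
    static Set<String> parseAbbreviations(String raw) {
        Set<String> result = new HashSet<>();
        for (String token : raw.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    // Split on tokens ending in '.', '?' or '!', unless the whole token
    // is a known abbreviation.
    static List<String> split(String text, Set<String> abbreviations) {
        List<String> sentences = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(token);
            boolean looksLikeEnd =
                token.endsWith(".") || token.endsWith("?") || token.endsWith("!");
            if (looksLikeEnd && !abbreviations.contains(token)) {
                sentences.add(current.toString());
                current.setLength(0);
            }
        }
        // Keep any trailing text that had no closing punctuation.
        if (current.length() > 0) {
            sentences.add(current.toString());
        }
        return sentences;
    }

    public static void main(String[] args) {
        Set<String> abbrs = parseAbbreviations("L.M.P.\tD.O.A.\nR.S.T.");
        String text = "The L.M.P. was noted. Patient was D.O.A. on arrival.";
        for (String sentence : split(text, abbrs)) {
            System.out.println(sentence);
        }
    }
}
```

One known limitation of this sketch: an abbreviation that really does end a sentence is never treated as a boundary, which is one reason a model trained on example sentences tends to do better than a fixed list.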
