Hi OpenNLP community,

I have a large collection of PDF files containing text (more than 100 MB).

The text contains many references. A reference usually looks like this:
name, date, ID. However, the order is not always the same (for example
ID, date, name or name, ID, date), and sometimes not all of the
information is given (for example name, date or name, ID, or just ID).
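
To make these variations concrete, here is a small sketch in Java. The
name, date format, and ID format are invented for illustration and are not
taken from my PDFs, and the Reference class is only a hypothetical way of
holding the result:

```java
import java.util.List;

// Hypothetical holder for one extracted reference.
// Any field may be null when that part is missing from the text.
final class Reference {
    final String name;
    final String date;
    final String id;

    Reference(String name, String date, String id) {
        this.name = name;
        this.date = date;
        this.id = id;
    }

    // True only when name, date, and ID were all present.
    boolean isComplete() {
        return name != null && date != null && id != null;
    }
}

final class ReferenceExamples {
    // Invented raw strings showing the orderings and omissions described above.
    static final List<String> RAW = List.of(
        "Dupont, 2010-03-12, AB-1234",  // name, date, ID
        "AB-1234, 2010-03-12, Dupont",  // ID, date, name
        "Dupont, AB-1234, 2010-03-12",  // name, ID, date
        "Dupont, 2010-03-12",           // name, date (no ID)
        "AB-1234"                       // ID only
    );

    // What I would like to get back for each raw string, in the same order.
    static final List<Reference> EXPECTED = List.of(
        new Reference("Dupont", "2010-03-12", "AB-1234"),
        new Reference("Dupont", "2010-03-12", "AB-1234"),
        new Reference("Dupont", "2010-03-12", "AB-1234"),
        new Reference("Dupont", "2010-03-12", null),
        new Reference(null, null, "AB-1234")
    );
}
```

The first three raw strings should all give the same complete reference;
the last two should give partial ones.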

I have extracted about one hundred references by hand from these PDFs.

Now I would like to use OpenNLP to learn from this hand-made list and
then extract every reference (= name + date + ID) it can find in the
full text of the PDFs.

After reading the documentation (the manual), I am not sure which approach
is best. Could someone advise me on which components to use (Sentence
Detector? Tokenizer?)? Neither of them seems able to do exactly what I am
looking for, but maybe I missed something. I'm a little lost.
Thanks very much.

Alex
