Hi OpenNLP community, I have a large collection of PDF files (more than 100 MB in total) that contain text.
This text contains many references. A reference usually looks like this: name, date, ID. However, the order is not always the same (for example ID, date, name or name, ID, date), and sometimes not all of the information is given (for example name, date or name, ID, or just the ID). I have extracted around one hundred references by hand from these PDFs. Now I would like to use OpenNLP to learn from this hand-made list and extract every reference (= name + date + ID) it can find in the full text of the PDFs. After reading the documentation (the manual), I am still not sure which approach is best. Could someone give me some advice on how to proceed? Should I use the Sentence Detector or the Tokenizer? Neither of them seems to do exactly what I am looking for, but maybe I missed something. I'm a little lost :s. Thanks very much. Alex
