What about this:

http://nlp.stanford.edu/links/statnlp.html


After reading this page:

http://www.natlang.com/nlp-datasets-download


I've found these:
http://pascallin.ecs.soton.ac.uk/Challenges/RTE/Datasets/

http://dblp.uni-trier.de/db/

http://www.cs.umass.edu/~mccallum/code-data.html

This is training data from the GENIA version 3.02 corpus.


   - Training Data: <http://natlang.com/sites/default/files/Genia4ERtaskV2.tar.gz>
     (Genia4ERtaskV2.tar.gz, 2,242 KiB)
   - Evaluation Data: <http://natlang.com/sites/default/files/Genia4EReval.tar.gz>
     (Genia4EReval.tar.gz, 840 KiB)

Some more:
http://www.natlang.com/natlang/


We can contact the universities and ask for permission to use their data sets.



On Wed, Jun 8, 2011 at 2:42 AM, James Kosin <[email protected]> wrote:

> Hi Eldad,
>
> Sorry for the late response....
>
> 1)  Yes, I also have similar success and failure with the NameFinder.
> Hopefully, we can come up with better training data.  The training data
> is simple for the NameFinder... basically, the NameFinder expects that
> the document has already been parsed with the Sentence Detector and the
> Tokenizer; though it isn't 100% required if you are training your own
> applications.
>
> Say you wanted to use the "Hi James," below.  Although it is not a
> complete sentence, it would go on its own line, and the tokenizer would
> produce "Hi James ," ... notice the space between James and the ','.
> The NameFinder expects the training data tokenized and tagged as
> follows: "Hi <START:person> James <END> ," ... notice the <START:person>
> and <END> tags around the name.  The older models used just <START> and
> <END> without the qualifier specifying the type of tag.
>
> We've also found that if you put a "Mr" or "Mrs" prefix before the name,
> it seems to recognize the name more easily.  Most of the training has
> been done on news articles and not everyday text.
>
> Jorn just started a project, which the group has been discussing over
> many runs, that involves collecting and parsing freely available data for
> the corpus.   https://cwiki.apache.org/OPENNLP/opennlp-annotations.html
> Please feel free to join the discussion and help with the tasks.  We are
> trying to provide open training sets to help with customization and with
> the issues related to using copyrighted material for the models.
>
> James
>
>
> On 6/5/2011 6:52 AM, Eldad Yamin wrote:
> > Hi James,
> >
> > Thank you for your great response!
> >
> > 1. I already used the command (as described in the documentation) and got
> > some nice results.
> >
> > The only problem that I've found is with the NameFinder: it didn't
> > recognize various names.
> >
> > Can you please explain how I can use the trainer to "make" it recognize
> > more names (people, places, etc.)?
> >
> >
> > 2. Linked documents, in other words related articles, for example (in
> > GATE):
> >
> > http://gate.ac.uk/biz/customers.html
> >
> > read the first paragraph under "media"
> >
> >
> >
> > 3. In addition, I have access to lots of texts/books written in Hebrew.
> > How can I use them to train the NameFinder (I will contribute them back)?
> >
> > And again, thank you very much!
> >
> > On Sun, Jun 5, 2011 at 2:04 AM, James Kosin <[email protected]> wrote:
> >
> >> Eldad,
> >>
> >> It is possible.
> >> 1)  This is easy enough with the current architecture and models.
> >> Basically, you pass in the document or paragraphs and split them into
> >> sentences using the SentenceDetector, which detects the sentences in the
> >> paragraph and returns a String array of sentences.  Next, the output from
> >> the sentence detector is put through the Tokenizer, which breaks each
> >> sentence into smaller parts: usually words, but it also separates
> >> punctuation from the words.  This is done for each sentence and
> >> returns a string list of tokens.   From here you have
> >> the raw data needed for most of the other models.  From your
> >> description, you will want to use the NameFinder and the supporting
> >> models to tag the people, locations, and organizations and the like.
> >>
> >> 2)  Not sure what you mean by link documents to others....
> >>
> >> 3)  We don't support all languages at the moment, mostly because
> >> training and test data need to be collected over many months and parsed
> >> before models can be trained.  Many groups have already done some work;
> >> unfortunately, most of it is copyrighted and in some cases difficult to get.
> >>
> >> This should get you started.
> >> http://incubator.apache.org/opennlp/documentation/manual/opennlp.html
> >>
> >> Download the release here...  Don't forget the models toward the bottom.
> >> http://incubator.apache.org/opennlp/download.cgi
> >>
> >> Let us know if you need anything else.
> >>
> >> James
> >>
> >>
> >> On 6/4/2011 12:30 PM, Eldad Yamin wrote:
> >>> Hello everyone,
> >>> After researching NLP, I have found OpenNLP to be one of the most
> >>> promising solutions at the moment.
> >>> However, I'm still looking for instructions on how to make OpenNLP fit
> >>> my needs.
> >>>
> >>> I need OpenNLP to:
> >>> 1. take a sentence/paragraph as input and return IE, annotation, named
> >>> entities (people, locations, organizations) and (numbers, dates, etc.).
> >>> 2. link documents to others.
> >>> 3. support multiple languages.
> >>>
> >>> Please advise,
> >>> Eldad.
> >>>
> >>
>
>
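
For anyone converting one of the data sets above, here is a minimal sketch of the NameFinder training format James describes in the quoted message ("Hi <START:person> James <END> ,"). It is written in Python only for brevity and does not call OpenNLP. The helper names tokenize and annotate are hypothetical, and the regex tokenizer is a rough stand-in that will not match the real Tokenizer's output on every input.

```python
import re


def tokenize(text):
    """Naive stand-in tokenizer (assumption, not OpenNLP's Tokenizer):
    splits runs of word characters from individual punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def annotate(tokens, spans):
    """Build one training line from tokens and name spans.

    spans is a list of (start, end, kind) token indexes, end exclusive,
    assumed to be non-overlapping, e.g. (1, 2, "person").
    """
    starts = {start: kind for start, _end, kind in spans}
    ends = {end for _start, end, _kind in spans}
    out = []
    for i, token in enumerate(tokens):
        if i in ends:
            out.append("<END>")  # close a name that ended before this token
        if i in starts:
            out.append(f"<START:{starts[i]}>")  # open a typed name tag
        out.append(token)
    if len(tokens) in ends:
        out.append("<END>")  # name runs to the end of the line
    return " ".join(out)


tokens = tokenize("Hi James,")
print(annotate(tokens, [(1, 2, "person")]))
# prints: Hi <START:person> James <END> ,
```

Each sentence would go on its own line of the training file; the span indexes have to come from whatever annotation the source corpus provides.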
