We could do it with the Leipzig corpus or CoNLL. We can prepare the corpus by detokenizing it and then creating documents from the detokenized sentences.
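To make the preparation step concrete, here is a rough sketch of what I have in mind. It is not OpenNLP code. The punctuation rules, the function names `detokenize` and `make_documents`, and the fixed number of sentences per document are all assumptions for illustration only. A real setup would need language-specific rules for quotes, hyphens and so on.

```python
# Hypothetical sketch: rebuild raw text from a tokenized corpus so it can be
# used to evaluate a sentence detector and a tokenizer.

# Tokens that attach to the previous token (assumed, simplified rule set).
NO_SPACE_BEFORE = {".", ",", ";", ":", "!", "?", ")", "]", "}", "%"}
# Tokens that attach to the following token.
NO_SPACE_AFTER = {"(", "[", "{"}


def detokenize(tokens):
    """Join one sentence's tokens into a string using the simple rules above."""
    out = []
    for i, tok in enumerate(tokens):
        needs_space = (
            i > 0
            and tok not in NO_SPACE_BEFORE
            and tokens[i - 1] not in NO_SPACE_AFTER
        )
        if needs_space:
            out.append(" ")
        out.append(tok)
    return "".join(out)


def make_documents(sentences, sentences_per_doc=3):
    """Group detokenized sentences into documents of a fixed size (assumed)."""
    docs = []
    for start in range(0, len(sentences), sentences_per_doc):
        docs.append(" ".join(sentences[start:start + sentences_per_doc]))
    return docs


if __name__ == "__main__":
    tokenized = [
        ["Hello", ",", "world", "(", "again", ")", "."],
        ["This", "is", "a", "test", "."],
    ]
    sentences = [detokenize(s) for s in tokenized]
    for doc in make_documents(sentences, sentences_per_doc=2):
        print(doc)
```

The original tokens and sentence boundaries stay available as the gold standard, so the detector and tokenizer output on the rebuilt text can be compared against them.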
If another language is acceptable, the AD corpus has paragraph and text annotations, as well as the original (untokenized) sentences.

On Fri, Mar 22, 2013 at 8:41 AM, Jörn Kottmann <[email protected]> wrote:

> Hello,
>
> do we have any public data we can test the sentence detector and tokenizer
> on?
> It would be nice to remove the private data test for these at some point.
>
> Jörn
>
> On 03/08/2013 03:11 PM, William Colen wrote:
>
>> Hi all,
>>
>> Our second release candidate is ready for testing. RC1 failed to pass the
>> initial quality check.
>>
>> The RC 2 can be downloaded from here:
>> http://people.apache.org/~colen/releases/opennlp-1.5.3/rc2/
>>
>> To use it in a maven build set the version for opennlp-tools or
>> opennlp-uima to 1.5.3, and for opennlp-maxent to 3.0.3, and add this URL
>> to your settings.xml file:
>> https://repository.apache.org/content/repositories/orgapacheopennlp-005/
>>
>> The current test plan can be found here:
>> https://cwiki.apache.org/OPENNLP/testplan153.html
>>
>> Please sign up for tasks in the test plan.
>>
>> The release plan can be found here:
>> https://cwiki.apache.org/OPENNLP/releaseplanandtasks153.html
>>
>> The RC contains quite some changes, please refer to the contained issue
>> list for details.
>>
>> William
