Package: wnpp
Severity: wishlist
Owner: Joost van Baal <joostvb-debian-bugs-2010122...@mdcc.cx>

* Package name    : ucto
  Upstream Author : ILK Research Group, Tilburg University, http://ilk.uvt.nl
* URL             : http://ilk.uvt.nl/mbt/
* License         : GPL-3
  Programming Lang: C++
  Description     : Unicode Tokenizer

 Ucto can tokenize UTF-8 encoded text files (i.e. separate words from
 punctuation, split sentences, generate n-grams), and offers several other
 basic preprocessing steps (change case, count words/characters and reverse
 lines) that make your text suited for further processing such as indexing,
 part-of-speech tagging, or machine translation.
 .
 Ucto is a product of the ILK Research Group, Tilburg University (The
 Netherlands).
 .
 If you are interested in machine parsing of UTF-8 encoded text files, e.g. to
 do scientific research in natural language processing, ucto will likely be of
 use to you.

----

Upstream has not yet officially released ucto; currently there is just an
obsolete prerelease snapshot and some promising code in SVN (not git). See
also https://github.com/proycon , http://proylt.anaproy.nl/en/software/ and
http://proylt.anaproy.nl/media/software/ .

The frog package (see Bug#605905: ITP: frog -- tagger and parser for Dutch
language) will depend upon ucto. Frog will be the new name and reincarnation
of tadpole, see http://ilk.uvt.nl/tadpole/ .

Bye,

Joost

-- 
irc:joos...@{oftc,freenode} ∙ http://mdcc.cx/ ∙ http://ad1810.com/