For many applications, it would be useful to have a universal tagset for any language you are working with. See below for details on a project that provides mappings from many standard treebanks to a coarse-grained tagset (12 tags). We might want to support these mappings to simple tags in our models (e.g., have one model that uses corpus-native tags and another that uses universal tags).
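As a rough illustration of what supporting such a mapping could involve, here is a minimal Python sketch. It assumes a mapping is available as tab-separated lines of the form "native tag, TAB, universal tag". The sample entries, the function names load_tag_map and to_universal, and the fallback to the catch-all tag X are all illustrative assumptions, not part of any existing code of ours.

```python
import io

# Hypothetical excerpt in the assumed "native<TAB>universal" format.
SAMPLE_MAP = "DT\tDET\nNN\tNOUN\nVBZ\tVERB\nJJ\tADJ\nIN\tADP\n.\t.\n"


def load_tag_map(lines):
    """Build a dict from corpus-native tags to universal tags."""
    mapping = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        native, universal = line.split("\t")
        mapping[native] = universal
    return mapping


def to_universal(tagged_sentence, mapping, unknown="X"):
    """Convert (word, native_tag) pairs to (word, universal_tag) pairs.

    Tags missing from the mapping fall back to `unknown` (an assumption
    made for this sketch).
    """
    return [(word, mapping.get(tag, unknown)) for word, tag in tagged_sentence]


tag_map = load_tag_map(io.StringIO(SAMPLE_MAP))
sentence = [("The", "DT"), ("dog", "NN"), ("barks", "VBZ"), (".", ".")]
universal_sentence = to_universal(sentence, tag_map)
print(universal_sentence)
```

With something like this, a model could be trained either on the native tags directly or on the output of to_universal, which is the two-model setup suggested above.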
Jason

--
Jason Baldridge
Assistant Professor, Department of Linguistics
The University of Texas at Austin
http://www.jasonbaldridge.com
http://twitter.com/jasonbaldridge

Hi everyone,

Some of you have already heard about our universal part-of-speech tagset (and are even using it); to others this might be new. We sat down and read through the annotation guidelines of 25 treebanks and created a mapping to a universal set of 12 coarse-grained part-of-speech categories. We have described the tagset and illustrated some use cases in a short write-up (see attached pdf). Additionally, we have uploaded the mappings to a code repository with version control so that new languages can be added or modifications can be made if necessary:

http://code.google.com/p/universal-pos-tags/

The paper is for now on arXiv:

http://arxiv.org/abs/1104.2086

We hope that you will find this resource useful for your own work. Let us know if you have any comments.

Cheers,
Dipanjan, Ryan & Slav
