The following module was proposed for inclusion in the Module List:
modid: Lingua::EN::Tokenizer::Offsets
DSLIP: bdpfp
description: Finds word (token) boundaries, and returns their offsets
userid: ANDREFS (André Fernandes dos Santos)
chapterid: 11 (String_Lang_Text_Proc)
communities:
http://github.com/andrefs/Lingua-EN-Sentence-Offsets/issues
similar:
Lingua::FreeLing3::Tokenizer
rationale:
Tokenizer (word splitter) for English with a twist: it does for tokens
what Lingua::EN::Sentence::Offsets does for sentences.
Most tokenizers return either:
- the original text with forced spacing between tokens, or
- some kind of array containing the tokens.
This module was primarily developed to return, instead, a list of
start-end offset pairs, one per token. This makes it possible to know
where each token starts and ends without actually splitting the text
(a short sketch of the idea follows this registration data).
enteredby: ANDREFS (André Fernandes dos Santos)
enteredon: Sun Jun 3 00:51:05 2012 GMT
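
As a rough illustration of the approach (a generic sketch, not the
module's actual interface; naive_token_offsets and its whitespace-only
rule are invented here for demonstration), offset-based tokenization
records where each token lives in the original string and recovers the
tokens with substr only when needed:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Illustrative only: a naive tokenizer that records offsets
    # instead of splitting the text (the real module's rules are richer).
    sub naive_token_offsets {
        my ($text) = @_;
        my @offsets;
        while ($text =~ /\S+/g) {
            # [start, end) character offsets into the original string
            push @offsets, [ $-[0], $+[0] ];
        }
        return \@offsets;
    }

    my $text    = "Tokens keep their place in the original text.";
    my $offsets = naive_token_offsets($text);

    for my $o (@$offsets) {
        my ($start, $end) = @$o;
        # The original string is never modified; each token is
        # recovered on demand from its offsets.
        printf "%2d-%2d  %s\n", $start, $end,
            substr($text, $start, $end - $start);
    }

The same offset pairs can be kept alongside the untouched source text,
which is the use the rationale describes: knowing where each token
starts and ends without splitting the text.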
The resulting entry would be:
Lingua::EN::Tokenizer::
::Offsets bdpfp Finds word (token) boundaries, and returns t ANDREFS
Thanks for registering,
--
The PAUSE
PS: The following links are only valid for module list maintainers:
Registration form with editing capabilities:
https://pause.perl.org/pause/authenquery?ACTION=add_mod&USERID=d0b00000_09d9564b03957820&SUBMIT_pause99_add_mod_preview=1
Immediate (one click) registration:
https://pause.perl.org/pause/authenquery?ACTION=add_mod&USERID=d0b00000_09d9564b03957820&SUBMIT_pause99_add_mod_insertit=1
Peek at the current permissions:
https://pause.perl.org/pause/authenquery?pause99_peek_perms_by=me&pause99_peek_perms_query=Lingua%3A%3AEN%3A%3ATokenizer%3A%3AOffsets