Re: Byte Offsets of Tokens, Ngrams and Sentences?

Gabriel Genellina Fri, 06 Aug 2010 02:52:57 -0700

On Fri, 06 Aug 2010 06:07:32 -0300, Muhammad Adeel <nawabad...@gmail.com> wrote:

Does anyone know how to tokenize a string in Python so that it returns
the tokens together with their byte offsets? Also, is there a sentence
splitter that returns the sentences with their byte offsets? Finally, I
need n-grams returned with byte offsets.


Input:
This is a string.

Output:
This     0
is       5
a        8
string.  10


Like this?

py> import re
py> s = "This is a string."
py> for g in re.finditer(r"\S+", s):
...     # g.start() is the offset of the match within s
...     print(g.group(), g.start())
...
This 0
is 5
a 8
string. 10
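
One caveat: start() counts positions in whatever was searched, so it is a byte offset for a byte string but a character offset for a Unicode string. The two only agree for pure ASCII text. A minimal sketch, assuming Python 3 and UTF-8 encoded data (the helper name tokens_with_byte_offsets is made up for illustration), that searches the encoded bytes so the offsets are real byte offsets:

```python
import re

def tokens_with_byte_offsets(text):
    """Yield (token, byte_offset) pairs, assuming UTF-8 encoding."""
    data = text.encode("utf-8")
    # A bytes pattern searched over bytes makes start() a byte offset.
    # Note: in a bytes pattern, \S only treats ASCII whitespace as a separator.
    for m in re.finditer(rb"\S+", data):
        yield m.group().decode("utf-8"), m.start()

for token, offset in tokens_with_byte_offsets("This is a string."):
    print(token, offset)
```

For ASCII input this prints the same offsets as the session above; with accented or other multi-byte characters, the offsets of later tokens shift by the extra bytes.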

--
Gabriel Genellina
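
The reply above covers tokens only, while the question also asks for n-grams and sentences. The same finditer idea might be extended along these lines. This is only a sketch, assuming Python 3: the function names are invented for illustration, the sentence rule (end at '.', '!' or '?' followed by whitespace or end of text) is deliberately naive and will mis-split abbreviations such as "Dr. Smith", and the offsets are positions in the string searched (characters for str).

```python
import re

def ngrams_with_offsets(text, n):
    """Return (ngram, start) pairs; each n-gram is the exact slice of text."""
    spans = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    result = []
    for i in range(len(spans) - n + 1):
        start = spans[i][0]           # start of the first token
        end = spans[i + n - 1][1]     # end of the last token
        result.append((text[start:end], start))
    return result

def sentences_with_offsets(text):
    """Return (sentence, start) pairs using a naive punctuation rule."""
    # Start at a non-space character, then take as little as possible up to
    # closing punctuation followed by whitespace/end, or up to end of text.
    pattern = r"\S.*?(?:[.!?]+(?=\s|$)|$)"
    return [(m.group(), m.start())
            for m in re.finditer(pattern, text, re.DOTALL)]

print(ngrams_with_offsets("This is a string.", 2))
print(sentences_with_offsets("This is a string. Here is another one!"))
```

Slicing the original string between token boundaries keeps the original spacing inside each n-gram, so every offset can be checked against the source text directly.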

--
http://mail.python.org/mailman/listinfo/python-list
