Reply inline.

On 21/03/13 12:18 AM, Steven D'Aprano wrote:
> On 20/03/13 22:38, nishitha reddy wrote:
>> Hi all
>> i'm working with unicode using python
>> i have some txt files in telugu i want to split all the lines of that
>> text files in to words of telugu
>> and i need to classify all of them using some identifiers.can any one
>> send solution for that
>
>
> Probably not. I would be surprised if anyone here knows what Telugu is,
> or the rules for splitting Telugu text into words. The Natural Language
> Toolkit (NLTK) may be able to handle it.
>
> You could try doing the splitting and classifying yourself. If Telugu
> uses space-delimited words like English, you can do it easily:
>
> data = u"ఏఐఒ ఓఔక ఞతణథ"
> words = data.split()

Unicode characters for telugu:
http://en.wikipedia.org/wiki/Telugu_alphabet#Unicode
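
As a rough sketch of how the code points from that table could be used for both the splitting and the classifying: the Telugu block is U+0C00 to U+0C7F, so a word can be treated as a maximal run of characters in that range. The names `TELUGU_RUN`, `telugu_words` and `classify`, and the three labels, are my own illustration and not part of any library. What counts as a "word" or a useful class for real Telugu text is an assumption here.

```python
import re

# Assumption: a "Telugu word" is a maximal run of characters from the
# Telugu Unicode block, U+0C00..U+0C7F.
TELUGU_RUN = re.compile('[\u0c00-\u0c7f]+')

def telugu_words(text):
    """Return every maximal run of Telugu-block characters in text."""
    return TELUGU_RUN.findall(text)

def classify(token):
    """Label a whitespace-delimited token (labels are illustrative only)."""
    if TELUGU_RUN.fullmatch(token):
        return 'telugu'   # made up entirely of Telugu-block characters
    if TELUGU_RUN.search(token):
        return 'mixed'    # Telugu plus punctuation, digits, Latin, ...
    return 'other'        # no Telugu-block characters at all

sample = 'ఏఐఒ  ఓఔక, abc ఞతణథ.'
print(telugu_words(sample))
print([(tok, classify(tok)) for tok in sample.split()])
```

Matching runs with `findall` rather than splitting on single non-Telugu characters avoids empty strings when two separators sit next to each other (a double space, or a comma followed by a space).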

On Python 3.x:

>>> import re
>>> a='ఏఐఒ ఓఔక ఞతణథ'
>>> print(a)
ఏఐఒ ఓఔక ఞతణథ
>>> re.split('[^\u0c01-\u0c7f]', a)
['ఏఐఒ', 'ఓఔక', 'ఞతణథ']

Similar logic can be used for any other Indic script by substituting
that script's Unicode range in the character class.

HTH.

-- 
शंतनू
_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor