Xiao Yu Michael Yang wrote:
> Hi tutors,
>
> I am currently working on a project that identifies languages of html
> documents, using Python, of course.
You might be interested in http://chardet.feedparser.org/ which seems to
work directly on HTML.

> Just wondering, given a string:
>
> str = "<html> title this is french 77 992 / <aaabbbccc> </html>"

Note that str is the name of the built-in string type and not a good
choice for a variable name.

> what is the python expression for:
>
> 1. r = return_anything_that's_within<> (str), i.e. it should give "html,
> aaabbbccc, html"

You can do this with a regular expression:

In [1]: s = "<html> title this is french 77 992 / <aaabbbccc> </html>"

In [2]: import re

In [3]: re.findall('<.*?>', s)
Out[3]: ['<html>', '<aaabbbccc>', '</html>']

If you are trying to strip the tags from the HTML, try one of these:
http://www.oluyede.org/blog/2006/02/13/html-stripper/
http://www.aminus.org/rbre/python/cleanhtml.py

> 2. r = remove_all_numbers(str), (what is the python expression for
> 'is_int') i.e. it removes "77" and "992"

What should r look like here? Is it the string s with digits removed, or
some kind of list? s.isdigit() will test if s is a string containing all
digits.

> 3. dif = listA_minus_listB(str, r), i.e. should return ['77', '992'],
> using the above 'r' value.

You seem to be confused about strings vs lists. s is a string, not a
list. If you have two lists a and b and you want a new list containing
everything in a not in b, use a list comprehension:

[ aa for aa in a if aa not in b ]

If what you are looking for is all the number strings from s, you can use
a regular expression again:

In [4]: re.findall(r'\d+', s)
Out[4]: ['77', '992']

Kent

_______________________________________________
Tutor maillist - Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
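
As a follow-up to the three answers above, here is a minimal sketch that
puts them together as a plain script. It rests on two assumptions that
are not in the original question: that the tag names are wanted without
the angle brackets or the closing slash, and that r is meant to be a list
of whitespace-separated words with the all-digit words left out. The
names tag_names and words are made up for illustration.

```python
import re

s = "<html> title this is french 77 992 / <aaabbbccc> </html>"

# 1. Tag names only: the group captures the name, so findall() returns
#    the names without '<', '>' or the '/' of a closing tag.
tag_names = re.findall(r'</?(\w+)[^>]*>', s)

# 2. Assumption: work on a list of words rather than on the string.
#    isdigit() is true only for words made up entirely of digits.
words = s.split()
r = [w for w in words if not w.isdigit()]

# 3. "List A minus list B": everything in words that is not in r.
dif = [w for w in words if w not in r]

print(tag_names)
print(r)
print(dif)
```

Splitting on whitespace only works here because the numbers stand alone
as words; for digits embedded in other text, the re.findall(r'\d+', s)
approach is the safer choice.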