I think a simple regex may come in handy:

  import re
  p = re.compile(r'(.+) .*\1')    # note the space
  s = p.search("python and i love python")
  s.groups()
  ('python',)

But that matches only one doubled word. Someone else could chime in here
on how to extract all the doubled words. Then they can be removed from the
original paragraph.

This has multiple problems, because the greedy (.+) can capture more than one word:

>>> import re
>>> p = re.compile(r'(.+) .*\1')
>>> s = p.search("python one two one two python")
>>> s.groups()
('python',)
>>> s = p.search("python one two one two python one")
>>> s.groups() # guess what happened to the 2nd "one"...
('python one',)

And even once you have the list of candidate duplicates (perhaps by changing the regexp to r'\b(\w+)\b.*?\1'), you still have to worry about emitting the first instance of each word but not the subsequent ones.
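
One way around the "emit the first, skip the rest" bookkeeping is to drop the backreference and track seen words in a set instead. This is only a rough sketch, not something from the original suggestion: the function name drop_repeated_words is made up for illustration, and the case-insensitive comparison and whitespace collapsing are my own assumptions about what "duplicate" should mean here.

```python
import re

def drop_repeated_words(text):
    """Keep the first occurrence of each word and drop later repeats.

    Hypothetical helper for illustration only.
    """
    seen = set()

    def keep_first(match):
        word = match.group(0)
        key = word.lower()  # assumption: "Python" and "python" count as the same word
        if key in seen:
            return ""       # a repeat: emit nothing
        seen.add(key)
        return word         # first instance: emit unchanged

    # \b\w+\b matches whole words only, so "one" is never matched inside "money"
    deduped = re.sub(r"\b\w+\b", keep_first, text)
    # collapse the whitespace left behind by the removed words
    return re.sub(r"\s+", " ", deduped).strip()

print(drop_repeated_words("python one two one two python one"))
```

For the test string above this prints "python one two". Punctuation is left where it was, so a removed word that sat just before a comma or period leaves that mark behind; whether that matters depends on the paragraph being cleaned.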

-tkc




