Joon Ki Choi wrote:
>
> Hello Pythonistas,
>
> I have a very large text file with contents like:
>
> @INBOOK{Ackermann1999-b,
> author = {Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and
> Ackermann,
> K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F.
> and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and
> Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann,
> K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F.
> and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and
> Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann,
> K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F.
> and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and
> Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann},
> year = {1980},
> timestamp = {1995-12-02}
> }
>
> I want to delete the duplicate rows, except the rows containing the
> brackets { or }. The result should look like:
>
> @INBOOK{Ackermann1999-b,
> author = {Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and
> Ackermann,
> Ackermann, K.-F. and Ackermann, K.-F. and Ackermann, K.-F. and Ackermann},
> year = {1980},
> timestamp = {1995-12-02}
> }
>
> I came across this Python script:
>
> lines_seen = set() # holds lines already seen
> outfile = open("literatur_clean.txt", "w")
> for line in open("literatur_dupl.txt", "r"):
>     if line not in lines_seen: # not a duplicate
>         outfile.write(line)
>         lines_seen.add(line)
> outfile.close()
>
> But it also deletes the lines with a closing bracket } and the lines with
> the same author data. Therefore I need the condition on the brackets.
>
> Could someone show me how to add this condition?
>
> Thanks in advance,
> Joon
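For reference, the condition asked for above could be sketched roughly as
follows. This is only a minimal sketch: the names dedupe_lines and clean_file
are made up for illustration, and it assumes that "duplicate" means "an exact
repeat of a line seen anywhere earlier in the file", as in the quoted script.

```python
def dedupe_lines(lines):
    """Yield lines, dropping exact repeats unless the line contains { or }."""
    seen = set()  # brace-free lines already written
    for line in lines:
        if "{" in line or "}" in line:
            # Lines with a brace are always kept, even if repeated.
            yield line
        elif line not in seen:
            seen.add(line)
            yield line


def clean_file(src_path, dst_path):
    """Hypothetical helper: copy src_path to dst_path through dedupe_lines."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(dedupe_lines(src))
```

It would be called as clean_file("literatur_dupl.txt", "literatur_clean.txt").
Because the set of seen lines is shared by the whole file, a brace-free line
that legitimately occurs in two different entries is dropped from the second
one, so this is probably only safe on data shaped like the sample.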
Not what you asked for, but here is something that is quick-and-dirty, too,
but tries a bit harder:

import re

def unique(match):
    names = match.group()[1:-1].split(",")
    parts = set(" ".join(author.split()) for author in names)
    return "{%s}" % ", ".join(parts)

if __name__ == "__main__":
    with open("literatur_dupl.txt") as f:
        data = f.read()
    data = re.compile("{[^{}]*}", re.DOTALL).sub(unique, data)
    with open("literatur_clean.txt", "w") as f:
        f.write(data)

I'm assuming that "very large" means that the file contents still comfortably
fit into your computer's memory...

-- 
http://mail.python.org/mailman/listinfo/python-list
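
A note on the regex approach in the reply: a set does not keep the authors in
their original order, and splitting on "," cuts a "Last, First" name in two.
A variant that splits on the word "and" and keeps first-seen order might look
like the sketch below. The function names are hypothetical, and it assumes
that authors are separated by "and" and that the author field contains no
nested braces.

```python
import re

# Assumption: only fields of the form  author = {...}  without nested braces.
AUTHOR_FIELD = re.compile(r"(author\s*=\s*)\{([^{}]*)\}", re.IGNORECASE)


def unique_authors(match):
    """Rebuild one author field with repeated authors removed, order kept."""
    prefix, body = match.group(1), match.group(2)
    # Collapse line breaks and runs of spaces, then split on the word "and".
    authors = re.split(r"\s+and\s+", " ".join(body.split()))
    # dict.fromkeys drops duplicates and preserves insertion order (3.7+).
    return "%s{%s}" % (prefix, " and ".join(dict.fromkeys(authors)))


def dedupe_author_fields(text):
    """Apply unique_authors to every author field in text."""
    return AUTHOR_FIELD.sub(unique_authors, text)
```

Like the code in the reply, this reads the whole file into memory first, for
example data = dedupe_author_fields(data) between the read and the write.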