Please excuse this long mail. I have read several tutorials and googled for three days, but haven't made any real progress on this question, probably because I'm an absolute novice at Python. I'd be very grateful for some help.

1. My problem:

I have several files in a structured database format. They contain entries like this:

    Type de notice : monographie
    Auteur(s) : John Doe
    Titre(s) : Argl bargl
    Publication : Denver, University of Colorado Press, 1776

    Type de notice : article
    Auteur(s) : Richard Doe
    Titre(s) : wurgl burgl

    Type de notice : recueil
    Titre(s) : orgl gorgl

I want to translate this into BibTeX format. My approach was to read the file in by paragraphs, then extract the values of the fields that interest me and write these values to another file. I cannot go line by line, since I want to reuse, e.g., the values of the "Auteur(s)" and "Titre(s)" fields to generate a key for every item, in the form of "doeargl" or "doewurgl" (via the split and join functions).

The problem is that not every entry contains every field (in my example, #3 doesn't have an author), so I guess I need to test for the existence of these fields before I can use their values.

2. The approach:

There is code here

    http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/408996/

that allows reading a file by paragraphs. I copied this to my script:

    class FileIterator(object):
        """ A general purpose file object iterator cum file object proxy """

        def __init__(self, fw):
            self._fw = fw

        def readparagraphs(self):
            """ Paragraph iterator """
            # This re-uses Alex Martelli's
            # paragraph reading recipe.
            # Python Cookbook 2nd edition 19.10, Page 713
            paragraph = []
            for line in self._fw:
                if line.isspace():
                    if paragraph:
                        yield "".join(paragraph)
                        paragraph = []
                else:
                    paragraph.append(line)
            if paragraph:
                yield "".join(paragraph)

When I now run a very basic test:

    for item in iter.readparagraphs():
        print item

the entire file is reprinted paragraph by paragraph, so this code appears to work.

I can also match the first line of every paragraph like so:

    reBook = re.compile('^Type de notice : monographie')
    for item in iter.readparagraphs():
        m1 = reBook.match(item)
        if m1:
            print "@Book{,"

This prints a line

    @Book{,

for every "monographie" in the database. A good start, I thought!

3. The problem that's driving me insane

But as soon as I try to match anything inside the paragraph:

    reAuthor = re.compile('^Auteur\(s\) : (?P<author>.+)$')
    m2 = reAuthor.match(item)
    if m2:
        author = m2.group('author')
        print "author = {%s}," % author

I get no matches at all. I have tried removing the ^ and the $ from the regex, and adding the "re.DOTALL" flag, but to no avail.

4. My aim

I would like to have a dictionary with fixed keys (the BibTeX fields) and values extracted from my file for every paragraph, and then write this, in a proper format, to a BibTeX file. If a paragraph does not provide a value for a particular key, I could then delete these lines in a second pass over the BibTeX file. But that means I first have to match and extract the values from my paragraphs.

What am I doing wrong? Or is the entire approach flawed? What alternative method would you suggest?

Thanks for any help on this,
Thomas
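
For reference, here is a minimal sketch of one likely cause of the failed matches and a possible way around it. In Python's re module, match() only tries the pattern at the very start of the string, so a field on the second line of a paragraph is never found, whether or not the ^ is present. Using search() together with re.MULTILINE lets ^ and $ match at each line boundary inside the paragraph. The names SAMPLE, AUTHOR_RE, FIELD_RE, parse_paragraph and make_key are illustrative assumptions, not part of the original script, and the code uses Python 3 syntax.

```python
import re

# One paragraph in the format shown in section 1 (sample data from the post).
SAMPLE = (
    "Type de notice : monographie\n"
    "Auteur(s) : John Doe\n"
    "Titre(s) : Argl bargl\n"
    "Publication : Denver, University of Colorado Press, 1776\n"
)

# re.MULTILINE makes ^ and $ match at every line boundary, not only at
# the start and end of the whole paragraph string.
AUTHOR_RE = re.compile(r"^Auteur\(s\) : (?P<author>.+)$", re.MULTILINE)

# Generic "label : value" pattern, one field per line (assumed layout).
FIELD_RE = re.compile(r"^(?P<name>[^:\n]+?) : (?P<value>.+)$", re.MULTILINE)


def parse_paragraph(paragraph):
    """Return a dict mapping each field label to its value."""
    return {m.group("name"): m.group("value")
            for m in FIELD_RE.finditer(paragraph)}


def make_key(fields):
    """Build a key such as 'doeargl' from author surname and first title word.

    Hypothetical helper: missing fields simply contribute nothing to the key.
    """
    author = fields.get("Auteur(s)", "")
    title = fields.get("Titre(s)", "")
    surname = author.split()[-1] if author else ""
    first_word = title.split()[0] if title else ""
    return (surname + first_word).lower()


if __name__ == "__main__":
    # match() is tried only at position 0, where "Type de notice" sits.
    print(AUTHOR_RE.match(SAMPLE))
    # search() scans the whole paragraph and finds the author line.
    print(AUTHOR_RE.search(SAMPLE).group("author"))
    fields = parse_paragraph(SAMPLE)
    print(make_key(fields), fields)
```

Because parse_paragraph returns a plain dictionary, a missing field can be handled with fields.get("Auteur(s)") while the BibTeX entry is being written, which may make the second clean-up pass over the output file unnecessary.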