Thomas A. Schmitz wrote:
> Please excuse this long mail. I have read several tutorials and
> googled for three days, but haven't made any real progress on this
> question, probably because I'm an absolute novice at python. I'd be
> very grateful for some help.
>
> 1. My problem:
>
> I have several files in a structured database format. They contain
> entries like this:
>
> Type de notice : monographie
> Auteur(s) : John Doe
> Titre(s) : Argl bargl
> Publication : Denver, University of Colorado Press, 1776
>
> Type de notice : article
> Auteur(s) : Richard Doe
> Titre(s) : wurgl burgl
>
> Type de notice : recueil
> Titre(s) : orgl gorgl
>
> I want to translate this into a BibTeX format. My approach was to
> read the file in by paragraphs, then extract the values of the fields
> that interest me and write these values to another file. I cannot go
> line by line since I want to reuse, e.g., the value of the
> "Auteur(s)" and "Titre(s)" fields to generate a key for every item,
> in the form of "doeargl" or "doewurgl" (via the split and join
> functions) The problem is that not every entry contains every field
> (in my example, #3 doesn't have an author), so I guess I need to test
> for the existence of these fields before I can use their values.
>
> 2. The approach:
>
> There is code here
> http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/408996/
> which allows to read a file by paragraphs. I copied this to my script:
>
> class FileIterator(object):
>     """ A general purpose file object iterator cum
>     file object proxy """
>
>     def __init__(self, fw):
>         self._fw = fw
>
>     def readparagraphs(self):
>         """ Paragraph iterator """
>
>         # This re-uses Alex Martelli's
>         # paragraph reading recipe.
>         # Python Cookbook 2nd edition 19.10, Page 713
>         paragraph = []
>         for line in self._fw:
>             if line.isspace():
>                 if paragraph:
>                     yield "".join(paragraph)
>                     paragraph = []
>             else:
>                 paragraph.append(line)
>         if paragraph:
>             yield "".join(paragraph)
>
> When I now run a very basic test:
>
> for item in iter.readparagraphs():
>     print item
>
> The entire file is reprinted paragraph by paragraph, so this code
> appears to work.

I would take out the join in this, at least, and return a list of lines.
You don't really have a paragraph, you have structured data. There is no
need to throw away the structure. It might be even more useful to return
a dictionary that maps field names to values.

Also, there doesn't seem to be any reason to make FileIterator a class;
you can use just a generator function (Dick Moores take notice!):

def readparagraphs(fw):
    data = {}
    for line in fw:
        if line.isspace():
            # A blank line ends the current record.
            if data:
                yield data
                data = {}
        else:
            # Split on the first ' : ' only and drop the trailing newline.
            key, value = line.split(' : ', 1)
            data[key] = value.strip()
    # Don't lose the last record if the file has no trailing blank line.
    if data:
        yield data

Now you don't need a regexp; you have usable data directly from the
iterator.

> I can also match the first line of every paragraph like so:
>
> reBook = re.compile('^Type de notice : monographie')
> for item in iter.readparagraphs():
>     m1 = reBook.match(item)
>     if m1:
>         print "@Book{,"
>
> this will print a line @Book{, for every "monographie" in the
> database -- a good start, I thought!
>
> 3. The problem that's driving me insane
>
> But as soon as I try to match anything inside of the paragraph:
>
> reAuthor = re.compile('^Auteur\(s\) : (?P<author>.+)$')
> m2 = reAuthor.match(item)
> if m2:
>     author = m2.group('author')
>     print "author = {%s}," % author
>
> I get no matches at all. I have tried to remove the ^ and the $ from
> the regex, or to add the "re.DOTALL" flag, but to no avail.

You need re.MULTILINE to modify the meaning of ^ and $; re.DOTALL only
affects whether . matches newlines. You also need search() instead of
match(), because match() only tries the start of the string, even with
re.MULTILINE.

> 4. My aim
>
> I would like to have dictionary with fixed keys (the BibTeX field)
> and values extracted from my file for every paragraph and then write
> this, in a proper format, to a bibtex file. If a paragraph does not
> provide a value for a particular key, I could then, in a second pass
> over the bibtex file, delete these lines.

I would write the code to exclude those lines in the first place.
If the dict returned from readparagraphs() is missing a key, then don't
write the corresponding line.

Kent

_______________________________________________
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
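
P.S. Here is a rough sketch of how the pieces above might fit together: a generator that yields one dict per record, and a writer that emits a BibTeX line only for the fields a record actually has. The names ENTRY_TYPES, FIELD_NAMES, make_key and to_bibtex are made up for illustration, and the mapping from the French field names to BibTeX fields (in particular "Publication" to "note", and "Misc" as a fallback entry type) is only an assumption; adjust it to whatever your bibliography needs. It also assumes every non-blank line contains ' : ' and a non-empty value.

```python
import io

SAMPLE = """\
Type de notice : monographie
Auteur(s) : John Doe
Titre(s) : Argl bargl
Publication : Denver, University of Colorado Press, 1776

Type de notice : article
Auteur(s) : Richard Doe
Titre(s) : wurgl burgl

Type de notice : recueil
Titre(s) : orgl gorgl
"""

# Hypothetical mappings, chosen for illustration only.
ENTRY_TYPES = {"monographie": "Book", "article": "Article"}
FIELD_NAMES = [("Auteur(s)", "author"),
               ("Titre(s)", "title"),
               ("Publication", "note")]


def readparagraphs(lines):
    """Yield one dict per blank-line-separated record."""
    data = {}
    for line in lines:
        if line.isspace():
            if data:
                yield data
                data = {}
        else:
            # Assumes each non-blank line has the form 'name : value'.
            key, value = line.split(" : ", 1)
            data[key.strip()] = value.strip()
    if data:
        yield data


def make_key(record):
    """Build a key such as 'doeargl' from whichever fields are present."""
    parts = []
    if "Auteur(s)" in record:
        parts.append(record["Auteur(s)"].split()[-1])  # last word of author
    if "Titre(s)" in record:
        parts.append(record["Titre(s)"].split()[0])    # first word of title
    return "".join(parts).lower()


def to_bibtex(record):
    """Format one record, skipping any field the record does not have."""
    entry_type = ENTRY_TYPES.get(record.get("Type de notice"), "Misc")
    lines = ["@%s{%s," % (entry_type, make_key(record))]
    for source_name, bibtex_name in FIELD_NAMES:
        if source_name in record:
            lines.append("  %s = {%s}," % (bibtex_name, record[source_name]))
    lines.append("}")
    return "\n".join(lines)


entries = [to_bibtex(r) for r in readparagraphs(io.StringIO(SAMPLE))]
print("\n\n".join(entries))
```

Because the missing-field test happens while writing, there is no need for a second pass over the BibTeX file. With a real file you would pass the open file object to readparagraphs() in place of io.StringIO(SAMPLE).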