DEP-5: an example parser, choice of syntax for Files:
Given that DEP-5 is supposed to be about machine-readability, I thought it would be worthwhile trying to write something to parse the proposed format.

Please find attached a short Python script that I have written based on the current text of DEP-5 at dep.debian.net[1]. It's designed to be run from an unpacked and patched source package (or at least a source tree containing debian/copyright, which it attempts to parse). It will print out a list of each Files: stanza found in the copyright file, followed by the list of files which it believes are matched by the stanza.

It has proven useful to me: I found several bugs in a copyright file I'd written for a real live package, based on my misinterpretation of the current wording.

Whilst writing this, I found the syntax chosen for the Files: field to be very awkward. Indeed my crude parser only handles a subset of the syntax so far (no escapes, no handling of quoted strings).

Most of the examples given in DEP-5 containing the path character will not work, either, e.g.

    Files: debian/*

Assuming they are passed into a find(1) invocation like so

    find . -path 'debian/*'

(note the presence of the path separator and the wording about that in the text)

they need to be prefixed with './', even if you omit '.' in the find execution (which itself is a GNUism iirc). Patch attached.

I think I would much prefer using regular expressions here. For one thing, I'm worried about variations in find(1) behaviour across platforms. For another, unless a parser calls find(1) (as I have, and it's expensive), trying to match its behaviour will imho be a lot more error-prone than using your language's built-in regular expression library or pcre or whatever. I will try to cook a patch for comment.

[1] (I need to re-read the older DEP-5 messages to understand the current maintainership situation: I see Steve removed the other drivers in that version, and Charles did the same in his git repo...)
-- 
Jon Dowland

#!/usr/bin/python
# a crude DEP-5 parser
# Copyright (c) 2009 Jon Dowland
# Copying and distribution of this file, with or without modification, are
# permitted in any medium without royalty provided the copyright notice and this
# notice are preserved.

# usage: run the script from within an unpacked source tarball with the debian
# diff.gz applied on top (or at least, a DEP-5-syntax debian/copyright file
# available)

from email import parser
from sys import exit
from os import popen

##
## step 1: handle/parse RFC822 superset

# remove blank lines so the parser treats it all as an email header
copyright = parser.Parser().parsestr(
    ''.join(
        filter(lambda x: "\n" != x,
            open("debian/copyright").readlines()
)))
if len(copyright.keys()) < 1:
    print "parser didn't get any headers from the copyright file"
    exit(1)

##
## step 2: interpret the headers and build a list of tuples
##         (files, license, copyright)

# DEP5 header. Format-Specification is required. Others are optional.
valid = "Format-Specification Name Maintainer Source Disclaimer".split()
header = dict([ [x,''] for x in valid])
files = "Files Copyright License".split()

# first loop: handle the header
for i in range(0,len(copyright.items())):
    key = copyright.keys()[i]
    # skip over x-Arbitrary: headers
    if key[0] == 'x':
        continue
    if key in valid:
        if header[key]:
            print "error: redefinition of '%s'." % key
            exit(1)
        header[key] = copyright.values()[i]
        continue
    # this marks the transition from the header onwards
    if key in files:
        if not header['Format-Specification']:
            print "error: Format-Specification must be defined " +\
                  "before the Files section"
            exit(1)
        break
    print "unrecognised key '%s'" % key
    exit(1)

# second loop: looping through the main parts
current = dict([ [x,''] for x in files])
tuples = []

# take a hash of Files/Copyright/License and split it up
# into multiple ones based on the Files key
# first rule: multiple items separated by commas
# XXX: unhandled: escaped commas; quoted-strings
#      containing commas
def append(tuples, current):
    for t in current['Files'].split(","):
        c = current.copy()
        c['Files'] = t.strip()
        tuples.append(c)

for i in range(i,len(copyright.items())):
    key = copyright.keys()[i]
    # skip over x-Arbitrary: headers
    if key[0] == 'x':
        continue
    if key in files:
        # handle implicit 'Files: *'
        if 'Files' != key and not current['Files']:
            current['Files'] = '*'
        # new Files: stanza ends the last one
        elif 'Files' == key and current['Files']:
            for defn in ['License', 'Copyright']:
                if not current[defn]:
                    prin
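
To illustrate the './' problem described in the message above: find(1) compares a -path pattern against the whole path as it prints it, starting point included, so every path found under '.' begins with './'. The sketch below only mimics that comparison with Python's fnmatch module (an approximation for illustration, not a reimplementation of find), and the sample path is a made-up example.

```python
from fnmatch import fnmatchcase

# Path as `find .` would print it: prefixed with the starting point "./".
# The file name is a hypothetical example.
path_from_find = "./debian/copyright"

# Pattern as written in the DEP-5 example: no "./" prefix, so it cannot
# match a path that begins with "./".
bare_pattern_matches = fnmatchcase(path_from_find, "debian/*")

# Same pattern with the "./" prefix added.
prefixed_pattern_matches = fnmatchcase(path_from_find, "./debian/*")

print(bare_pattern_matches, prefixed_pattern_matches)
```

Only the prefixed pattern matches, which is the behaviour the message reports for the examples in the draft.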
Re: DEP-5: an example parser, choice of syntax for Files:
On Sunday, 13.09.2009, at 23:58 +0100, Jon Dowland wrote:
> Most of the examples given in DEP-5 containing the path
> character will not work, either, e.g.
>
>   Files: debian/*
>
> Assuming they are passed into a find(1) invocation like so
>
>   find . -path 'debian/*'
>
> (note the presence of the path separator and the wording
> about that in the text)
>
> they need to be prefixed with './', even if you omit '.' in
> the find execution (which itself is a GNUism iirc). Patch
> attached.

You can get rid of those './' by replacing . with *:

    find * -path 'debian/*'

Cheers,
Benjamin
Re: DEP-5: an example parser, choice of syntax for Files:
Benjamin Drung wrote:

> You can get rid of those './' by replacing . with *:
>
>   find * -path 'debian/*'

That's not entirely equivalent; shells do not expand globs to include files beginning with "." by default.

Jon, I would suggest using os.walk and fnmatch - it will cover this issue and will remove the requirement to execute a subprocess.

Regards,

-- 
      ,''`.
     : :'  :     Chris Lamb
     `. `'`      la...@debian.org
       `-
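
A minimal sketch of the os.walk and fnmatch approach suggested above, written for Python 3. The helper name `files_matching` and the demo file names are assumptions made for illustration; they do not come from the thread or from DEP-5. Paths are matched relative to the tree root, so no './' prefix is needed, and fnmatch does not treat a leading "." specially, so hidden files are included.

```python
import os
import tempfile
from fnmatch import fnmatchcase


def files_matching(root, pattern):
    """Return sorted '/'-separated paths, relative to root, that match pattern."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            rel = rel.replace(os.sep, "/")
            # fnmatch's "*" also matches "/", so "debian/*" covers
            # subdirectories of debian/ as well.
            if fnmatchcase(rel, pattern):
                matches.append(rel)
    return sorted(matches)


# Build a tiny throwaway tree to demonstrate (hypothetical file names).
demo_root = tempfile.mkdtemp()
for rel in ("debian/copyright", "debian/.hidden", "src/main.c"):
    full = os.path.join(demo_root, *rel.split("/"))
    os.makedirs(os.path.dirname(full), exist_ok=True)
    open(full, "w").close()

debian_files = files_matching(demo_root, "debian/*")
print(debian_files)
```

Because everything happens in-process, this avoids both the subprocess call and the shell's handling of dotfiles in `find *`.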
Re: DEP-5: an example parser, choice of syntax for Files:
On Sun, Sep 13, 2009 at 11:58:46PM +0100, Jon Dowland wrote:
> Please find attached a short python script that I have written based on the
> current text of DEP-5 at dep.debian.net[1].

> It will print out a list of each Files: stanza found in the copyright,
> followed by the list of files which it believes are matched by the stanza.

Hello again,

thank you for sharing this parser. I think that such a program will be very useful in the future, for instance to generate a static list of files with their license that is refreshed at each upstream upgrade. The diff of this file would be more informative than a simple debdiff, as it would give more context about the new files, and therefore help us to spot the inclusion of non-free, non-redistributable or non-compatible material at each upgrade.

I think that DEP 5 still needs some work, so I hope that you will not be too annoyed if some changes made during the discussions on the draft break your parser from time to time.

As you noted, one of the things to change is the syntax of the Files field. It has actually been debated at length in a previous thread this year, and while there was no unanimity on how to handle special characters, I think that there was a clear majority in favor of dropping the syntax based on find. Here is a link to the discussion thread.

http://lists.debian.org/msgid-search/200906070044.24583.mcita...@gmail.com

Here is what I think best summarises the discussion:

  Files: List of space-separated pathnames indicating files that have the same
  licence. Question marks indicate any character and asterisks indicate any
  string of characters.

And here is an extension I suggest to make this field optional, as I share the concerns of the GNOME, KDE, and Iceweasel maintainers about not doing impossible file–copyright listings on the largest packages.

  When this field is omitted in the first paragraph containing a `License`
  field, its value will be assumed to be '*'. When omitted in other paragraphs,
  its value is assumed to be 'unknown'. If multiple `Files` declarations match
  the same file, then only the last match counts.

My point of view as a DEP driver is that there has already been a lot of discussion on the subject, enough to justify modifying the DEP draft to replace the ‘find’ syntax by a simple list with a few wildcard characters that can be pasted into the shell. This would have saved you some time, as the version you worked on is clearly non-consensual.

But Steve and I disagree on how to use the DEP Subversion repository: I would like to commit directly, but Steve, who was the first to set up the DEP 5 markdown version and who therefore has the last word, wants all commits to be done through bzr-svn. I had enough headaches, and we already lost two drivers, because bzr is a matter of taste and in our case three out of four of us strongly disliked it. In addition, I realised that bzr-svn rewrites the history of the Subversion repository, and for this reason I simply refuse to use it. Although I am sure it is not intended, it puts Steve's name on my work, and it also puts my name on commit comments for work I have never touched. See the history of DEP 4 for instance.

I hope that this helps to understand the sad situation. While waiting for a gentlemen's agreement, instead of forking, I have set up an exploratory version in git where I try to collect improvements, most of which have been discussed, and which I test on my own packages. This repository is writable by all DDs and, in contrast to the DEP repo, by all Alioth guests as well. Everybody is welcome to contribute! Everybody is also welcome to add their name to the Drivers field if they see themselves as drivers, in the spirit of DEP 0.

Have a nice day,

-- 
Charles Plessy
Tsurumi, Kanagawa, Japan

-- 
To UNSUBSCRIBE, email to debian-devel-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org
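
One possible reading of the Files semantics proposed in the message above (space-separated patterns, '?' and '*' wildcards, an implicit '*' for the first paragraph with a License field, last match wins) can be sketched in Python 3 as follows. The function name `license_for`, the stanza layout and the licence names are hypothetical. Two further assumptions are made because the proposal does not settle them: '*' is allowed to match '/' (as fnmatch does), and a later paragraph whose Files value is 'unknown' is treated as matching no file.

```python
from fnmatch import fnmatchcase


def license_for(path, stanzas):
    """Return the licence of the last stanza matching path, or None.

    stanzas: list of dicts in debian/copyright order, each with a
    "License" key and an optional space-separated "Files" key.
    """
    result = None
    first_license_seen = False
    for stanza in stanzas:
        if "License" not in stanza:
            continue
        files = stanza.get("Files")
        if files is None:
            # Omitted Files field: '*' only in the first paragraph that has
            # a License field; later ones match nothing (our assumption
            # for the proposal's 'unknown').
            patterns = [] if first_license_seen else ["*"]
        else:
            patterns = files.split()
        first_license_seen = True
        if any(fnmatchcase(path, p) for p in patterns):
            result = stanza["License"]  # later matches override earlier ones
    return result


# Hypothetical stanzas for demonstration only.
stanzas = [
    {"License": "GPL-2+"},
    {"Files": "debian/* doc/?.txt", "License": "BSD"},
    {"Files": "debian/copyright", "License": "MIT"},
]

for sample in ("src/main.c", "debian/rules", "debian/copyright", "doc/ab.txt"):
    print(sample, license_for(sample, stanzas))
```

With these stanzas, debian/copyright matches all three paragraphs and takes the licence of the last one, while doc/ab.txt falls back to the implicit '*' because '?' stands for exactly one character.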
Re: DEP-5: an example parser, choice of syntax for Files:
[ brief response, I hope to respond in more detail to much of this thread soon ]

On Tue, Sep 15, 2009 at 08:54:05AM +0900, Charles Plessy wrote:
> I think that DEP 5 still needs some work, so I hope that
> you will not be too annoyed if some changes during the
> discussions on the draft will break your parser from time
> to time.

Not at all, it is to be expected. Actually I am going to experiment to see whether the parser can fit into an Appendix, or whether it would be too unwieldy. (And I would ultimately like to write a test suite, in the style of feedvalidator.org etc., which would clearly not fit into the text.)

> As you noted, one of the things to change is the syntax of
> the Files field. It actually has been debated in length in
> a previous thread this year

> Steve and I disagree on how to use the DEP Subversion
> repository

Thank you for the references and background: I think it's time I explored the prior discussion on DEP-5 in more depth.

-- 
Jon Dowland
Re: DEP-5: an example parser, choice of syntax for Files:
apologies, i didn't find this thread until i talked on #debian-devel today, so um... i wrote my own :)

http://pyjamas.svn.sourceforge.net/viewvc/pyjamas/trunk/contrib/copyright_check.py?view=log

pyjamas has 2,000 files, from a wide range of projects and sources (fckeditor, python, random win32 mailing lists to name a few). i sure as s**t wasn't going to check them manually. so i wrote copyright_check.py

it's designed to match the debian/copyright file with the files that it finds in the "Files: " sections, looking for their copyright notices (as best can be found), then doing fuzzy-matching on the authors listed in the debian/copyright sections and those *actually* found in the files. [licensecheck would be a nice-to-have addition to the mix].

somehow, despite my boolean-logic dyslexia, i think i managed to print out only those copyright holders found _not_ listed in the debian/copyright file.

there are limitations (listed at the top of the file) but copyright_check.py even managed to find the copyright holders listed in some .gif files.

anyway - i'm happy to use this for pyjamas: it's made available to anyone else who might want to do something with it.

l.
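
The general idea described above, reporting copyright holders found in the files that have no approximate match among those listed in debian/copyright, can be sketched in Python 3 with the standard difflib module. This is not how copyright_check.py works (its method is not shown in the thread); the function names, the 0.8 cutoff and the sample names are all assumptions chosen for illustration.

```python
import difflib


def normalise(name):
    """Lower-case and collapse whitespace so trivial differences are ignored."""
    return " ".join(name.lower().split())


def unlisted_holders(found, listed, cutoff=0.8):
    """Return the names in `found` that have no close match in `listed`."""
    listed_norm = [normalise(n) for n in listed]
    missing = []
    for name in found:
        close = difflib.get_close_matches(
            normalise(name), listed_norm, n=1, cutoff=cutoff
        )
        if not close:
            missing.append(name)
    return missing


# Hypothetical data: holders listed in debian/copyright versus holders
# extracted from the source files.
listed = ["Alice Example", "Bob Sample"]
found = ["alice  example", "Bob Sample.", "Carol Other"]

not_listed = unlisted_holders(found, listed)
print(not_listed)
```

The cutoff trades false alarms against misses: a lower value tolerates more spelling variation but may hide a holder who is missing from the copyright file.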