DEP-5: an example parser, choice of syntax for Files:

2009-09-13 Thread Jon Dowland
Given that DEP-5 is supposed to be about machine-
readability, I thought it would be worthwhile trying to
write something to parse the proposed format.  Please find
attached a short python script that I have written based on
the current text of DEP-5 at dep.debian.net[1].

It's designed to be run from an unpacked and patched source
package (or at least a source tree containing
debian/copyright, which it attempts to parse). It will
print out a list of each Files: stanza found in the
copyright, followed by the list of files which it believes
are matched by the stanza.

It has proven useful to me: I found several bugs in a
copyright file I'd written for a real live package, based
on my misinterpretation of the current wording.

Whilst writing this, I found the syntax chosen for the
Files: field to be very awkward. Indeed my crude parser
only handles a subset of the syntax so far (no escapes, no
handling of quoted strings).

Most of the examples given in DEP-5 containing the path
character will not work, either, e.g.

Files: debian/*

Assuming they are passed into a find(1) invocation like so

find . -path 'debian/*'

(note the presence of the path separator and the wording
about that in the text)

they need to be prefixed with './', even if you omit '.' in
the find execution (which itself is a GNUism iirc).  Patch
attached.
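
To illustrate the mismatch without shelling out: `find .` prints every
path prefixed with its starting point, so a file appears as
'./debian/rules', and -path is matched against that whole string. A
minimal sketch, assuming Python's fnmatch is a close enough stand-in for
find's matcher (similar, but not guaranteed identical) and using a
made-up file name:

```python
from fnmatch import fnmatchcase

# Path as `find .` would print it for a hypothetical file debian/rules.
path_from_find = "./debian/rules"

# The pattern as written in the DEP-5 examples does not match ...
print(fnmatchcase(path_from_find, "debian/*"))    # False

# ... but the same pattern prefixed with './' does.
print(fnmatchcase(path_from_find, "./debian/*"))  # True
```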

I think I would much prefer using regular expressions here.
For one thing I'm worried about variations in find(1)
behaviours across platforms. For another, unless a parser
calls find(1) (as I have, and it's expensive), trying to
match its behaviour will imho be a lot more error-prone
than using your language's built-in regular expression
library or pcre or whatever. I will try to cook a patch for
comment.
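
A rough sketch of what regex-based matching might look like, assuming a
Files: value held a regular expression that must match the whole
relative path. The value and the file names below are invented for
illustration and are not taken from any DEP-5 draft:

```python
import re

# Hypothetical regex-style Files: value; not part of any DEP-5 draft.
files_value = r"debian/.*"
regex = re.compile(files_value)

# Relative paths as a parser might collect them from the source tree.
candidates = ["debian/rules", "debian/patches/01-fix.diff", "src/debian/x"]

# fullmatch anchors the expression at both ends of the path.
matched = [p for p in candidates if regex.fullmatch(p)]
print(matched)  # ['debian/rules', 'debian/patches/01-fix.diff']
```

No subprocess is needed, and the behaviour depends only on the regular
expression library rather than on the local find(1).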

[1] (I need to re-read the older DEP-5 messages to
understand the current maintainership situation: I see
Steve remove the other drivers in that version, and
Charles do the same in his git repo...)


-- 
Jon Dowland
#!/usr/bin/python
# a crude DEP-5 parser

# Copyright (c) 2009 Jon Dowland 
# Copying and distribution of this file, with or without modification, are
# permitted in any medium without royalty provided the copyright notice and this
# notice are preserved.

# usage: run the script from within an unpacked source tarball with the debian
# diff.gz applied on top (or at least, a DEP-5-syntax debian/copyright file
# available)

from email import parser
from sys import exit
from os import popen

##
## step 1: handle/parse RFC822 superset

# remove blank lines so the parser treats it all as an email header
copyright = parser.Parser().parsestr(
    ''.join(
        filter(lambda x: "\n" != x,
            open("debian/copyright").readlines()
    )))
if len(copyright.keys()) < 1:
    print "parser didn't get any headers from the copyright file"
    exit(1)

##
## step 2: interpret the headers and build a list of tuples
##  (files, license, copyright)

# DEP5 header. Format-Specification is required. Others are optional.
valid = "Format-Specification Name Maintainer Source Disclaimer".split()
header = dict([ [x,''] for x in valid])
files = "Files Copyright License".split()

# first loop: handle the header
for i in range(0,len(copyright.items())):
    key = copyright.keys()[i]

    # skip over x-Arbitrary: headers
    if key[0] == 'x':
        continue

    if key in valid:
        if header[key]:
            print "error: redefinition of '%s'." % key
            exit(1)
        header[key] = copyright.values()[i]
        continue

    # this marks the transition from the header onwards
    if key in files:
        if not header['Format-Specification']:
            print "error: Format-Specification must be defined " +\
                  "before the Files section"
            exit(1)
        break

    print "unrecognised key '%s'" % key
    exit(1)

# second loop: looping through the main parts
current = dict([ [x,''] for x in files])
tuples = []

# take a hash of Files/Copyright/License and split it up
# into multiple ones based on the Files key
#   first rule: multiple items separated by commas
#   XXX: unhandled: escaped commas; quoted-strings
#        containing commas
def append(tuples, current):
    for t in current['Files'].split(","):
        c = current.copy()
        c['Files'] = t.strip()
        tuples.append(c)

for i in range(i,len(copyright.items())):
    key = copyright.keys()[i]

    # skip over x-Arbitrary: headers
    if key[0] == 'x':
        continue

    if key in files:
        # handle implicit 'Files: *'
        if 'Files' != key and not current['Files']:
            current['Files'] = '*'
        # new Files: stanza ends the last one
        elif 'Files' == key and current['Files']:
            for defn in ['License', 'Copyright']:
                if not current[defn]:
                    prin

Re: DEP-5: an example parser, choice of syntax for Files:

2009-09-13 Thread Benjamin Drung
On Sunday, 13.09.2009 at 23:58 +0100, Jon Dowland wrote:
> Most of the examples given in DEP-5 containing the path
> character will not work, either, e.g.
> 
> Files: debian/*
> 
> Assuming they are passed into a find(1) invocation like so
> 
> find . -path 'debian/*'
> 
> (note the presence of the path separator and the wording
> about that in the text)
> 
> they need to be prefixed with './', even if you omit '.' in
> the find execution (which itself is a GNUism iirc).  Patch
> attached.

You can get rid of those './' by replacing . with *:

find * -path 'debian/*'

Cheers,
Benjamin




Re: DEP-5: an example parser, choice of syntax for Files:

2009-09-14 Thread Chris Lamb
Benjamin Drung wrote:

> You can get rid of those './' by replacing . with *:
> 
> find * -path 'debian/*'

That's not entirely equivalent; shells do not expand globs to include
files beginning with "." by default.

Jon, I would suggest using os.walk and fnmatch - it will cover this
issue and will remove the requirement to execute a subprocess.
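
A minimal sketch of that suggestion, written for Python 3 (unlike the
attached script). The helper name match_files is hypothetical, and
fnmatch is assumed to be an acceptable approximation of the intended
pattern semantics: its '*' also matches '/' and leading dots, and it
additionally accepts [seq] brackets.

```python
import os
from fnmatch import fnmatchcase

def match_files(root, pattern):
    """Return paths under root, relative to root, that match pattern."""
    matched = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            # Normalise to '/' so patterns behave the same on every platform.
            rel = rel.replace(os.sep, "/")
            if fnmatchcase(rel, pattern):
                matched.append(rel)
    return sorted(matched)
```

Calling match_files(".", "debian/*") from the top of a source tree
yields paths such as 'debian/rules' with no './' prefix, and dotfiles
are included because os.walk does not skip them.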


Regards,

-- 
  ,''`.
 : :'  : Chris Lamb
 `. `'`  la...@debian.org
   `-




Re: DEP-5: an example parser, choice of syntax for Files:

2009-09-14 Thread Charles Plessy
On Sun, Sep 13, 2009 at 11:58:46PM +0100, Jon Dowland wrote:
> Please find attached a short python script that I have written based on the
> current text of DEP-5 at dep.debian.net[1].
 
> It will print out a list of each Files: stanza found in the copyright,
> followed by the list of files which it believes are matched by the stanza.

Hello again,

thank you for sharing this parser. I think that such a program will be very
useful in the future, for instance to generate a static list of files with
their license, that is refreshed at each upstream upgrade. The diff of this
file would be more informative than a simple debdiff, as it would give more
context about the new files, and therefore help us to spot the inclusion of
non-free, non-redistributable or non-compatible material at each upgrade.

I think that DEP 5 still needs some work, so I hope that you will not be too
annoyed if some changes during the discussions on the draft will break your
parser from time to time.

As you noted, one of the things to change is the syntax of the Files field. It
actually has been debated at length in a previous thread this year, and while
there was no unanimity on how to handle special characters, I think that there
was a clear majority in favor of dropping the syntax based on find. Here is a
link to the discussion thread.

http://lists.debian.org/msgid-search/200906070044.24583.mcita...@gmail.com

Here is what I think summarises best the discussion:

 Files:  List of space-separated pathnames indicating files that have the same
 licence. Question marks indicate any character and asterisks indicate any
 string of characters.
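
 A small sketch of how a parser might evaluate such a field, assuming
 Python's fnmatch is an acceptable match for these semantics (it also
 accepts [seq] brackets, which the text above does not mention). The
 helper name is hypothetical, and pathnames containing spaces are not
 handled:

```python
from fnmatch import fnmatchcase

def files_field_matches(field_value, path):
    """True if path matches any space-separated pattern in a Files value."""
    return any(fnmatchcase(path, pat) for pat in field_value.split())

print(files_field_matches("debian/* README?", "debian/rules"))  # True
print(files_field_matches("debian/* README?", "README1"))       # True
print(files_field_matches("debian/* README?", "README"))        # False
```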

And here is an extension I suggest to make this field optional, as I share the
concerns of the GNOME, KDE, and Iceweasel maintainers about not doing
impossible file–copyright listings on the largest packages.

 When this field is omitted in the first paragraph containing a `License` field,
 its value will be assumed to be '*'. When omitted in other paragraphs, its
 value is assumed to be 'unknown'. If multiple `Files` declarations match the
 same file, then only the last match counts.
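
 These two rules could be sketched as follows. This is one possible
 reading, not a reference implementation: paragraphs are assumed to be
 already parsed into dicts, 'unknown' is interpreted as matching no
 file, and the function names and sample values are made up.

```python
from fnmatch import fnmatchcase

def effective_files(paragraphs):
    """Yield (files_value, paragraph), filling in the proposed defaults."""
    first_license_seen = False
    for para in paragraphs:
        has_license = "License" in para
        files = para.get("Files")
        if files is None:
            # Only the first paragraph carrying a License defaults to '*'.
            files = "*" if has_license and not first_license_seen else "unknown"
        if has_license:
            first_license_seen = True
        yield files, para

def license_for(path, paragraphs):
    """Return the License of the last paragraph whose Files match path."""
    result = None
    for files, para in effective_files(paragraphs):
        if files == "unknown":
            continue  # assumption: 'unknown' matches no file
        if any(fnmatchcase(path, pat) for pat in files.split()):
            result = para.get("License")  # later matches override earlier ones
    return result

paragraphs = [
    {"Name": "example"},                               # header, no License
    {"License": "GPL-2+"},                             # Files omitted: '*'
    {"Files": "debian/*", "License": "BSD-3-clause"},
]
print(license_for("src/main.c", paragraphs))    # GPL-2+
print(license_for("debian/rules", paragraphs))  # BSD-3-clause
```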

My point of view as a DEP driver is that there has already been a lot of
discussion on the subject, enough to justify modifying the DEP draft to
replace the ‘find’ syntax with a simple list using a few wildcard characters
that can be pasted into a shell. This would have saved you some time, as the
version you worked on clearly does not reflect consensus.

But Steve and I disagree on how to use the DEP Subversion repository: I would
like to commit directly, but Steve, who was the first to set up the DEP 5
markdown version and who therefore has the last word, wants all commits to be
done through bzr-svn. I had enough headaches, and we already lost two drivers,
because bzr is a matter of taste and in our case three out of four of us
strongly disliked it. In addition, I realised that bzr-svn rewrites the
history of the Subversion repository, and for this reason I simply refuse to
use it. Although I am sure it is not intended, it puts Steve's name on my
work, and it also puts my name on commit comments for work I have never
touched. See the history of DEP 4 for instance. I hope that this helps to
explain the sad situation.

While waiting for a gentlemen's agreement, instead of forking, I have set up
an exploratory version in git where I try to collect improvements, most of
which have been discussed, and which I test on my own packages. This
repository is writable by all DDs and, contrary to the DEP repo, by all Alioth
guests as well. Everybody is welcome to contribute! Everybody is also welcome
to add their name to the Drivers field if they see themselves as drivers, in
the spirit of DEP 0.

Have a nice day,

-- 
Charles Plessy
Tsurumi, Kanagawa, Japan


-- 
To UNSUBSCRIBE, email to debian-devel-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org



Re: DEP-5: an example parser, choice of syntax for Files:

2009-09-15 Thread Jon Dowland
[ brief response, I hope to respond in more detail to much
of this thread soon ]

On Tue, Sep 15, 2009 at 08:54:05AM +0900, Charles Plessy
wrote:
> I think that DEP 5 still needs some work, so I hope that
> you will not be too annoyed if some changes during the
> discussions on the draft will break your parser from time
> to time.

Not at all, it is to be expected.  Actually I am going to
experiment to see whether the parser can fit into an
Appendix, or whether it would be too unwieldy. (and I
would ultimately like to write a test suite, in the style
of feedvalidator.org etc., which would clearly not fit into
the text).

> As you noted, one of the things to change is the syntax of
> the Files field. It actually has been debated at length in
> a previous thread this year

> Steve and I disagree on how to use the DEP Subversion
> repository

Thank you for the references and background: I think it's
time I explored the prior discussion on DEP-5 in more depth.


-- 
Jon Dowland




Re: DEP-5: an example parser, choice of syntax for Files:

2009-09-20 Thread lkcl

apologies, i didn't find this thread until i talked on #debian-devel today,
so um... i wrote my own :)
http://pyjamas.svn.sourceforge.net/viewvc/pyjamas/trunk/contrib/copyright_check.py?view=log

pyjamas has 2,000 files, from a wide range of projects and sources
(fckeditor, python, random win32 mailing lists, to name a few): i sure as
s**t wasn't going to check them manually.  so i wrote copyright_check.py

it's designed to match the debian/copyright file with the files that it
finds in the "Files: " sections, looking for their copyright notices (as
best can be found), then doing fuzzy-matching on the authors listed in the
debian/copyright sections and those *actually* found in the files. 
[licensecheck would be a nice-to-have addition to the mix].

somehow, despite my boolean-logic dyslexia, i think i managed to print out
only those copyright holders found _not_ listed in the debian/copyright
file.
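
as a rough illustration of the idea only: the sketch below is not taken
from copyright_check.py, whose actual matching method is not shown in
this message. it swaps in difflib from the python standard library, and
the names, the 0.8 threshold and the helper functions are all invented:

```python
from difflib import SequenceMatcher

def is_listed(found, declared, threshold=0.8):
    """True if `found` is close enough to any declared holder.

    The threshold is an arbitrary value chosen for this example.
    """
    found = found.lower()
    return any(
        SequenceMatcher(None, found, name.lower()).ratio() >= threshold
        for name in declared
    )

def unlisted_holders(found_in_files, declared):
    """Return holders found in the files but not listed in debian/copyright."""
    return [h for h in found_in_files if not is_listed(h, declared)]

# Invented sample data: one near-duplicate spelling, one genuinely missing.
declared = ["Jane Example", "Example Project Contributors"]
found = ["Jane  Example", "J. Random Hacker"]
print(unlisted_holders(found, declared))  # ['J. Random Hacker']
```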

there are limitations (listed at the top of the file) but copyright_check.py
even managed to find the copyright holders listed in some .gif files.

anyway - i'm happy to use this for pyjamas: it's made available to anyone
else who might want to do something with it.

l.


