Hello!

Ricardo Wurmus <ricardo.wur...@mdc-berlin.de> writes:
> Patch number 3!  Yay!
>
>> From 0c62b541a3e8925b5ca31fe55dbe7536cf95151f Mon Sep 17 00:00:00 2001
>> From: Maxim Cournoyer <maxim.courno...@gmail.com>
>> Date: Thu, 28 Mar 2019 00:26:01 -0400
>> Subject: [PATCH 3/9] import: pypi: Improve parsing of requirement
>>  specifications.
>>
>> The previous solution was fragile and could leave unwanted characters
>> in a requirement name, such as '[' or ']'.
>
> Wouldn’t it be sufficient to add [ and ] to the list of forbidden
> characters?  The tests pass with this implementation of
> clean-requirements:
>
> (define (clean-requirements s)
>   (cond
>    ((string-index s (char-set #\space #\> #\= #\< #\[ #\]))
>     => (cut string-take s <>))
>    (else s)))

Indeed this would be sufficient to make the tests pass, but the tests
don't cover all the cases; as an example, consider:

--8<---------------cut here---------------start------------->8---
argparse;python_version<"2.7"
--8<---------------cut here---------------end--------------->8---

While we could make it work with the current logic by adding more
invalid characters (such as ';' here) to the character set, it seems
less error-prone to use the upstream-provided regexp to match a package
name [0].

>> +(define %requirement-name-regexp
>> +  ;; Regexp to match the requirement name in a requirement specification.
>> +
>> +  ;; Some grammar, taken from PEP-0508 (see:
>> +  ;; https://www.python.org/dev/peps/pep-0508/).
>> +
>> +  ;; The unified rule can be expressed as:
>> +  ;; specification = wsp* ( url_req | name_req ) wsp*
>> +
>> +  ;; where url_req is:
>> +  ;; url_req = name wsp* extras? wsp* urlspec wsp+ quoted_marker?
>> +
>> +  ;; and where name_req is:
>> +  ;; name_req = name wsp* extras? wsp* versionspec? wsp* quoted_marker?
>> +
>> +  ;; Thus, we need only match NAME, which is expressed as:
>> +  ;; identifier_end = letterOrDigit | (('-' | '_' | '.' )* letterOrDigit)
>> +  ;; identifier = letterOrDigit identifier_end*
>> +  ;; name = identifier
>> +  (let* ((letter-or-digit "[A-Za-z0-9]")
>> +         (identifier-end (string-append "(" letter-or-digit "|"
>> +                                        "[-_.]*" letter-or-digit ")"))
>> +         (identifier (string-append "^" letter-or-digit identifier-end "*"))
>> +         (name identifier))
>> +    (make-regexp name)))
>
> This seems a little too complicated.  Translating a grammar into a
> regexp is probably not a good idea in general.  Since we don’t care
> about anything other than the name it seems easier to just chop off
> the string tail as soon as we find an invalid character.

While I agree that a regexp is a bigger hammer than basic string
manipulation, I see some merit to it here:

1) We can be assured of conformance with upstream, again, per PEP-0508.

2) It is easier to extend; we might want to add parsing for the version
spec in order to disregard dependencies specified for Python < 3, for
example.

The use of the PEP-0508 grammar to define the regexp is useful to
detail, in a more human-friendly language, the components of the
regexp.  We could have otherwise used the more cryptic regexp for
Python distribution names:

--8<---------------cut here---------------start------------->8---
^([A-Z0-9]|[A-Z0-9][A-Z0-9._-]*[A-Z0-9])$
--8<---------------cut here---------------end--------------->8---

So I guess what I'm saying is that I prefer this approach to using
string-index with invalid characters, for the reasons above.

[0] https://www.python.org/dev/peps/pep-0508/

Thanks!

Maxim
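P.S. For anyone who wants to exercise the name rule outside of Guile,
here is a rough Python equivalent (an illustrative sketch, not the Guix
code; the `requirement_name` helper is hypothetical), built from the
same PEP-0508 grammar as the patch:

```python
import re

# PEP-0508 name grammar, mirrored from the Scheme code above:
#   identifier_end = letterOrDigit | (('-' | '_' | '.' )* letterOrDigit)
#   identifier     = letterOrDigit identifier_end*
# i.e. a name starts and ends with a letter or digit, and may contain
# '-', '_' and '.' in between.
LETTER_OR_DIGIT = "[A-Za-z0-9]"
IDENTIFIER_END = f"({LETTER_OR_DIGIT}|[-_.]*{LETTER_OR_DIGIT})"
NAME_RE = re.compile(f"^{LETTER_OR_DIGIT}{IDENTIFIER_END}*")

def requirement_name(spec):
    """Return the distribution name at the start of a requirement
    specification, or None when the spec does not begin with a valid
    name."""
    match = NAME_RE.match(spec)
    return match.group(0) if match else None

# The environment-marker example above is handled without special
# casing ';', unlike a character-set based clean-requirements:
print(requirement_name('argparse;python_version<"2.7"'))  # argparse
print(requirement_name("foo[extra]>=1.0"))                # foo
```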