Hello!

Ricardo Wurmus <ricardo.wur...@mdc-berlin.de> writes:
> Patch number 3!  Yay!
>
>> From 0c62b541a3e8925b5ca31fe55dbe7536cf95151f Mon Sep 17 00:00:00 2001
>> From: Maxim Cournoyer <maxim.courno...@gmail.com>
>> Date: Thu, 28 Mar 2019 00:26:01 -0400
>> Subject: [PATCH 3/9] import: pypi: Improve parsing of requirement
>>  specifications.
>>
>> The previous solution was fragile and could leave unwanted characters
>> in a requirement name, such as '[' or ']'.
>
> Wouldn’t it be sufficient to add [ and ] to the list of forbidden
> characters?  The tests pass with this implementation of
> clean-requirements:
>
> (define (clean-requirements s)
>   (cond
>    ((string-index s (char-set #\space #\> #\= #\< #\[ #\]))
>     => (cut string-take s <>))
>    (else s)))

Indeed this would be sufficient to make the tests pass, but the tests
don't cover all the cases; as an example, consider:

--8<---------------cut here---------------start------------->8---
argparse;python_version<"2.7"
--8<---------------cut here---------------end--------------->8---

While we could make it work with the current logic by adding more
invalid characters (such as ';' here) to the character set, it seems
less error-prone to use the upstream-provided regexp to match a package
name [0].

>> +(define %requirement-name-regexp
>> +  ;; Regexp to match the requirement name in a requirement specification.
>> +
>> +  ;; Some grammar, taken from PEP-0508 (see:
>> +  ;; https://www.python.org/dev/peps/pep-0508/).
>> +
>> +  ;; The unified rule can be expressed as:
>> +  ;; specification = wsp* ( url_req | name_req ) wsp*
>> +
>> +  ;; where url_req is:
>> +  ;; url_req = name wsp* extras? wsp* urlspec wsp+ quoted_marker?
>> +
>> +  ;; and where name_req is:
>> +  ;; name_req = name wsp* extras? wsp* versionspec? wsp* quoted_marker?
>> +
>> +  ;; Thus, we need only match NAME, which is expressed as:
>> +  ;; identifier_end = letterOrDigit | (('-' | '_' | '.' )* letterOrDigit)
>> +  ;; identifier = letterOrDigit identifier_end*
>> +  ;; name = identifier
>> +  (let* ((letter-or-digit "[A-Za-z0-9]")
>> +         (identifier-end (string-append "(" letter-or-digit "|"
>> +                                        "[-_.]*" letter-or-digit ")"))
>> +         (identifier (string-append "^" letter-or-digit identifier-end "*"))
>> +         (name identifier))
>> +    (make-regexp name)))
>
> This seems a little too complicated.  Translating a grammar into a
> regexp is probably not a good idea in general.  Since we don’t care
> about anything other than the name it seems easier to just chop off
> the string tail as soon as we find an invalid character.

While I agree that a regexp is a bigger hammer than basic string
manipulation, I see some merit to it here:

1) We can be assured of conformance with upstream, again, per PEP-0508.

2) It is easier to extend; we might want to add parsing for the version
spec in order to disregard dependencies specified for Python < 3, for
example.

The use of the PEP-0508 grammar to define the regexp is useful to
detail, in a more human-friendly language, the components of the
regexp.  We could have otherwise used the more cryptic regexp for
Python distribution names:

--8<---------------cut here---------------start------------->8---
^([A-Z0-9]|[A-Z0-9][A-Z0-9._-]*[A-Z0-9])$
--8<---------------cut here---------------end--------------->8---

So I guess what I'm saying is that I prefer this approach to using
string-index with invalid characters, for the reasons above.

[0] https://www.python.org/dev/peps/pep-0508/

Thanks!

Maxim
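P.S. For anyone who wants to exercise the name rule outside of Guile,
here is a rough Python equivalent (an illustrative sketch, not the Guix
code; the `requirement_name` helper is hypothetical), built from the
same PEP-0508 grammar as the patch:

```python
import re

# PEP-0508 name grammar, mirrored from the Scheme code above:
#   identifier_end = letterOrDigit | (('-' | '_' | '.' )* letterOrDigit)
#   identifier     = letterOrDigit identifier_end*
# i.e. a name starts and ends with a letter or digit, and may contain
# '-', '_' and '.' in between.
LETTER_OR_DIGIT = "[A-Za-z0-9]"
IDENTIFIER_END = f"({LETTER_OR_DIGIT}|[-_.]*{LETTER_OR_DIGIT})"
NAME_RE = re.compile(f"^{LETTER_OR_DIGIT}{IDENTIFIER_END}*")

def requirement_name(spec):
    """Return the distribution name at the start of a requirement
    specification, or None when the spec does not begin with a valid
    name."""
    match = NAME_RE.match(spec)
    return match.group(0) if match else None

# The environment-marker example above is handled without special
# casing ';', unlike a character-set based clean-requirements:
print(requirement_name('argparse;python_version<"2.7"'))  # argparse
print(requirement_name("foo[extra]>=1.0"))                # foo
```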