On Thursday, December 15, 2016, Nick Coghlan <ncogh...@gmail.com> wrote:

> On 16 December 2016 at 05:50, Paul Moore <p.f.mo...@gmail.com> wrote:
> > On 15 December 2016 at 19:13, Wes Turner <wes.tur...@gmail.com> wrote:
> >>> Just to add my POV, I also find your posts unhelpful, Wes. There's not
> >>> enough information for me to evaluate what you say, and you offer no
> >>> actual solutions to what's being discussed.
> >>
> >>
> >> I could quote myself suggesting solutions in this thread, if you like?
> >
> > You offer lots of pointers to information. But that's different.
>
> Exactly. There are *lots* of information processing standards out
> there, and lots of things we *could* provide natively that simply
> aren't worth the hassle since folks that care can provide them as
> "after market addons" for the audiences that considers them relevant.
>
> For example, a few things that can matter to different audiences are:
>
> - SPDX (Software Package Data Exchange) identifiers for licenses
> - CPE (Common Platform Enumeration) and SWID (Software Identification)
> tags for published software
> - DOI (Digital Object Identifier) tags for citation purposes
> - Common Criteria certification for software supply chains


These are what RDFS calls properties.

It takes very little effort to add additional properties: if an
unqualified attribute is not listed in a JSON-LD @context, it can still be
added by specifying a full URI as the key.
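
For instance (a hand-wavy sketch; the extra property URI below is invented
purely for illustration):

    # Minimal metadata record. Terms not defined in the @context can still
    # be attached by using a full URI as the key; JSON-LD-aware consumers
    # keep the statement, everything else just sees an unfamiliar key.
    doc = {
        "@context": {
            "name": "http://schema.org/name",
            "version": "http://schema.org/softwareVersion",
        },
        "name": "example-package",
        "version": "1.0",
        # not in the @context, so qualify it with a URI:
        "http://example.org/vocab#spdxLicenseId": "MIT",
    }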


>
> I don't push for these upstream in distutils-sig not because I don't
> think they're important in general, but because I *don't think they're
> a priority for distutils-sig*. If you're teaching Python to school
> students, or teaching engineers and scientists how to better analyse
> their own data, or building a web service for yourself or your
> employer, these kinds of things simply don't matter.


#31 lists a number of advantages. Off the top of my head: CVE security
reports could be linked to the project/package URI (and thus displayed
along with the project detail page).
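
Sketched roughly (the URIs and the property choice are illustrative, not an
agreed vocabulary):

    # A third-party advisory record that points at a stable project URI;
    # anything that knows that URI (e.g. the project detail page) can pick
    # the advisory up without bespoke JOIN logic.
    advisory = {
        "@context": {
            "name": "http://schema.org/name",
            "about": {"@id": "http://schema.org/about", "@type": "@id"},
        },
        "@id": "https://example.org/advisories/CVE-2016-00000",
        "name": "CVE-2016-00000",
        "about": "https://pypi.org/project/example-package/",
    }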


>
> The end users that care about them are well-positioned to tackle them
> on their own (or pay other organisations to do it for them), and
> because they span arbitrary publishing communities anyway, it doesn't
> really matter all that much if any given publishing community
> participates directly in the process (the only real beneficiaries are
> the intermediaries that actively blur the distinctions between the
> cooperative communities and the recalcitrant ones).


Linked Data minimizes exactly that kind of re-work and maximizes the
data-integration potential across publishing communities.


>
> > Anyway, let's just agree to differ - I can skip your mails if they
> > aren't helpful to me, and you don't need to bother about the fact that
> > you're not getting your points across to me.
>
> I consider it fairly important that we have a reasonably common
> understanding of the target userbase for direct consumption of PyPI
> data, and what we expect to be supplied as third party services. It's
> also important that we have a shared understanding of how to
> constructively frame proposals for change.


When I can afford the time, I'll again take a look at fixing the metadata
specification once and for all by (1) defining an @context for the existing
metadata, (2) producing an additional pydist.jsonld metadata document
(TODO; the releases are currently keyed by version), and (3) adding the
model attribute and view to Warehouse.
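
To make (1) a bit more concrete, a rough sketch (the URI mappings are my
own illustrative choices, and pyld is assumed to be available):

    from pyld import jsonld

    # (1) an @context that maps the existing metadata keys onto URIs,
    # without changing the JSON that existing tools already read.
    context = {
        "name": "http://schema.org/name",
        "version": "http://schema.org/softwareVersion",
        "summary": "http://schema.org/description",
        "home_page": {"@id": "http://schema.org/url", "@type": "@id"},
    }

    metadata = {
        "@context": context,
        "name": "example-package",
        "version": "1.0",
        "summary": "An example package",
        "home_page": "https://example.org/example-package",
    }

    # Expansion rewrites the familiar keys as full URIs (i.e. RDF-ready).
    print(jsonld.expand(metadata))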


>
> For the former, the Semantic Web, and folks that care about Semantic
> Web concepts like "Linked Data" in the abstract sense are not part of
> our primary audience. We don't go out of our way to make their lives
> difficult, but "it makes semantic analysis easier" also isn't a
> compelling rationale for change.


Unfortunately, this group doesn't seem well-versed in the problems that
Linked Data solves; the working assumptions are that it's all your data, in
your schema, in your database, and that URIs are far less useful than
RAM-local references (pointers).

See: BP-LD (the W3C "Best Practices for Publishing Linked Data" note)


>
> For the latter, some variants of constructive proposals look like:
>
> - "this kind of user has this kind of problem and this proposed
> solution will help mitigate it this way (and, by the way, here's an
> existing standard we can use)"
> - "this feature exists in <third party tool or service>, it's really
> valuable to users for <these reasons>, how about we offer it by
> default?"
> - "I wrote <thing> for myself, and I think it would also help others
> for <these reasons>, can you help me make it more widely known and
> available?"


One could stuff additional metadata into # comments in a requirements.txt,
but that would be an ad-hoc parsing scheme with a single-point-of-failure
(SPOF) tool dependency.
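
For example, the sort of ad-hoc scheme I mean (purely illustrative; no tool
actually defines these comment keys):

    import re

    # Metadata smuggled into a requirements.txt comment: every consumer
    # needs the same one-off parser, and a typo silently drops the data.
    line = "example-package==1.0  # license=MIT doi=10.0000/example"
    requirement, _, comment = line.partition("#")
    extras = dict(re.findall(r"(\w+)=(\S+)", comment))
    print(requirement.strip(), extras)  # ad-hoc keys, no shared vocabulary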


> They don't look like "Here's a bunch of technologies and organisations
> that exist on the internet that may in some way potentially be
> relevant to the management of a software distribution network", and
> nor does it look like "This data modeling standard exists, so we
> should use it, even though it doesn't actually simplify our lives or
> our users' lives in any way, and in fact makes them more complicated".


Those badges we all (!) add to our README.rst long_descriptions point to
third-party services with lots of potentially structured linked data that
is very relevant to curating a collection of resources: test coverage,
build stats, discoverable documentation that could be searched en masse,
security vulnerability reports, downstream packages. Unfortunately they're
just plain <a href> links, whereas they could be <a href property="URI">
edges that other tools could make use of.
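
For instance, the same badge row expressed as data (a sketch; the property
URIs other than schema.org/codeRepository are invented here):

    # Badge targets as typed edges off the project URI, instead of bare
    # <a href> links buried in rendered HTML.
    project = {
        "@id": "https://pypi.org/project/example-package/",
        "http://schema.org/codeRepository": {
            "@id": "https://github.com/example/example-package"},
        "http://example.org/vocab#buildStatus": {
            "@id": "https://travis-ci.org/example/example-package"},
        "http://example.org/vocab#coverageReport": {
            "@id": "https://coveralls.io/r/example/example-package"},
    }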


>
> > Who knows, one day I
> > might find the time to look into JSON-LD, at which point I may or may
> > not understand why you think it's such a useful tool for solving all
> > these problems (in spite of the fact that no-one else seems to think
> > the same...)


It would be logically fallacious of me to suggest, without an understanding
of a web-standard graph representation format, that it's not sufficient (or
ideally suited) for these very use cases.

#31 somewhat laboriously lists the ROI (TODO), though I haven't yet had the
time for an impact study.


>
> I *have* looked at JSON-LD (based primarily on Wes's original
> suggestions), both from the perspective of the Python packaging
> ecosystem specifically, as well as my day job working on software
> supply chain management.


I recognize your expertise and your preference for particular Linux
distributions.

I can tell you that, while many of the linked data examples describe social
graph applications regarding Bob and Alice, there are very many domains
where Linked Data is worth learning: medicine (research, clinical), open
government data (where tool-dependence is a no-no and a lost opportunity).

When you have data in lots of different datasets, it really starts to make
sense to (a small sketch follows below):
- use URIs as keys
- use URIs as column names
- recognize that you're otherwise just reimplementing graph semantics that
are already well solved (RDF, RDFS, OWL, and now JSON-LD because JS)
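
The sketch (the row and the URIs are illustrative):

    # A row from a tabular dataset where the row key and the column headers
    # are URIs: the row is already a set of graph edges (triples), with no
    # separate schema document to ship around.
    row_subject = "https://pypi.org/project/example-package/"
    row = {
        "http://schema.org/name": "example-package",
        "http://schema.org/softwareVersion": "1.0",
    }
    triples = [(row_subject, column, value) for column, value in row.items()]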


>
> My verdict was that for managing a dependency graph implementation, it
> ends up in the category of technologies that qualify as "interesting,
> but not helpful". In many ways, it's the urllib2 of data linking -
> just as urllib2 gives you a URL handling framework which you can
> configure to handle HTTP rather than just providing a HTTP-specific
> interface the way requests does [1], JSON-LD gives you a data linking
> framework, which you can then use to define links between your data,
> rather than just linking the data directly in a domain-appropriate
> fashion. Using a framework for the sake of using a framework rather
> than out of a genuine engineering need doesn't tend to lead to good
> software systems.


Interesting analogy: urllib, urlparse, urllib2, urllib3, requests; and now
we have certificate hostname checking.

SemWeb standards are layered in the same way. Other standards for triples
in JSON exist, and still do, but none of them could map an existing JSON
document to RDF with JSON-LD's flexibility.

(Your analogy is then followed by a truism.)



>
> Wes seems to think that my perspective on this is born out of
> ignorance, so repeatedly bringing it up may make me change my point of
> view. However, our problems haven't changed, and the nature and
> purpose of JSON-LD haven't changed, so it really won't - the one thing
> that will change my mind is demonstrated popularity and utility of a
> service that integrates raw PyPI data with JSON-LD and schema.org.
>
> Hence my suggestions (made with varying degrees of politeness) to go
> build a dependency analysis service that extracts the dependency trees
> from libraries.io, maps them to schema.org concepts in a way that
> makes sense, and then demonstrate what that makes possible that can't
> be done with the libraries.io data directly.


A chicken-and-egg problem, ironically.

There are many proprietary solutions for aggregating software-quality
information; every one of them must write parsers and JOIN logic for each
packaging ecosystem's ad-hoc, partial graph implementation.

When the key of a thing is a URI, other datasets that reference the same
URI just magically join together: the justifying metadata for a package in
a curated collection could simply join with the actual package metadata
(and the aforementioned data sources).

>
> Neither Wes nor anyone else needs anyone's permission to go do that,
> and it will be far more enjoyable for all concerned than the status
> quo where Wes is refusing to take "No, we've already looked at
> JSON-LD, and we believe it adds needless complexity for no benefit
> that we care about" for an answer by continuing to post about it
> *here* rather than either venting his frustrations about our
> collective lack of interest somewhere else, or else channeling that
> frustration into building the system he wishes existed.


If you've never written an @context for an existing JSON schema, I question
both your assessment of the complexity and your experience with sharing
graph data with myriad applications;
But that's irrelevant,
Because here all I think I need is a table of dataset-local autoincrement
IDs and some columns,
And ALTER TABLE migrations,
And then someone else can write a parser for the schema I expose with my
JSON REST API,
So that I can JOIN this data with other useful datasets
(In order to share a versioned Collection of CreativeWorks which already
have URIs);
Because I'm unsatisfied with requirements.txt,
Because it's line-based,
And I can't just add additional attributes,
And there's no key, because of indexes and editable URLs stuffed with
checksums and egg/wheel names,
Oh, and the JSON metadata specification is fixed and doesn't support URI
attribute names,
So I can't just add additional attributes or values from a controlled
vocabulary as-needed,
Unless it's reStructuredText-rendered-to-HTML
(now with pypi:readme).

Is that in PEP form?

Someone should really put together an industry group to produce some UML
here,
So our tools can talk,
And JOIN on URIs
With versions, platforms, and custom URI schemes,
In order to curate a collection of packages,
As a team,
With group permissions,
according to structured criteria (and comments!)

Because "ld-signatures"


A #LinkedMetaAnalyses (#LinkedReproducibility) application would similarly
support curation of resources with URIs and defined criteria in order to
elicit redundant expert evaluations of CreativeWorks (likely with JSONLD).
In light of the reception here, that may be a better use of resources.


I've said my piece; I'll leave you all to find a solution to this use case
that minimizes re-work and maximizes data-integration potential.

[Pip checksum docs, JSONLD]
https://github.com/pypa/interoperability-peps/issues/31#issuecomment-160970112


> Cheers,
> Nick.
>
> [1] http://www.curiousefficiency.org/posts/2016/08/what-
> problem-does-it-solve.html
>
> --
> Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane,
> Australia
>
_______________________________________________
Distutils-SIG maillist  -  Distutils-SIG@python.org
https://mail.python.org/mailman/listinfo/distutils-sig
