Re: [Pulp-dev] Lazy for Pulp3

Jeff Ortel Thu, 31 May 2018 15:36:49 -0700


On 05/31/2018 04:39 PM, Brian Bouterse wrote:

I updated the epic (https://pulp.plan.io/issues/3693) to use this new language.

policy=immediate -> downloads now while the task runs (no lazy). Also the
default if unspecified.

policy=cache-and-save -> All the steps in the diagram. Content that is
downloaded is saved so that it's only ever downloaded once.

policy=cache -> All the steps in the diagram except step 14. If squid pushes
the bits out of the cache, they will be re-downloaded to serve other clients
requesting the same bits.
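
To make the three proposed values concrete, here is a minimal Python sketch. It is not Pulp code: the enum and helper names are hypothetical, and it assumes step 14 of the diagram is the step that saves streamed content into Pulp, which the descriptions above imply but the diagram itself is not reproduced here.

```python
from enum import Enum


class DownloadPolicy(Enum):
    """Policy values as proposed in this thread (hypothetical, not a Pulp API)."""
    IMMEDIATE = "immediate"
    CACHE_AND_SAVE = "cache-and-save"
    CACHE = "cache"


def parse_policy(value=None):
    """Return the policy for a remote; 'immediate' is the default if unspecified."""
    if value is None:
        return DownloadPolicy.IMMEDIATE
    return DownloadPolicy(value)


def downloads_during_sync(policy):
    """True when content is fetched while the sync task runs (no lazy)."""
    return policy is DownloadPolicy.IMMEDIATE


def saves_after_streaming(policy):
    """True when streamed content is also saved (assumed to be step 14)."""
    return policy is DownloadPolicy.CACHE_AND_SAVE
```

With these helpers, `cache` and `cache-and-save` differ only in whether the save step runs after a client request is streamed.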

These policy names strike me as an odd, non-intuitive mixture. I think we need to brainstorm on policy names and/or additional attributes to best capture this. Suggest the epic be updated to describe the "modes" or use cases without the names for now. I'll try to follow up with other suggestions.


Also @milan, see inline for answers to your question.

On Wed, May 30, 2018 at 3:48 PM, Milan Kovacik <[email protected]<mailto:[email protected]>> wrote:


    On Wed, May 30, 2018 at 4:50 PM, Brian Bouterse
    <[email protected] <mailto:[email protected]>> wrote:
    >
    >
    > On Wed, May 30, 2018 at 8:57 AM, Tom McKay
    <[email protected] <mailto:[email protected]>> wrote:
    >>
    >> I think there is a use case for "proxy only" like is being described here.
    >> Several years ago there was a project called thumbslug[1] that was used in a
    >> version of katello instead of pulp. Its job was to check entitlements and
    >> then proxy content from a cdn. The same functionality could be implemented
    >> in pulp. (Perhaps it's even as simple as telling squid not to cache anything
    >> so the content would never make it from cache to pulp in current pulp-2.)
    >
    >
    > What would you call this policy?
    > policy=proxy?
    > policy=stream-dont-save?
    > policy=stream-no-save?
    >
    > Are the names 'on-demand' and 'immediate' clear enough? Are there better
    > names?
    >>
    >>
    >> Overall I'm +1 to the idea of an only-squid version, if others think it
    >> would be useful.
    >
    >
    > I understand describing this as an "only-squid" version, but for clarity, the
    > streamer would still be required because it is what requests the bits with
    > the correctly configured downloader (certs, proxy, etc). The streamer
    > streams the bits into squid which provides caching and client multiplexing.

    I have to admit it's just now I'm reading
    https://docs.pulpproject.org/dev-guide/design/deferred-download.html#apache-reverse-proxy
    again because of the SSL termination. So the new plan is to use the
    streamer to terminate the SSL instead of the Apache reverse proxy?

The plan for right now is to not use a reverse proxy and have the client's connection terminate at squid directly either via http or https depending on how squid is configured. The reverse proxy in pulp2's design served to validate the signed urls and rewrite them for squid. This first implementation won't use signed urls. I believe that means we don't need a reverse proxy here yet.



    W/r the construction of the URL of an artifact, I thought it would be
    stored in the DB, so the Remote would create it during the sync.

This is correct. The inbound URL from the client after the redirect will still be a reference that the "Pulp content app" will resolve to a RemoteArtifact. Then the streamer will use that RemoteArtifact data to correctly build the downloader. That's the gist of it at least.
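
A rough Python sketch of that lookup flow, for illustration only. The class names, fields, and the dict-based index are all hypothetical stand-ins (the real models and downloader API are not described in this thread); it only shows the two steps: resolve the request path to a RemoteArtifact, then gather the settings a downloader would need.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class RemoteArtifact:
    """Hypothetical record written during sync; field names are assumptions."""
    url: str      # upstream URL stored in the DB by the Remote
    remote: str   # name of the Remote that owns the download settings


@dataclass(frozen=True)
class Remote:
    """Hypothetical per-remote download settings (certs, proxy, etc.)."""
    name: str
    proxy_url: Optional[str] = None
    client_cert: Optional[str] = None


def resolve(path: str, index: Dict[str, RemoteArtifact]) -> RemoteArtifact:
    """Content-app step: map the inbound request path to a RemoteArtifact."""
    try:
        return index[path]
    except KeyError:
        raise LookupError(f"no RemoteArtifact for {path!r}") from None


def downloader_settings(ra: RemoteArtifact, remotes: Dict[str, Remote]) -> dict:
    """Streamer step: collect what a downloader would need for this artifact."""
    remote = remotes[ra.remote]
    return {"url": ra.url, "proxy": remote.proxy_url, "cert": remote.client_cert}


# Example data (made up for illustration).
remotes = {"upstream": Remote("upstream", proxy_url="http://proxy.example.com:3128")}
index = {"/content/foo.rpm": RemoteArtifact("https://cdn.example.com/foo.rpm", "upstream")}
settings = downloader_settings(resolve("/content/foo.rpm", index), remotes)
```

The point of the split is that the client-facing URL never has to carry the upstream location or credentials; both come from data stored at sync time.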



    >
    > To confirm my understanding this "squid-only" policy would be the same as
    > on-demand except that it would *not* perform step 14 from the diagram here
    > (https://pulp.plan.io/issues/3693). Is that right?
    yup
    >
    >>
    >>
    >> [1] https://github.com/candlepin/thumbslug
    >>
    >> On Wed, May 30, 2018 at 8:34 AM, Milan Kovacik
    <[email protected] <mailto:[email protected]>>
    >> wrote:
    >>>
    >>> On Tue, May 29, 2018 at 9:31 PM, Dennis Kliban
    <[email protected] <mailto:[email protected]>>
    >>> wrote:
    >>> > On Tue, May 29, 2018 at 11:42 AM, Milan Kovacik
    <[email protected] <mailto:[email protected]>>
    >>> > wrote:
    >>> >>
    >>> >> On Tue, May 29, 2018 at 5:13 PM, Dennis Kliban
    <[email protected] <mailto:[email protected]>>
    >>> >> wrote:
    >>> >> > On Tue, May 29, 2018 at 10:41 AM, Milan Kovacik
    >>> >> > <[email protected] <mailto:[email protected]>>
    >>> >> > wrote:
    >>> >> >>
    >>> >> >> Good point!
    >>> >> >> More the second; it might be a bit crazy to utilize Squid for that but
    >>> >> >> first, let's answer the why ;)
    >>> >> >> So why does Pulp need to store the content here?
    >>> >> >> Why don't we point the users to the Squid all the time (for the lazy
    >>> >> >> repos)?
    >>> >> >
    >>> >> >
    >>> >> > Pulp's Streamer needs to fetch and store the content because that's
    >>> >> > Pulp's primary responsibility.
    >>> >>
    >>> >> Maybe not that much the storing but rather the content views
    >>> >> management?
    >>> >> I mean the partitioning into repositories, promoting.
    >>> >>
    >>> >
    >>> > Exactly this. We want Pulp users to be able to reuse content that was
    >>> > brought in using the 'on_demand' download policy in other repositories.
    >>> I see.
    >>>
    >>> >
    >>> >>
    >>> >> > If some of the content lived in Squid and some lived in Pulp, it would
    >>> >> > be difficult for the user to know what content is actually available
    >>> >> > in Pulp and what content needs to be fetched from a remote repository.
    >>> >>
    >>> >> I'd say the rule of thumb would be: lazy -> squid, regular -> pulp
    >>> >> so not that difficult.
    >>> >> Maybe Pulp could have a concept of Origin, where folks upload stuff to
    >>> >> a Pulp repo, vs. Proxy for its repo storage policy?
    >>> >>
    >>> >
    >>> > Squid removes things from the cache at some point. You can probably
    >>> > configure it to never remove anything from the cache, but then we would
    >>> > need to implement orphan cleanup that would work across two systems: pulp
    >>> > and squid.
    >>>
    >>> Actually "remote" units wouldn't need orphan cleaning from the disk,
    >>> just dropping them from the DB would suffice.
    >>>
    >>> >
    >>> > Answering that question would still be difficult. Not all content that is
    >>> > in the repository that was synced using on_demand download policy will be
    >>> > in Squid - only the content that has been requested by clients. So it's
    >>> > still hard to know which of the content units have been downloaded and
    >>> > which have not been.
    >>>
    >>> But the beauty is exactly in that: we don't have to track whether the
    >>> content is downloaded if it is reverse-proxied[1][2].
    >>> Moreover, this would work both with and without a proxy between Pulp
    >>> and the Origin of the remote unit.
    >>> A "remote" content artifact might just need to carry its URL in a DB
    >>> column for this to work; so the async artifact model, instead of the
    >>> "policy=on-demand" would have a mandatory remote "URL" attribute; I
    >>> wouldn't say it's more complex than tracking the "policy" attribute.
    >>>
    >>> >
    >>> >
    >>> >>
    >>> >> >
    >>> >> > As Pulp downloads an Artifact, it calculates all the checksums and its
    >>> >> > size. It then performs validation based on information that was provided
    >>> >> > from the RemoteArtifact. After validation is performed, the Artifact is
    >>> >> > saved to the database and its final place in
    >>> >> > /var/lib/content/artifacts/.
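
The download-then-validate step described above can be sketched in Python roughly as follows. This is an illustration using only the standard library, not Pulp's implementation; the function names, the choice of algorithms, and the shape of the expected values are assumptions.

```python
import hashlib
import io


def digest_stream(stream, algorithms=("sha256", "sha512"), chunk_size=65536):
    """Read a stream once, computing its size and every requested digest."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    size = 0
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        size += len(chunk)
        for hasher in hashers.values():
            hasher.update(chunk)
    return size, {name: h.hexdigest() for name, h in hashers.items()}


def validate(size, digests, expected_size=None, expected_digests=None):
    """Compare computed values with whatever the remote metadata provided."""
    if expected_size is not None and size != expected_size:
        raise ValueError(f"size mismatch: got {size}, expected {expected_size}")
    for name, expected in (expected_digests or {}).items():
        if digests.get(name) != expected:
            raise ValueError(f"{name} digest mismatch")


# Example: validate an in-memory payload against known-good values.
payload = b"example artifact bytes"
size, digests = digest_stream(io.BytesIO(payload))
validate(size, digests,
         expected_size=len(payload),
         expected_digests={"sha256": hashlib.sha256(payload).hexdigest()})
```

Only after `validate` returns without raising would the artifact be recorded and moved to its final location.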
    >>> >>
    >>> >> This could still be achieved by storing the content just temporarily
    >>> >> in the Squid proxy, i.e. use Squid as the content source, not the disk.
    >>> >>
    >>> >> > Once this information is in the database, Pulp's web server can serve
    >>> >> > the content without having to involve the Streamer or Squid.
    >>> >>
    >>> >> Pulp might serve just the API and the metadata, the content might be
    >>> >> redirected to the Proxy all the time, correct?
    >>> >> Doesn't Crane do that btw?
    >>> >
    >>> >
    >>> > Theoretically we could do this, but in practice we would run into problems
    >>> > when we needed to scale out the Content app. Right now when the Content app
    >>> > needs to be scaled, a user can launch another machine that will run the
    >>> > Content app. Squid does not support that kind of scaling. Squid can only
    >>> > take advantage of additional cores in a single machine.
    >>> I don't think I understand; proxies are actually designed to scale[1]
    >>> and are used as tools to scale the web too.
    >>>
    >>> This is all about the How question but when it comes to my original
    >>> Why, please correct me if I'm wrong, the answer so far has been:
    >>> Pulp always downloads the content because that's what it is supposed to
    >>> do.
    >>>
    >>> Cheers,
    >>> milan
    >>>
    >>> [1] https://en.wikipedia.org/wiki/Reverse_proxy
    >>> [2] https://paste.fedoraproject.org/paste/zkBTyxZjm330FsqvPP0lIA
    >>> [3] https://wiki.squid-cache.org/Features/CacheHierarchy?highlight=%28faqlisted.yes%29
    >>>
    >>> >
    >>> >>
    >>> >>
    >>> >> Cheers,
    >>> >> milan
    >>> >>
    >>> >> >
    >>> >> > -dennis
    >>> >> >
    >>> >> >
    >>> >> >
    >>> >> >
    >>> >> >
    >>> >> >>
    >>> >> >>
    >>> >> >> --
    >>> >> >> cheers
    >>> >> >> milan
    >>> >> >>
    >>> >> >> On Tue, May 29, 2018 at 4:25 PM, Brian Bouterse
    >>> >> >> <[email protected] <mailto:[email protected]>>
    >>> >> >> wrote:
    >>> >> >> >
    >>> >> >> > On Mon, May 28, 2018 at 9:57 AM, Milan Kovacik
    >>> >> >> > <[email protected] <mailto:[email protected]>>
    >>> >> >> > wrote:
    >>> >> >> >>
    >>> >> >> >> Hi,
    >>> >> >> >>
    >>> >> >> >> Looking at the diagram[1] I'm wondering what's the reasoning behind
    >>> >> >> >> Pulp having to actually fetch the content locally?
    >>> >> >> >
    >>> >> >> >
    >>> >> >> > Is the question "why is Pulp doing the fetching and not Squid?" or
    >>> >> >> > "why is Pulp storing the content after fetching it?" or both?
    >>> >> >> >
    >>> >> >> >> Couldn't Pulp just rely on the proxy with regards to the content
    >>> >> >> >> streaming?
    >>> >> >> >>
    >>> >> >> >> Thanks,
    >>> >> >> >> milan
    >>> >> >> >>
    >>> >> >> >>
    >>> >> >> >> [1] https://pulp.plan.io/attachments/130957
    >>> >> >> >>
    >>> >> >> >> On Fri, May 25, 2018 at 9:11 PM, Brian Bouterse
    >>> >> >> >> <[email protected] <mailto:[email protected]>>
    >>> >> >> >> wrote:
    >>> >> >> >> > A mini-team of core devs** met to talk through lazy use cases for
    >>> >> >> >> > Pulp3. It's effectively the same lazy from Pulp2 except:
    >>> >> >> >> >
    >>> >> >> >> > * it's now built into core (not just RPM)
    >>> >> >> >> > * it excludes repo protection use cases because we haven't added repo
    >>> >> >> >> >   protection to Pulp3 yet
    >>> >> >> >> > * it excludes the "background" policy which, based on feedback from
    >>> >> >> >> >   stakeholders, provided very little value
    >>> >> >> >> > * it will no longer depend on Twisted. It will use asyncio instead.
    >>> >> >> >> >
    >>> >> >> >> > While it is being built into core, it will require minimal support by a
    >>> >> >> >> > plugin writer to add support for it. Details in the epic below.
    >>> >> >> >> >
    >>> >> >> >> > The current use cases along with a technical plan are written on this
    >>> >> >> >> > epic:
    >>> >> >> >> > https://pulp.plan.io/issues/3693
    >>> >> >> >> >
    >>> >> >> >> > We're putting it out for comment, questions, and feedback before we
    >>> >> >> >> > start into the code. I hope we are able to add this into our next
    >>> >> >> >> > sprint.
    >>> >> >> >> >
    >>> >> >> >> > ** ipanova, jortel, ttereshc, dkliban, bmbouter
    >>> >> >> >> >
    >>> >> >> >> > Thanks!
    >>> >> >> >> > Brian
    >>> >> >> >> >
    >>> >> >> >> >
    >>> >> >> >> > _______________________________________________
    >>> >> >> >> > Pulp-dev mailing list
    >>> >> >> >> > [email protected] <mailto:[email protected]>
    >>> >> >> >> > https://www.redhat.com/mailman/listinfo/pulp-dev
    <https://www.redhat.com/mailman/listinfo/pulp-dev>
    >>> >> >> >> >
    >>> >> >> >
    >>> >> >> >
    >>> >> >>
    >>> >> >
    >>> >> >
    >>> >
    >>> >
    >>>
    >>
    >>
    >>
    >>
    >




_______________________________________________
Pulp-dev mailing list
[email protected]
https://www.redhat.com/mailman/listinfo/pulp-dev