Thanks for all the feedback and discussion. I'll try to give some responses.
I should have sent over this technical outline yesterday. I want to dig deeper into whether RQ has the features Pulp needs.

* RQ already has a worker discovery mechanism built into the heartbeat() <https://github.com/rq/rq/blob/master/rq/worker.py#L534> method. We can override this method so that heartbeats also run our heartbeat code <https://github.com/pulp/pulp/blob/3.0-dev/pulpcore/pulpcore/tasking/celery_app.py#L94-L114>, writing the worker records to the DB the same way they do today.
* We can hook the Task status transitions when tasks start, succeed, or fail with a custom worker subclass here <https://github.com/rq/rq/blob/master/rq/worker.py#L737>. This will handle task state transitions similar to how we do today with Celery. Custom worker objects are supported, see "Custom Worker classes" on this page <http://python-rq.org/docs/workers/>.
* We can cancel and/or delete a task (RQ calls it a job) using the cancel() or delete() instance methods <https://github.com/rq/rq/blob/master/rq/job.py#L522-L577> or the helper function cancel_job() <https://github.com/rq/rq/blob/master/rq/job.py#L58>, which does it by job ID. I'm not sure why this isn't clearly shown in the docs for RQ. This looks equivalent to what we do with Celery.
* We can have Task records get created when RQ jobs are dispatched by using a custom Job object, which is also supported, see "Custom Job classes" on this page <http://python-rq.org/docs/workers/>.
* We can have the code that detects which task id it's running in <https://github.com/pulp/pulp/blob/3.0-dev/pulpcore/pulpcore/tasking/util.py#L82> use this RQ method that does the same <https://github.com/rq/rq/blob/master/rq/job.py#L75>.
* Also, RQ makes the key assumption that Pulp needs, which is that each worker only processes 1 thing at a time. From the RQ docs: "Each worker will process a single job at a time. Within a worker, there is no concurrent processing going on."
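
To make the heartbeat and state-transition points concrete, here is a rough sketch of the subclassing pattern I have in mind. It deliberately does not import RQ so that it runs without Redis: BaseWorker is a hypothetical stand-in, and the method names and signatures, the dict-based "tables", and the state names are all assumptions for illustration. The real hooks would need to be checked against the RQ source linked above.

```python
from datetime import datetime, timezone


class BaseWorker:
    """Hypothetical stand-in for an RQ-style worker (NOT RQ's real class)."""

    def __init__(self, name):
        self.name = name

    def heartbeat(self):
        """A real worker would refresh its liveness information here."""

    def perform_job(self, job):
        """Run the job; return True on success and False on failure."""
        try:
            job["func"](*job.get("args", ()))
        except Exception:
            return False
        return True


class PulpWorker(BaseWorker):
    """Adds Pulp-style bookkeeping around the inherited hooks."""

    def __init__(self, name, worker_records, task_states):
        super().__init__(name)
        self.worker_records = worker_records  # stand-in for the worker DB table
        self.task_states = task_states        # stand-in for the Task DB table

    def heartbeat(self):
        # Keep the base behaviour, then also record the worker as alive.
        super().heartbeat()
        self.worker_records[self.name] = datetime.now(timezone.utc)

    def perform_job(self, job):
        # Record the state transitions around the actual execution.
        self.task_states[job["id"]] = "running"
        succeeded = super().perform_job(job)
        self.task_states[job["id"]] = "completed" if succeeded else "failed"
        return succeeded


def sync(repo):
    """Hypothetical task that succeeds."""
    return "synced " + repo


def broken():
    """Hypothetical task that always fails."""
    raise RuntimeError("boom")


worker_records, task_states = {}, {}
worker = PulpWorker("worker-1", worker_records, task_states)
worker.heartbeat()
worker.perform_job({"id": "task-a", "func": sync, "args": ("repo-1",)})
worker.perform_job({"id": "task-b", "func": broken})
print(task_states)
```

The point of the sketch is only that both behaviours live in one subclass and call the parent implementation, so the library's own heartbeat and job execution stay untouched.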

Plugin writers can make their RQ jobs just like they made Celery jobs. They work in an equivalent way.

I'm hoping to first determine if we think there are gaps. Some help on validating the points above would be good. If no gaps are discovered, I estimate a prototype PR could be made within a day or two. I think it could be that easy.

In terms of what library to pick, what I like about RQ is that it seems active, healthy, and mature. It's over 5 years old with 124 committers, 44 releases, 111 open issues, 430 closed issues, and daily commits. I think TaskTiger is also a fine choice and very similar, but it isn't as old and established. I have not looked into Kuyruk or Dramatiq because I want to get off of RabbitMQ altogether, not just Celery. If we support RabbitMQ we also have to support Qpid, which creates issues for users since they work pretty differently. I didn't state that before, so it's probably good that I state it now.

Please send more ideas, questions, and concerns!

-Brian

On Wed, Mar 21, 2018 at 8:13 AM, Ina Panova <ipan...@redhat.com> wrote:

> +1 what said dalley.
>
> Whatever we'd decide to replace celery with, should not go before beta
> that's for sure.
> I am +10000 to get rid of celery, but with something that would not have
> other limitations which would bring just different kind of pain. [0]
> Let's keep searching and evaluating alternatives.
>
> [0] https://www.youtube.com/watch?v=Qmhc7tZ6ElQ
>
> --------
> Regards,
>
> Ina Panova
> Software Engineer| Pulp| Red Hat Inc.
>
> "Do not go where the path may lead,
> go instead where there is no path and leave a trail."
>
> On Tue, Mar 20, 2018 at 9:52 PM, Daniel Alley <dal...@redhat.com> wrote:
>
>> Another option is TaskTiger (https://github.com/closeio/tasktiger) which
>> really hooked me with their tagline.
>>
>> But I really just don't see how we could pull this off responsibly in the
>> next month (or even 3 months).
>> Assuming the functionality gaps can be
>> worked out, it then becomes a question of whether that amount of change
>> would be acceptable in the interim period between betas.
>>
>> On Tue, Mar 20, 2018 at 4:39 PM, Daniel Alley <dal...@redhat.com> wrote:
>>
>>> As Brian said, Celery has a lot of limitations and drawbacks, a lot of
>>> code complexity, and an upstream that is not terribly responsive. I, too,
>>> would love to see us move away from Celery at some point.
>>>
>>> But having done a little bit of research over the last few hours since
>>> it was mentioned, I have some concerns about the gaps between Celery and
>>> RQ, and I don't think that changing Pulp to use RQ would be as trivial as
>>> we hope.
>>>
>>> I'll start with the benefits of RQ, from what I've read so far.
>>>
>>>    - It has task prioritization that *actually works*, which would help
>>>    resolve the issue where reserved resource work tasks get choked out by
>>>    less important tasks like applicability. The officially recommended
>>>    solution that Celery provides for this is... have dedicated workers for
>>>    each priority level. Not ideal.
>>>    - The documentation is pretty good, from what I can tell. The
>>>    Celery documentation is usually OK but sometimes... lacking.
>>>    - RQ is a lot more straightforward and less complex to use, from
>>>    what I can tell
>>>
>>> But, problems:
>>>
>>>    - RQ does not support revoking tasks. If you send the worker a
>>>    SIGINT, it will finish the task and then stop processing new ones. If you
>>>    send the worker SIGKILL, it will stop immediately, but I don't think it
>>>    gracefully handles this circumstance.
>>>       - People have rolled their own revoke functionality, but we
>>>       should really look at this.
>>>    - When an RQ task fails, it does not provide a mechanism to
>>>    automatically run a piece of code. It puts the task on a "failed" queue
>>>    and the python handle for it will have is_failed set to True.
>>>    This means we would have to redesign how failed tasks are cleaned up.
>>>    - I have no idea what happens when RQ loses connection to Redis, I
>>>    couldn't find that info anywhere. Celery (in theory, at least, reality is
>>>    mushy) will try to reconnect to the broker.
>>>    - I have no idea how well RQ deals with persistence
>>>
>>> Also... we have shaped large parts of our API around what Celery does.
>>> Undoing this would be very... nontrivial and I don't think it is possible
>>> before the beta date, and definitely not if we want to guarantee some level
>>> of stability.
>>>
>>> I'll keep looking but as much as I despise working with Celery I don't
>>> think we can make this move without a lot more research to make sure these
>>> problems are solvable.
>>>
>>> On Tue, Mar 20, 2018 at 4:03 PM, Austin Macdonald <aus...@redhat.com>
>>> wrote:
>>>
>>>> Not being familiar with RQ, I have questions (but no opinion).
>>>>
>>>> Will we also be replacing RabbitMQ with Redis?
>>>> Does anyone on the team have experience with RQ? In production?
>>>> How well does RQ scale?
>>>> Is RQ's use of `pickle` a problem? https://pulp.plan.io/issues/23
>>>> RQ doesn't work on Windows. Is that a problem? (jk)
>>>>
>>>> On Tue, Mar 20, 2018 at 3:35 PM, Brian Bouterse <bbout...@redhat.com>
>>>> wrote:
>>>>
>>>>> Motivation:
>>>>> 1. Celery causes many bugs and issues for Pulp2 and 3 users and there
>>>>> is no end in sight.
>>>>>
>>>>> 2. The Pulp core team spends a lot of effort fixing Celery bugs. It's
>>>>> oftentimes just us doing it with little or no assistance from the upstream
>>>>> communities. It's across 4 projects: celery, kombu, billiard, and pyamqp.
>>>>>
>>>>> 3. Celery will never allow a coverage report to be generated when
>>>>> pulp-smash runs because Celery forked the multiprocessing library into
>>>>> something called billiard. This will limit Pulp forever.
>>>>>
>>>>> 4. I don't want to work with Celery anymore and I think the other
>>>>> maintainers (@dalley, @daviddavis) may feel the same. It's an endless
>>>>> headache. Even basic things don't work in Celery regularly.
>>>>>
>>>>> Proposed change: Replace Pulp3's usage of Celery with RQ (
>>>>> http://python-rq.org/)
>>>>>
>>>>> We would keep the exact same design of a resource manager with n
>>>>> workers, each worker pulling its work exclusively from a dedicated queue.
>>>>> I've looked into porting pulp3 to it and it's doable because all the same
>>>>> concepts are there. There are a few details to work out, but I wanted to
>>>>> start the "should we" discussion before we do all-out technical planning.
>>>>>
>>>>> When would we do this? I'm proposing soon. It doesn't need to block
>>>>> the beta, but soon would be good. I don't think users will care much except
>>>>> for their systemd files, but it is fundamental and important to pulp3 so we
>>>>> want to get it testing sooner.
>>>>>
>>>>> Ideas, comments, questions are welcome!
>>>>>
>>>>> Thanks,
>>>>> Brian
>>>>>
>>>>> _______________________________________________
>>>>> Pulp-dev mailing list
>>>>> Pulp-dev@redhat.com
>>>>> https://www.redhat.com/mailman/listinfo/pulp-dev