Hi Pulsar Community,

Here are the meeting notes from Thursday's community meeting. Thanks to all who participated!
Disclaimer: if something is misattributed or misrepresented, please send a correction to this list.

Source google doc: https://docs.google.com/document/d/19dXkVXeU2q_nHmkG8zURjKnYlvD96TbKf5KjYyASsOE

Thanks,
Michael

2022/09/15 (8:30 AM PST)

Attendees:
- Matteo Merli
- Christophe Bornet
- Asaf Mesika
- Qiang Zhao
- Heesung Sohn
- Andrey Yegorov
- Nicolò Boschi
- Ayman Khalil
- Dave Fisher
- Lari Hotari
- Michael Marshall

Discussions/PIPs/PRs (generally discussed in the order they appear):

- Pulsar CI status and a proposed change to address the resource shortage. CI has been slow for the past three or four weeks. Lari opened a GitHub support ticket; it took a week to get a response, and it wasn't very helpful. It appears our resources have been reduced. Lari made a proposal on the mailing list to update the workflow. The summary: when a PR is opened, we only run a quick/light check and do not proceed further. We then add a "ready to test" label to the PR before running the later tests. PR authors can run the tests in their own fork if they rely on them for feedback. When reviewing, a committer can provide instructions/documentation to the contributor. When tests are passing, we would add a comment to tell the Pulsar bot to run the tests.
  - Matteo: it's unfortunate because it complicates the flow. Could we defer testing until PRs are approved?
  - Lari: the problem is that we need the test feedback to approve them. That's why the contributor should run the tests on their own side.
  - Matteo: we should only merge when tests pass; approval isn't the same as tests passing.
  - Lari: the idea is that the user would run the tests in their own fork.
  - Qiang Zhao: maybe we can run the quick checks first, like checkstyle, spotbugs, etc. After the PR is approved, we can continue with the other tests.
  - Dave: what if they do not have all of the commits in their fork? This is a moving target.
  - Michael: that is currently the case in the apache/pulsar repo too.
  - Dave: what if we only run certain tests for certain use cases?
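For illustration, the label-gated flow Lari describes could look roughly like the following GitHub Actions workflow. This is a hypothetical sketch, not the actual proposal text: the `ready-to-test` label name comes from the discussion above, but the job names and Maven goals are placeholders.

```yaml
# Hypothetical sketch of a label-gated CI workflow (not the real apache/pulsar config).
name: Pulsar CI
on:
  pull_request:
    types: [opened, synchronize, labeled]

jobs:
  quick-checks:
    # Cheap checks always run on every PR push.
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - run: mvn -B -ntp checkstyle:check spotbugs:check

  full-tests:
    # Heavy unit/integration tests run only after a committer adds the label.
    if: contains(github.event.pull_request.labels.*.name, 'ready-to-test')
    needs: quick-checks
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - run: mvn -B -ntp verify
```

With this shape, opening or pushing to a PR only triggers `quick-checks`; adding the label fires a `labeled` event, which re-evaluates the `if` condition and starts the full suite.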
  - Lari: we are trying to address that by moving certain modules out of the project.
  - Matteo: I think we should move the C++ and Python clients to a separate repo.
  - Michael: that would be helpful for release management and for letting C++ move at its own pace. We would need to make it work for the docker image, which includes the Python wheels.
  - Matteo: we could make it work by only shipping released versions of the Python client.
  - Michael: that would definitely speed up releases for apache/pulsar and the build for the docker image.
  - Matteo: it would also reduce resource utilization on CI because we wouldn't build the C++.
  - Lari: the proposal to have contributors build their project in their own repo is important for reducing resource utilization. Apache Spark already requires this.
  - Matteo: it could help to have automation that reports build results from other repos. Coming back to my earlier point, why not gate tests on PR approval?
  - Lari: this feature is available out of the box for users without write access to the repo. Why not use the label solution?
  - Matteo: I see the compiler/test validation as separate from human validation/approval. Additionally, if the tests are passing in their fork, why don't we just trust their tests?
  - Lari: I think there could be recency issues when verifying that the other tests are the right tests, because you wouldn't want them to run against an old version of master.
  - Matteo: as long as it is the same branch, they will be the same tests we would run in the apache/pulsar repo.
  - Lari: it's good to be sure that the pipeline is the same as in our repo.
  - Matteo: the pipeline has to be the same because it is tied to the branch. It is not that I have a different configuration in my fork; there will be a diff visible in the PR that shows any points of divergence. A bot could guide users through this process and then also verify that the other repo's tests pass.
  - Lari: that is doable; one way is for the bot to create a link that creates that PR.
  - Matteo: I'm fine with that. After that, we should use the bot to track the status of the other repo's test results and report back to the PR.
  - Lari: this will reduce resource consumption, so it could be faster to start without the bot reporting back.
  - Matteo: I don't think the resource constraints will go away soon.
  - Lari: if we're able to cut 60% of the current consumption, we'll be fine and won't have any delays. There is also work to move Pulsar SQL out of the main repo.
  - Dave: do we need an official PIP for that?
  - Matteo: yes, I think so.
  - Asaf: because the protocol is not thin, do we risk conflicts for the other clients?
  - Matteo: the advantage right now is that if you break the protocol in Java, it will break the C++ tests. The flip side is that we have a humongous repo. As for breaking the protocol: that is why we require PIPs for protocol changes, to get eyes on them. The same goes for the Go client, which already lives outside the repo.
  - Dave: Node.js too.
  - Michael: the cost of the C++ build is high, and we don't change the protocol that often.
  - Matteo: we could use a nightly job to build the Pulsar docker image and then have the Go and C++ clients test against that nightly build. That would help catch issues sooner.
  - Dave: you could also just run those tests against the latest release.
  - Matteo: the problem is that that doesn't guarantee we catch things before upgrading. If we run against master, we'll discover issues earlier.
  - Dave: why not do both, testing against the latest release and current master?
  - Matteo: sure.
  - Dave: that would let us do regression tests. I like the idea of nightly integrations.
  - Matteo: we should publish nightly images for the broker and use those.
  - Lari: one note about Pulsar IO and Pulsar SQL: we'll need to be careful to make sure their tests are run similarly to the client tests just discussed. We'll need a way to build the pulsar-all docker image.
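The nightly-image idea Matteo sketches could be wired up along these lines. This is hypothetical: the cron schedule, the Maven invocation, and the `example/pulsar:nightly` tag are all placeholders, not anything agreed in the meeting.

```yaml
# Hypothetical nightly job: build a broker image from master so the
# out-of-repo clients (Go, C++) can run integration tests against it.
name: nightly-broker-image
on:
  schedule:
    - cron: '0 3 * * *'   # once per night

jobs:
  build-and-publish:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
        with:
          ref: master
      - name: Build and push nightly image
        run: |
          mvn -B -ntp install -DskipTests -Pdocker
          docker tag apachepulsar/pulsar:latest example/pulsar:nightly
          docker push example/pulsar:nightly
```

The client repos would then pull `example/pulsar:nightly` in their own CI, alongside the latest released image for the regression comparison Dave suggests.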
    There is currently https://github.com/apache/pulsar-release.
  - Matteo: the all image is often a bit too much. I'd rather keep SQL out of the image. With the connectors, it is practical to have all of them, but it's also a 3 GB image. We could give users a way to easily install connectors.
  - Michael: 3 GB is definitely big.
  - Matteo: there are definitely trade-offs.
- Michael: would it be valuable to move away from the stale label to a label that indicates that a PR needs reviews? Then a label could make it clear whether a PR needs reviews or whether the contributor needs to make changes.
  - Matteo: it could help to make it clearer. There is a lot of duplication of effort; we want multiple reviewers, but we also want to distribute time well to review many PRs.
  - Michael: similarly, issues do not always get responses. We could try to coordinate on this by asking for volunteers and scheduling.
- Michael: sharing an observation about watching the ZooKeeper root znode `/`. The BookKeeper client watches all of these when it gets its lock on its /ledgers/available/bookie-id znode. The bookie doesn't need any other notifications for that client. I know there are reasons we've identified for wanting the persistent recursive watch on /, but it seems like a lot of overhead for the single bookie client.
  - Matteo: there is a lot of complexity here. The main cost is getting notifications and then checking a hash map to see if the client cares about the update. That should be a fast check. The bookie will cache the list of ledgers. Before, it was always fetching those from ZooKeeper, and very frequently; now the caching decreases the network utilization. And it's not just the bookie.
  - Michael: I thought the bookie had multiple sessions, based on an observation.
  - Matteo: a bookie should only have one session. Auto recovery might be adding extra ones. We might be able to improve the bookie by only having a watch on the `/ledgers` znode to cut off a bunch of the other traffic.
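The "fast check" Matteo describes, deciding from a recursive-watch notification whether this client actually cares, amounts to a path-prefix lookup against the set of registered paths. A minimal sketch of that idea (the class and method names here are made up for illustration; this is not BookKeeper's actual code):

```java
import java.util.Set;

// Illustrative only: filter notifications from a persistent recursive watch
// on "/" down to the paths this client registered interest in.
public class WatchFilter {
    private final Set<String> watchedRoots;

    public WatchFilter(Set<String> watchedRoots) {
        this.watchedRoots = watchedRoots;
    }

    // Fast check: is the notified path at or under any registered root?
    public boolean isRelevant(String notifiedPath) {
        for (String root : watchedRoots) {
            if (notifiedPath.equals(root) || notifiedPath.startsWith(root + "/")) {
                return true;
            }
        }
        return false;
    }
}
```

A bookie that only cares about its own lock znode would register something like `/ledgers/available/<bookie-id>` and discard everything else. The check itself is cheap, which is Matteo's point; Michael's point is that every notification still has to arrive over the wire before it can be discarded.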