Glad to see this moving forward. Keeping the lightweight CI-related work in the Hive repository sounds good to me.
A small concern: for image-related changes, PR CI may still use the existing image, so it may not fully validate the change itself. We could consider using a temporarily built image for such PRs when defining the CI workflow.

Best regards,
Wechar Yu

On Tue, May 5, 2026 at 9:08 PM Stamatis Zampetakis <[email protected]> wrote:

> Personally, I favor keeping CI/infra code under the main Hive repo.
> The initiative for a separate repo stemmed from concerns raised during
> the review of HIVE-28339 but that was a while ago so the situation might
> be different now. Based on my personal preference and what has been
> expressed so far I will reformulate the proposal.
>
> Any code designated for Infra/CI can live in the main Hive repository
> (possibly under a new module or directory). If there are no objections
> in the following days, we can adopt this contribution model for
> HIVE-29591, HIVE-27382, HIVE-28339 and any other related Jira.
>
> For datasets, especially in light of HIVE-27382 and HIVE-26830, my
> concrete proposal is to move
> https://github.com/zabetak/hive-test-datasets under the Apache
> namespace. In other words, create
> https://github.com/apache/hive-test-datasets and publish the DB dumps
> there. Since these are large, binary, non-source files that rarely
> change, putting them under version control doesn't make much sense.
> Therefore, I propose publishing them as release assets and fetching
> them from there as illustrated in [1]. If there are no objections or
> better ideas within the next 72 hours, I will proceed with creating
> the new repo.
>
> Best,
> Stamatis
>
> [1] https://github.com/zabetak/hive-postgres-metastore/blob/45ce9f3c28093069f0627adf7f8d9a9ec76299ef/Dockerfile#L3
>
> On Tue, May 5, 2026 at 8:20 AM László Bodor <[email protected]> wrote:
> >
> > Hey team!
> >
> > Thanks, Stamatis, for initiating this thread. I hope we can go further this time than last time.
> >
> >> 1. Are there objections in creating a new Git repo under the apache/hive namespace?
> >>
> >> 2. What name would you prefer?
> >
> >
> > I can answer both at the same time. I prefer maintaining infra code in the hive repo, especially as long as it is no more than a few files.
> > This applies to what you were referring to as hive-ci. As I mentioned on HIVE-29591, hive ci code basically is nothing more than a Dockerfile, considering that originally, hive-dev-box covered way more than we actually need. I'm ready to provide a vanilla precommit image for this purpose.
> >
> > Regarding: hive-infra, hive-datasets, I don't have a strong opinion.
> >
> > I think hive-infra is also better kept in the hive repository. The only thing we might want to take care of is not triggering a full pre-commit each time infra code is pushed to the repo, because it won't test anything (since infra code is not deployed to the GCP project in the PR scope).
> >
> > Regarding hive-datasets: I agree that huge raw data or dumps cannot be part of the Hive repository, so a separate apache/hive-datasets would suffice, we need to just mention it in our Docker README, and it's done :)
> > > https://github.com/apache/hive/blob/master/packaging/src/docker/README.md
> >
> >
> > Regards,
> > Laszlo Bodor
> >
> >
> >
> > On Mon, 4 May 2026 at 09:32, Stamatis Zampetakis <[email protected]> wrote:
> >>
> >> Hey team,
> >>
> >> Given the recent activity under HIVE-29590 [1], I would like to revive this discussion about creating a dedicated Git repository for ci/test/dataset related stuff. Our lack of reactivity on this topic makes our whole test/ci infrastructure depend on personal/user specific repositories. This is not aligned with the ASF way and makes us depend too much on individual users/contributors leading to a single point of failure.
> >>
> >> The lack of dedicated repo blocked various useful contributions in the past (e.g., [2]) that became stale and eventually were closed without action.
> >>
> >> Summing up I have two questions:
> >> 1. Are there objections in creating a new Git repo under the apache/hive namespace?
> >> 2. What name would you prefer?
> >> * https://github.com/apache/hive-datasets
> >> * https://github.com/apache/hive-ci
> >> * https://github.com/apache/hive-infra
> >>
> >> At the moment the main thing that we want to put there is everything under HIVE-29590, HIVE-26830, and HIVE-28339.
> >>
> >> Best,
> >> Stamatis
> >>
> >> [1] https://issues.apache.org/jira/browse/HIVE-29590
> >> [2] https://lists.apache.org/thread/4qb3z3yx9ovnxbsr4b02ohz6twlkrlx9
> >>
> >> On 2025/10/24 12:22:12 Stamatis Zampetakis wrote:
> >> > Thanks for starting the discussion Thomas!
> >> >
> >> > In fact, I would go one step further and instead of storing the
> >> > dumps/dockerfiles in personal git repositories such as [1] to create
> >> > an apache git repo for that purpose:
> >> > https://github.com/apache/hive-datasets
> >> > I know that git is not the perfect place to store large files but I
> >> > feel that moving from a personal managed repo to a community managed
> >> > repo is something worth doing.
> >> > Subsequently, having also a corresponding namespace in Docker Hub
> >> > makes sense to me.
> >> >
> >> > Best,
> >> > Stamatis
> >> >
> >> > [1] https://github.com/zabetak/hive-postgres-metastore
> >> >
> >> > On Fri, Oct 24, 2025 at 12:10 PM Thomas Rebele <[email protected]> wrote:
> >> > >
> >> > > Hi Hive community,
> >> > >
> >> > > I'm working on creating a docker image for a TPC-DS 30TB metastore with histogram statistics [HIVE-26830](https://issues.apache.org/jira/browse/HIVE-26830).
> >> > >
> >> > > The previous TPC-DS metastore docker images have been published at https://hub.docker.com/r/zabetak/postgres-tpcds-metastore. Stamatis suggested to create a repo under https://hub.docker.com/u/apache, maybe called "hive-dataset".
> >> > >
> >> > > What do you think about this approach?
> >> > >
> >> > > Best regards,
> >> > > Thomas Rebele
> >> > >
