Personally, I favor keeping CI/infra code under the main Hive repo. The initiative for a separate repo stemmed from concerns raised during the review of HIVE-28339, but that was a while ago, so the situation might be different now. Based on my personal preference and what has been expressed so far, I will reformulate the proposal.

Any code designated for Infra/CI can live in the main Hive repository (possibly under a new module or directory). If there are no objections in the following days, we can adopt this contribution model for HIVE-29591, HIVE-27382, HIVE-28339 and any other related Jira.

For datasets, especially in light of HIVE-27382 and HIVE-26830, my concrete proposal is to move https://github.com/zabetak/hive-test-datasets under the Apache namespace. In other words, create https://github.com/apache/hive-test-datasets and publish the DB dumps there. Since these are large, binary, non-source files that rarely change, putting them under version control doesn't make much sense. Therefore, I propose publishing them as release assets and fetching them from there as illustrated in [1].

If there are no objections or better ideas within the next 72 hours, I will proceed with creating the new repo.

Best,
Stamatis

[1] https://github.com/zabetak/hive-postgres-metastore/blob/45ce9f3c28093069f0627adf7f8d9a9ec76299ef/Dockerfile#L3

On Tue, May 5, 2026 at 8:20 AM László Bodor <[email protected]> wrote:
>
> Hey team!
>
> Thanks, Stamatis, for initiating this thread. I hope we can go further this
> time than last time.
>
>> 1. Are there objections in creating a new Git repo under the apache/hive
>> namespace?
>>
>> 2. What name would you prefer?
>
> I can answer both at the same time. I prefer maintaining infra code in the
> hive repo, especially as long as it is no more than a few files.
> This applies to what you were referring to as hive-ci. As I mentioned on
> HIVE-29591, hive ci code is basically nothing more than a Dockerfile,
> considering that originally, hive-dev-box covered way more than we actually
> need. I'm ready to provide a vanilla precommit image for this purpose.
>
> Regarding hive-infra and hive-datasets, I don't have a strong opinion.
>
> I think hive-infra is also better kept in the hive repository. The only thing
> we might want to take care of is not triggering a full pre-commit each time
> infra code is pushed to the repo, because it won't test anything (since infra
> code is not deployed to the GCP project in the PR scope).
>
> Regarding hive-datasets: I agree that huge raw data or dumps cannot be part
> of the Hive repository, so a separate apache/hive-datasets would suffice; we
> just need to mention it in our Docker README, and it's done :)
> https://github.com/apache/hive/blob/master/packaging/src/docker/README.md
>
> Regards,
> Laszlo Bodor
>
> On Mon, 4 May 2026 at 09:32, Stamatis Zampetakis <[email protected]> wrote:
>>
>> Hey team,
>>
>> Given the recent activity under HIVE-29590 [1], I would like to revive this
>> discussion about creating a dedicated Git repository for ci/test/dataset
>> related stuff. Our lack of reactivity on this topic makes our whole test/ci
>> infrastructure depend on personal/user specific repositories. This is not
>> aligned with the ASF way and makes us depend too much on individual
>> users/contributors, leading to a single point of failure.
>>
>> The lack of a dedicated repo blocked various useful contributions in the past
>> (e.g., [2]) that became stale and eventually were closed without action.
>>
>> Summing up, I have two questions:
>> 1. Are there objections in creating a new Git repo under the apache/hive
>> namespace?
>> 2. What name would you prefer?
>> * https://github.com/apache/hive-datasets
>> * https://github.com/apache/hive-ci
>> * https://github.com/apache/hive-infra
>>
>> At the moment, the main things that we want to put there are everything under
>> HIVE-29590, HIVE-26830, and HIVE-28339.
>>
>> Best,
>> Stamatis
>>
>> [1] https://issues.apache.org/jira/browse/HIVE-29590
>> [2] https://lists.apache.org/thread/4qb3z3yx9ovnxbsr4b02ohz6twlkrlx9
>>
>> On 2025/10/24 12:22:12 Stamatis Zampetakis wrote:
>> > Thanks for starting the discussion Thomas!
>> >
>> > In fact, I would go one step further and, instead of storing the
>> > dumps/dockerfiles in personal git repositories such as [1], create
>> > an apache git repo for that purpose:
>> > https://github.com/apache/hive-datasets
>> > I know that git is not the perfect place to store large files but I
>> > feel that moving from a personally managed repo to a community managed
>> > repo is something worth doing.
>> > Subsequently, having also a corresponding namespace in Docker Hub
>> > makes sense to me.
>> >
>> > Best,
>> > Stamatis
>> >
>> > [1] https://github.com/zabetak/hive-postgres-metastore
>> >
>> > On Fri, Oct 24, 2025 at 12:10 PM Thomas Rebele <[email protected]>
>> > wrote:
>> > >
>> > > Hi Hive community,
>> > >
>> > > I'm working on creating a docker image for a TPC-DS 30TB metastore with
>> > > histogram statistics
>> > > [HIVE-26830](https://issues.apache.org/jira/browse/HIVE-26830).
>> > >
>> > > The previous TPC-DS metastore docker images have been published at
>> > > https://hub.docker.com/r/zabetak/postgres-tpcds-metastore. Stamatis
>> > > suggested creating a repo under https://hub.docker.com/u/apache, maybe
>> > > called "hive-dataset".
>> > >
>> > > What do you think about this approach?
>> > >
>> > > Best regards,
>> > > Thomas Rebele
>> > >
