Glad to see this moving forward. Keeping the lightweight CI-related work in
the Hive repository sounds good to me.

A small concern: for image-related changes, PR CI may still use the
existing image, so it may not fully validate the change itself. We could
consider using a temporarily built image for such PRs when defining the CI
workflow.
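
To make the idea concrete, here is a minimal sketch of the decision a CI
workflow could make. Everything named in it is an assumption for illustration
only: the path patterns, the image name "hive-ci" and the "pr-<number>" tag
scheme are hypothetical and do not reflect the current Hive setup.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns for image-related files; adjust to the real layout.
IMAGE_PATTERNS = ("Dockerfile", "*/Dockerfile", "ci/image/*")

# Hypothetical name of the image that PR CI uses today.
DEFAULT_IMAGE = "hive-ci:latest"


def touches_image(changed_files):
    """Return True if any changed path matches an image-related pattern."""
    return any(
        fnmatchcase(path, pattern)
        for path in changed_files
        for pattern in IMAGE_PATTERNS
    )


def select_image(changed_files, pr_number):
    """Return (image_tag, needs_build) for a PR.

    PRs that change image-related files get a temporary, PR-specific tag that
    has to be built first; all other PRs reuse the existing image.
    """
    if touches_image(changed_files):
        return f"hive-ci:pr-{pr_number}", True
    return DEFAULT_IMAGE, False
```

With this, a PR that only touches Java sources keeps using the existing image,
while a PR that edits a Dockerfile is validated against its own temporary
build.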

Best regards,
Wechar Yu


On Tue, May 5, 2026 at 9:08 PM Stamatis Zampetakis <[email protected]>
wrote:

> Personally, I favor keeping CI/infra code under the main Hive repo.
> The initiative for a separate repo stemmed from concerns raised during
> the review of HIVE-28339, but that was a while ago so the situation might
> be different now. Based on my personal preference and what has been
> expressed so far I will reformulate the proposal.
>
> Any code designated for Infra/CI can live in the main Hive repository
> (possibly under a new module or directory). If there are no objections
> in the following days, we can adopt this contribution model for
> HIVE-29591, HIVE-27382, HIVE-28339 and any other related Jira.
>
> For datasets, especially in light of HIVE-27382 and HIVE-26830, my
> concrete proposal is to move
> https://github.com/zabetak/hive-test-datasets under the Apache
> namespace. In other words, create
> https://github.com/apache/hive-test-datasets and publish the DB dumps
> there. Since these are large, binary, non-source files that rarely
> change, putting them under version control doesn't make much sense.
> Therefore, I propose publishing them as release assets and fetching
> them from there as illustrated in [1]. If there are no objections or
> better ideas within the next 72 hours, I will proceed with creating
> the new repo.
>
> Best,
> Stamatis
>
> [1]
> https://github.com/zabetak/hive-postgres-metastore/blob/45ce9f3c28093069f0627adf7f8d9a9ec76299ef/Dockerfile#L3
>
> On Tue, May 5, 2026 at 8:20 AM László Bodor <[email protected]>
> wrote:
> >
> > Hey team!
> >
> > Thanks, Stamatis, for initiating this thread. I hope we can go further
> this time than last time.
> >
> >> 1. Are there objections to creating a new Git repo under the
> apache/hive namespace?
> >>
> >> 2. What name would you prefer?
> >
> >
> > I can answer both at the same time. I prefer maintaining infra code in
> the hive repo, especially as long as it is no more than a few files.
> > This applies to what you were referring to as hive-ci. As I mentioned on
> HIVE-29591, hive ci code basically is nothing more than a Dockerfile,
> considering that originally, hive-dev-box covered way more than we
> actually need. I'm ready to provide a vanilla precommit image for this
> purpose.
> >
> > Regarding: hive-infra, hive-datasets, I don't have a strong opinion.
> >
> > I think hive-infra is also better kept in the hive repository. The only
> thing we might want to take care of is not triggering a full pre-commit
> each time infra code is pushed to the repo, because it won't test anything
> (since infra code is not deployed to the GCP project in the PR scope).
> >
> > Regarding hive-datasets: I agree that huge raw data or dumps cannot be
> part of the Hive repository, so a separate apache/hive-datasets would
> suffice, we need to just mention it in our Docker README, and it's done :)
> >
> https://github.com/apache/hive/blob/master/packaging/src/docker/README.md
> >
> >
> > Regards,
> > Laszlo Bodor
> >
> >
> >
> > On Mon, 4 May 2026 at 09:32, Stamatis Zampetakis <[email protected]>
> wrote:
> >>
> >> Hey team,
> >>
> >> Given the recent activity under HIVE-29590 [1], I would like to revive
> this discussion about creating a dedicated Git repository for
> ci/test/dataset related stuff. Our lack of reactivity on this topic makes
> our whole test/ci infrastructure depend on personal/user specific
> repositories. This is not aligned with the ASF way and makes us depend
> too much on individual users/contributors leading to a single point of
> failure.
> >>
> >> The lack of dedicated repo blocked various useful contributions in the
> past (e.g., [2]) that became stale and eventually were closed without
> action.
> >>
> >> Summing up I have two questions:
> >> 1. Are there objections to creating a new Git repo under the
> apache/hive namespace?
> >> 2. What name would you prefer?
> >> * https://github.com/apache/hive-datasets
> >> * https://github.com/apache/hive-ci
> >> * https://github.com/apache/hive-infra
> >>
> >> At the moment, the main things that we want to put there are everything
> under HIVE-29590, HIVE-26830, and HIVE-28339.
> >>
> >> Best,
> >> Stamatis
> >>
> >> [1] https://issues.apache.org/jira/browse/HIVE-29590
> >> [2] https://lists.apache.org/thread/4qb3z3yx9ovnxbsr4b02ohz6twlkrlx9
> >>
> >> On 2025/10/24 12:22:12 Stamatis Zampetakis wrote:
> >> > Thanks for starting the discussion Thomas!
> >> >
> >> > In fact, I would go one step further: instead of storing the
> >> > dumps/dockerfiles in personal git repositories such as [1], create
> >> > an apache git repo for that purpose:
> >> > https://github.com/apache/hive-datasets
> >> > I know that git is not the perfect place to store large files but I
> >> > feel that moving from a personal managed repo to a community managed
> >> > repo is something worth doing.
> >> > Subsequently, having also a corresponding namespace in Docker Hub
> >> > makes sense to me.
> >> >
> >> > Best,
> >> > Stamatis
> >> >
> >> > [1] https://github.com/zabetak/hive-postgres-metastore
> >> >
> >> > On Fri, Oct 24, 2025 at 12:10 PM Thomas Rebele <
> [email protected]> wrote:
> >> > >
> >> > > Hi Hive community,
> >> > >
> >> > > I'm working on creating a docker image for a TPC-DS 30TB metastore
> with histogram statistics [HIVE-26830](
> https://issues.apache.org/jira/browse/HIVE-26830).
> >> > >
> >> > > The previous TPC-DS metastore docker images have been published at
> https://hub.docker.com/r/zabetak/postgres-tpcds-metastore. Stamatis
> suggested creating a repo under https://hub.docker.com/u/apache, maybe
> called "hive-dataset".
> >> > >
> >> > > What do you think about this approach?
> >> > >
> >> > > Best regards,
> >> > > Thomas Rebele
> >> > >
> >> >
>
