Re: [DISCUSS] New Git repo for Docker images, datasets/dumps, CI stuff

Stamatis Zampetakis Tue, 05 May 2026 06:08:37 -0700

Personally, I favor keeping CI/infra code under the main Hive repo.
The initiative for a separate repo stemmed from concerns raised during
the review of HIVE-28339, but that was a while ago so the situation might
be different now. Based on my personal preference and what has been
expressed so far I will reformulate the proposal.


Any code designated for Infra/CI can live in the main Hive repository
(possibly under a new module or directory). If there are no objections
in the following days, we can adopt this contribution model for
HIVE-29591, HIVE-27382, HIVE-28339 and any other related Jira.

For datasets, especially in light of HIVE-27382 and HIVE-26830, my
concrete proposal is to move
https://github.com/zabetak/hive-test-datasets under the Apache
namespace. In other words, create
https://github.com/apache/hive-test-datasets and publish the DB dumps
there. Since these are large, binary, non-source files that rarely
change, putting them under version control doesn't make much sense.
Therefore, I propose publishing them as release assets and fetching
them from there as illustrated in [1]. If there are no objections or
better ideas within the next 72 hours, I will proceed with creating
the new repo.
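
To make the release-asset idea concrete, here is a minimal sketch of how a
build script could fetch a dump. It assumes the proposed
apache/hive-test-datasets repo exists and relies on the URL layout GitHub
uses for release assets; the tag and asset names passed in are hypothetical
placeholders, not real releases.

```python
from urllib.parse import quote
from urllib.request import urlretrieve

# Assumption: the proposed repo; it does not exist until the proposal is adopted.
OWNER = "apache"
REPO = "hive-test-datasets"


def release_asset_url(owner: str, repo: str, tag: str, asset: str) -> str:
    """Build the download URL of a GitHub release asset."""
    return (
        f"https://github.com/{quote(owner)}/{quote(repo)}"
        f"/releases/download/{quote(tag)}/{quote(asset)}"
    )


def fetch_dump(tag: str, asset: str, dest: str) -> str:
    """Download one dump to dest and return the local path (needs network)."""
    path, _headers = urlretrieve(release_asset_url(OWNER, REPO, tag, asset), dest)
    return path
```

A Dockerfile or CI job would call something like fetch_dump with a pinned
tag, so the image build stays reproducible even if newer dumps are published
later.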

Best,
Stamatis

[1] https://github.com/zabetak/hive-postgres-metastore/blob/45ce9f3c28093069f0627adf7f8d9a9ec76299ef/Dockerfile#L3

On Tue, May 5, 2026 at 8:20 AM László Bodor <[email protected]> wrote:
>
> Hey team!
>
> Thanks, Stamatis, for initiating this thread. I hope we can go further this 
> time than last time.
>
>> 1. Are there objections to creating a new Git repo under the apache/hive 
>> namespace?
>>
>> 2. What name would you prefer?
>
>
> I can answer both at the same time. I prefer maintaining infra code in the 
> hive repo, especially as long as it is no more than a few files.
> This applies to what you were referring to as hive-ci. As I mentioned on 
> HIVE-29591, hive ci code basically is nothing more than a Dockerfile, 
> considering that originally, hive-dev-box covered way more than we actually 
> need. I'm ready to provide a vanilla precommit image for this purpose.
>
> Regarding hive-infra and hive-datasets, I don't have a strong opinion.
>
> I think hive-infra is also better kept in the hive repository. The only thing 
> we might want to take care of is not triggering a full pre-commit each time 
> infra code is pushed to the repo, because it won't test anything (since infra 
> code is not deployed to the GCP project in the PR scope).
>
> Regarding hive-datasets: I agree that huge raw data or dumps cannot be part 
> of the Hive repository, so a separate apache/hive-datasets would suffice; we 
> just need to mention it in our Docker README, and it's done :)
> https://github.com/apache/hive/blob/master/packaging/src/docker/README.md
>
>
> Regards,
> Laszlo Bodor
>
>
>
> On Mon, 4 May 2026 at 09:32, Stamatis Zampetakis <[email protected]> wrote:
>>
>> Hey team,
>>
>> Given the recent activity under HIVE-29590 [1], I would like to revive this 
>> discussion about creating a dedicated Git repository for ci/test/dataset 
>> related stuff. Our lack of reactivity on this topic makes our whole test/ci 
>> infrastructure depend on personal/user-specific repositories. This is not 
>> aligned with the ASF way and makes us depend too much on individual 
>> users/contributors, leading to a single point of failure.
>>
>> The lack of a dedicated repo blocked various useful contributions in the past 
>> (e.g., [2]) that became stale and eventually were closed without action.
>>
>> Summing up, I have two questions:
>> 1. Are there objections to creating a new Git repo under the apache/hive 
>> namespace?
>> 2. What name would you prefer?
>> * https://github.com/apache/hive-datasets
>> * https://github.com/apache/hive-ci
>> * https://github.com/apache/hive-infra
>>
>> At the moment, the main things we want to put there are everything under 
>> HIVE-29590, HIVE-26830, and HIVE-28339.
>>
>> Best,
>> Stamatis
>>
>> [1] https://issues.apache.org/jira/browse/HIVE-29590
>> [2] https://lists.apache.org/thread/4qb3z3yx9ovnxbsr4b02ohz6twlkrlx9
>>
>> On 2025/10/24 12:22:12 Stamatis Zampetakis wrote:
>> > Thanks for starting the discussion Thomas!
>> >
>> > In fact, I would go one step further: instead of storing the
>> > dumps/dockerfiles in personal git repositories such as [1], create
>> > an apache git repo for that purpose:
>> > https://github.com/apache/hive-datasets
>> > I know that git is not the perfect place to store large files but I
>> > feel that moving from a personal managed repo to a community managed
>> > repo is something worth doing.
>> > Subsequently, having also a corresponding namespace in Docker Hub
>> > makes sense to me.
>> >
>> > Best,
>> > Stamatis
>> >
>> > [1] https://github.com/zabetak/hive-postgres-metastore
>> >
>> > On Fri, Oct 24, 2025 at 12:10 PM Thomas Rebele <[email protected]> 
>> > wrote:
>> > >
>> > > Hi Hive community,
>> > >
>> > > I'm working on creating a docker image for a TPC-DS 30TB metastore with 
>> > > histogram statistics 
>> > > [HIVE-26830](https://issues.apache.org/jira/browse/HIVE-26830).
>> > >
>> > > The previous TPC-DS metastore docker images have been published at 
>> > > https://hub.docker.com/r/zabetak/postgres-tpcds-metastore. Stamatis 
>> > > suggested creating a repo under https://hub.docker.com/u/apache, maybe 
>> > > called "hive-dataset".
>> > >
>> > > What do you think about this approach?
>> > >
>> > > Best regards,
>> > > Thomas Rebele
>> > >
>> >
