[
https://issues.apache.org/jira/browse/HADOOP-19858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18074714#comment-18074714
]
ASF GitHub Bot commented on HADOOP-19858:
-----------------------------------------
pan3793 commented on PR #8412:
URL: https://github.com/apache/hadoop/pull/8412#issuecomment-4276656967
> 2. Spark example blocks some of their workflows from running on fork repos
with conditionals [like this
one](https://github.com/apache/spark/blob/a2efe5fdcf9767ee8f113f7487de1a8092d2e7c6/.github/workflows/build_non_ansi.yml#L33):
`if github.repository == 'apache/spark'`.
>
> This PR is missing 2, right?
@ajfabbri yes, but this is unrelated to security; the intention here is to
save resources. As I remember, back then forked repos also ran cron jobs
automatically by default. Note that contributors have full control over their
forked repos, so they can remove the `if github.repository == 'apache/spark'`
condition there if they want.
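For illustration, a repository guard of this kind can be sketched as below.
This is a minimal sketch, not Spark's actual workflow: the workflow name, the
cron time, and the build step are assumptions; only the `if` condition is the
one quoted above.

```yaml
# Hypothetical scheduled workflow; name, cron time, and build step are illustrative.
name: nightly-build
on:
  schedule:
    - cron: '0 4 * * *'
jobs:
  build:
    # Skip the job unless the workflow runs in the upstream repository,
    # so forks do not spend runner time on cron-triggered builds.
    if: github.repository == 'apache/spark'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: echo "build goes here"
```

A fork owner who does want the job can delete or change the `if` line in their
own copy, which is why the guard saves resources but is not a security control.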
Spark has a lot of profiles (java/scala/python/arrow/pandas versions,
ANSI/non-ANSI, sbt/maven, etc.). The full combination produces a large matrix,
so it runs only a subset of it on PR and push, and schedules daily jobs on the
`apache/spark` repo for the other combinations (you can find the status of
those jobs in the README.md).
> Spark uses these workflows from privileged context on the official repo
(workflow "cron" triggers that can't be triggered by forks) as well as allowing
them to run on forks. The thing I dislike about their example is it lacks clear
separation of privilege. I think we could do better on that.
I'm not sure what your definition of "privileged" is. I think it's a normal
use of GHA: the same workflow can be triggered by different events from
different contexts. We just need to be careful with one case: a workflow that
is triggered in the upstream context and consumes untrusted code from a PR.
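To make that case concrete, here is a hedged sketch of the safer shape, under
the assumption that the risky combination meant here is the
`pull_request_target` trigger plus a checkout of the PR head. The workflow
name and the build step are hypothetical.

```yaml
# Hypothetical PR workflow; name and build step are illustrative.
name: pr-build
on:
  pull_request:   # for PRs from forks, runs without repository secrets by default
permissions:
  contents: read  # keep the job token read-only
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      # Checks out the PR code, which is untrusted but runs in an
      # unprivileged context here.
      - uses: actions/checkout@v4
      - run: echo "build and test here"
# The combination to be careful with: `on: pull_request_target` (runs in the
# upstream context) together with a checkout of
# `github.event.pull_request.head.sha`, which executes untrusted PR code there.
```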
> Are you thinking that we 1. build the hadoop-build (dev-support)
container. 2. run maven build in that container. 3. build pre-installed images
(like hadoop-trunk, hadoop-,
hadoop-3.5, etc.) 4. run
tests on those installed images?
I don't think this is something we want to do for unit tests. Generally,
testing requires more dev dependencies, which might not be required at
runtime. For example, in Maven we can define dependencies in compile, runtime,
and test scopes; when running UTs, it pulls the runtime and test scope deps
into the classpath. Similar things apply to native libs.
TBH, I haven't seen many production use cases that deploy Hadoop in
containers. The current "pre-installed images" are mostly used by downstream
projects for testing, and they only cover a few cases. At least kerberized
YARN is unlikely to work: for Hadoop 3.4.x, the official Hadoop binary tgz was
built on Ubuntu 20.04 with OpenSSL 1.x, while the pre-installed images are
based on Ubuntu 22.04, which only has OpenSSL 3.x in the apt repo. IIRC this
causes the kerberized Linux container to fail to start.
Building some smoke/integration tests like [Spark
kubernetes/integration-tests](https://github.com/apache/spark/blob/branch-4.1/resource-managers/kubernetes/integration-tests/README.md)
based on the "pre-installed images" might be a good supplement in the future,
but this is out of scope for the current goals.
> Set up build workflow in GitHub Actions
> ---------------------------------------
>
> Key: HADOOP-19858
> URL: https://issues.apache.org/jira/browse/HADOOP-19858
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: build
> Reporter: Cheng Pan
> Priority: Major
> Labels: pull-request-available
>