[
https://issues.apache.org/jira/browse/BEAM-11076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17549039#comment-17549039
]
Danny McCormick commented on BEAM-11076:
----------------------------------------
This issue has been migrated to https://github.com/apache/beam/issues/20647
> Go Direct Runner Improvements
> -----------------------------
>
> Key: BEAM-11076
> URL: https://issues.apache.org/jira/browse/BEAM-11076
> Project: Beam
> Issue Type: Improvement
> Components: sdk-go
> Reporter: Robert Burke
> Priority: P3
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> The Go SDK has a simple direct runner intended for basic batch framework
> testing. That is, it's only suitable for the barest tests, and not that it
> ensures that the basics work for arbitrary pipelines.
> The runner has the following features:
> * Operates on the direct pipeline graph without marshalling through the beam
> protos.
> ** This prevents it from validating that the pipeline is valid for portable
> runners.
> * Executes the whole pipeline as a single bundle, on a single worker thread.
> "in process"
> ** This renders it only suitable for very small data sets, that likely
> operate in memory.
> * Doesn't marshal elements.
> ** While this avoids notionally unnecessary work, it's another reason why
> users will run into errors after using the direct runner to "validate" their
> pipeline before moving to Spark or Flink.
> Further, the runner hasn't been validated for beam semantics, nor have more
> complex features of the Beam Model been implemented or validated. This makes
> it unsuitable for more than it's current use for demoing the SDK in basic
> batch operation, and the light use it has testing the SDK itself.
> However, implementing full beam semantics for a runner, even without the
> distributed portion is a project in itself. It's part of the beam design that
> implementing the semantics for a beam runner to be more complicated on the
> runner side vs the SDK side.
> But there's no reason why we can't improve the Go Direct Runner to match all
> semantics required of beam for single machine contexts.
> In particular the various improvements below could be made (and should
> probably be sharded into separate sub task JIRAs as required):
> * Convert the Go Direct Runner to a "Go Portable Runner" instead, which
> means implementing the Job Management and FnApi protocols. This would
> ensure that all runners are operating the Go SDK workers in the same way, via
> the harness.
> ** This doesn't preclude "go awareness" for operating everthing in a single
> binary, or later re-optimizing to avoid serialization.
> * Allow the runner to execute "headless" (as a job submission server).
> * Allow the runner to execute more than a single bundle at once.
> ** Enabling better use of CPU cores in single execution mode.
> * Add loopback and docker execution mode support, in addition to the Go "in
> process" support it has.
> * Once the runner can execute portable pipelines done, it becomes possible
> to run the Python and Java Runner Validation Tests against the runner to
> validate all the features of the [Beam Programming
> Model|https://beam.apache.org/documentation/programming-guide]
> ** Each feature / TestSuite of which should be handled in separate JIRAs.
> ** Adding jenkins runs of those passing tests to ensure ongoing validation
> of the runner against the model.
>
> A good place to start is being able to run and execute pipelines on the
> Python Portable runner, which implements all beam semantics correctly.
> Instructions for doing so are on [Go Tips page in the Dev
> Wiki|https://cwiki.apache.org/confluence/display/BEAM/Go+Tips].
>
> Direct Runner Code:
> [https://github.com/apache/beam/tree/master/sdks/go/pkg/beam/runners/direct]
> SDK Harness Code:
> [https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/core/runtime/harness/harness.go]
>
>
--
This message was sent by Atlassian Jira
(v8.20.7#820007)