damccorm commented on code in PR #26236:
URL: https://github.com/apache/beam/pull/26236#discussion_r1164458088


##########
website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md:
##########
@@ -141,3 +141,12 @@ However, it may be possible to pre-build the SDK containers and perform the depe
 Dataflow, see [Pre-building the python SDK custom container image with extra dependencies](https://cloud.google.com/dataflow/docs/guides/using-custom-containers#prebuild).
 
 **NOTE**: This feature is available only for the `Dataflow Runner v2`.
+
+## Pickling and Managing Main Session
+
+Pickling in the Python SDK is set up to pickle the state of the global namespace. By default, global imports, functions, and variables defined in the main session are not saved during the serialization of a Dataflow job.
+Thus, one might encounter an unexpected `NameError` when running a `DoFn` on Dataflow Runner. To resolve this, manage the main session by
+simply setting `--save_main_session=True`. This will load the pickled state of the global namespace onto the Dataflow workers.
+For more information, see [Handling NameErrors](https://cloud.google.com/dataflow/docs/guides/common-errors#how-do-i-handle-nameerrors).
+
+**NOTE**: This strictly applies to the `Python SDK executing with the dill pickler on the Dataflow Runner`.

Review Comment:
   Also, I'd make it clear that the dill pickler is the default.
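   As an aside on why the pickler choice matters here: the behavior the flag papers over can be sketched with the stdlib `pickle` module alone. This is a rough simulation, not Beam's actual implementation; the `save_main_session` function, the `FACTOR` variable, and the hand-built namespace dict below are all illustrative.

   ```python
   import math
   import pickle

   # Simulated "main session": bindings a DoFn running remotely might need.
   FACTOR = 3  # a global variable; `math` above is a global import

   def save_main_session(namespace):
       """Pickle every binding the pickler accepts, skipping the rest."""
       saved = {}
       for name, value in namespace.items():
           try:
               saved[name] = pickle.dumps(value)
           except Exception:
               pass  # stdlib pickle gives up here; dill handles more cases
       return saved

   saved = save_main_session({"math": math, "FACTOR": FACTOR})

   # Plain data round-trips fine...
   assert pickle.loads(saved["FACTOR"]) == 3
   # ...but the stdlib pickler cannot serialize a module object at all,
   # which is part of why this feature is tied to the dill pickler.
   assert "math" not in saved
   ```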



##########
website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md:
##########
@@ -141,3 +141,12 @@ However, it may be possible to pre-build the SDK containers and perform the depe
 Dataflow, see [Pre-building the python SDK custom container image with extra dependencies](https://cloud.google.com/dataflow/docs/guides/using-custom-containers#prebuild).
 
 **NOTE**: This feature is available only for the `Dataflow Runner v2`.
+
+## Pickling and Managing Main Session
+
+Pickling in the Python SDK is set up to pickle the state of the global namespace. By default, global imports, functions, and variables defined in the main session are not saved during the serialization of a Dataflow job.
+Thus, one might encounter an unexpected `NameError` when running a `DoFn` on Dataflow Runner. To resolve this, manage the main session by
+simply setting `--save_main_session=True`. This will load the pickled state of the global namespace onto the Dataflow workers.
+For more information, see [Handling NameErrors](https://cloud.google.com/dataflow/docs/guides/common-errors#how-do-i-handle-nameerrors).
+
+**NOTE**: This strictly applies to the `Python SDK executing with the dill pickler on the Dataflow Runner`.

Review Comment:
   Doesn't this at least apply to all remote runners using portability (not just Dataflow)?
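   The mechanism behind the `NameError` on any remote worker can be shown with stdlib `pickle` alone: module-level functions are serialized *by reference* (module name plus qualified name), not by value, so a worker process that never executed the main module has nothing to resolve the reference against. A minimal illustration; the `transform` function is made up for the example.

   ```python
   import pickle

   # A module-level function: the pickled payload records only where to
   # find it (its module and qualified name), not the function body.
   def transform(x):
       return x * 2

   payload = pickle.dumps(transform)

   # The name reference is in the payload...
   assert b"transform" in payload
   # ...but the implementation is not, so a fresh worker process that
   # never defined `transform` cannot reconstruct it from this alone.
   assert b"x * 2" not in payload
   ```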



##########
website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md:
##########
@@ -141,3 +141,12 @@ However, it may be possible to pre-build the SDK containers and perform the depe
 Dataflow, see [Pre-building the python SDK custom container image with extra dependencies](https://cloud.google.com/dataflow/docs/guides/using-custom-containers#prebuild).
 
 **NOTE**: This feature is available only for the `Dataflow Runner v2`.
+
+## Pickling and Managing Main Session

Review Comment:
   If we're going to use Dataflow specific language here, we should specifically call that out in the section heading. I think this applies to other remote runners though.



##########
website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md:
##########
@@ -141,3 +141,12 @@ However, it may be possible to pre-build the SDK containers and perform the depe
 Dataflow, see [Pre-building the python SDK custom container image with extra dependencies](https://cloud.google.com/dataflow/docs/guides/using-custom-containers#prebuild).
 
 **NOTE**: This feature is available only for the `Dataflow Runner v2`.
+
+## Pickling and Managing Main Session
+
+Pickling in the Python SDK is set up to pickle the state of the global namespace. By default, global imports, functions, and variables defined in the main session are not saved during the serialization of a Dataflow job.
+Thus, one might encounter an unexpected `NameError` when running a `DoFn` on Dataflow Runner. To resolve this, manage the main session by
+simply setting `--save_main_session=True`. This will load the pickled state of the global namespace onto the Dataflow workers.
+For more information, see [Handling NameErrors](https://cloud.google.com/dataflow/docs/guides/common-errors#how-do-i-handle-nameerrors).
+
+**NOTE**: This strictly applies to the `Python SDK executing with the dill pickler on the Dataflow Runner`.

Review Comment:
   For example, we use it in our flink portability tests - https://github.com/apache/beam/blob/326373715e0ca071d610a03e92626b1957253f81/runners/portability/test_flink_uber_jar.sh#L24



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
