PDGGK opened a new pull request, #37298:
URL: https://github.com/apache/beam/pull/37298
## What changes are being proposed in this pull request?
This PR addresses issue #37209 by significantly improving error messages
when user code fails to serialize (pickle) for distributed execution.
## Why are these changes needed?
Currently, when users pass non-serializable lambdas or closures (e.g.,
capturing a file handle or database connection), they get cryptic low-level
errors like:
```
RuntimeError: Unable to pickle fn <function>: <PicklingError>
```
This doesn't explain:
- **Why** serialization is required
- **What** commonly causes the error
- **How** to fix it
This is especially frustrating for new Apache Beam users who may not yet
know that user code is serialized before it is sent to workers.
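As a minimal illustration of the failure mode, the sketch below uses the standard-library `pickle` module as a stand-in for Beam's pickler (Beam itself uses cloudpickle/dill, which may behave differently in detail). The helper name `describe_pickle_failure` is hypothetical and exists only for this example.

```python
import functools
import os
import pickle


def describe_pickle_failure(fn):
    """Return the name of the exception raised when pickling fn, or None."""
    try:
        pickle.dumps(fn)
    except Exception as exc:  # stdlib pickle raises TypeError for file handles
        return type(exc).__name__
    return None


# A callable that captures an open file handle cannot be pickled.
with open(os.devnull, "w") as handle:
    captures_handle = functools.partial(print, file=handle)
    failure = describe_pickle_failure(captures_handle)

# The same callable bound only to plain data pickles without error.
plain_data_only = functools.partial(print, "hello")
success = describe_pickle_failure(plain_data_only)

print(failure, success)
```

The low-level exception says only that some object could not be pickled; it does not say that the callable had to be serialized in the first place, which is the gap this PR targets.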
## Changes made:
### 1. Enhanced error message (`ptransform.py`)
The new error message includes:
- **Clear explanation**: "User code must be serializable (picklable) for
distributed execution"
- **Common causes**: "This usually happens when lambdas or closures capture
non-serializable objects like file handles, database connections, or thread
locks"
- **Concrete fixes**:
  1. Using module-level functions instead of lambdas
  2. Initializing resources in `setup()` methods
  3. Checking what your closure captures
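For the third fix, one way to check what a closure captures is sketched below. It is not part of this PR: `unpicklable_captures` and `make_formatter` are hypothetical names, and standard-library `pickle` is used as an approximation of what cloudpickle/dill would reject.

```python
import inspect
import pickle
import threading


def unpicklable_captures(fn):
    """Return the names of fn's closure variables that stdlib pickle rejects."""
    captured = inspect.getclosurevars(fn).nonlocals
    bad = []
    for name, value in captured.items():
        try:
            pickle.dumps(value)
        except Exception:
            bad.append(name)
    return sorted(bad)


def make_formatter():
    lock = threading.Lock()  # not picklable
    prefix = "row: "         # plain data, picklable

    def fmt(x):
        with lock:
            return prefix + x

    return fmt


problem_names = unpicklable_captures(make_formatter())
print(problem_names)
```

Once the offending variable is identified, it can be created inside the function or in a `setup()` method instead of being captured.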
### 2. Broader exception handling
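Changed from catching only `RuntimeError` to `(RuntimeError, TypeError,
Exception)` because:
- cloudpickle/dill can raise `TypeError` or `PicklingError`, not only
`RuntimeError`
- Ensures the helpful message appears for all pickling failures

Since `Exception` is a base class of the other two, the tuple is equivalent
to catching `Exception` alone.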
### 3. Exception chaining
Added `from e` to preserve the original exception context and stack trace
for debugging.
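The combination of broader handling and chaining can be sketched roughly as follows. This is not the actual `ptransform.py` code: `dumps_with_guidance` and `GUIDANCE` are hypothetical names, and standard-library `pickle` stands in for Beam's pickler.

```python
import functools
import pickle
import threading

GUIDANCE = (
    "User code must be serializable (picklable) for distributed execution. "
    "This usually happens when lambdas or closures capture non-serializable "
    "objects like file handles, database connections, or thread locks."
)


def dumps_with_guidance(fn):
    """Pickle fn, or raise a RuntimeError that explains the likely cause."""
    try:
        return pickle.dumps(fn)
    except Exception as e:  # also covers RuntimeError, TypeError, PicklingError
        # "from e" keeps the original exception available as __cause__.
        raise RuntimeError(f"Unable to pickle fn {fn!r}: {e}. {GUIDANCE}") from e


try:
    dumps_with_guidance(functools.partial(print, threading.Lock()))
except RuntimeError as err:
    original = err.__cause__
    message = str(err)

print(type(original).__name__)
print(message)
```

Callers that already catch `RuntimeError` keep working, while the original low-level exception stays reachable through `__cause__` and appears in the traceback.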
### 4. Test coverage
Added `test_callable_non_serializable_error_message()` to verify:
- The error is raised correctly
- The new guidance text appears in the message
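A test along these lines might look like the sketch below. It is an assumption about the shape of the test, not the code added in this PR: `pickle_user_fn` is a hypothetical stand-in for the code under test, and it uses standard-library `pickle` so the example runs without Beam installed.

```python
import functools
import pickle
import threading
import unittest


def pickle_user_fn(fn):
    """Stand-in for the code under test (hypothetical)."""
    try:
        return pickle.dumps(fn)
    except Exception as e:
        raise RuntimeError(
            f"Unable to pickle fn {fn!r}: {e}. "
            "User code must be serializable (picklable) for distributed "
            "execution."
        ) from e


class NonSerializableErrorMessageTest(unittest.TestCase):
    def test_callable_non_serializable_error_message(self):
        fn = functools.partial(print, threading.Lock())
        with self.assertRaises(RuntimeError) as ctx:
            pickle_user_fn(fn)
        message = str(ctx.exception)
        # The original prefix is kept and the new guidance is appended.
        self.assertIn("Unable to pickle fn", message)
        self.assertIn("must be serializable", message)
        # The low-level error is preserved through exception chaining.
        self.assertIsNotNone(ctx.exception.__cause__)
```

Saved to a file, this can be run with `python -m unittest <file>`.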
## Testing
- ✅ **202 tests passed** in `ptransform_test.py`
- ✅ New test explicitly verifies the error message content
- ✅ Manual testing with non-serializable closures confirms the improved
message
## Impact
- **Developer Experience**: Should reduce debugging time for serialization
issues, since the message now states the cause and suggests fixes
- **Stability**: No change to execution logic; pure diagnostic improvement
- **Compatibility**: No impact on existing pipelines (still raises
`RuntimeError`)
## Example
**Before:**
```
RuntimeError: Unable to pickle fn <function>: cannot serialize <_io.TextIOWrapper>
```
**After:**
```
RuntimeError: Unable to pickle fn <function>: cannot serialize <_io.TextIOWrapper>.
User code must be serializable (picklable) for distributed execution.
This usually happens when lambdas or closures capture non-serializable objects
like file handles, database connections, or thread locks. Try: (1) using
module-level functions instead of lambdas, (2) initializing resources in
setup() methods, (3) checking what your closure captures.
```
Fixes #37209