edespino commented on issue #1404:
URL: https://github.com/apache/cloudberry/issues/1404#issuecomment-3419282843
## Investigation Summary: gprecoverseg PPMC Notice Parsing Error
### Bug Report
A user reported a `gprecoverseg` failure with the following error:
```
20251017:10:58:21:792595 gprecoverseg:cbdb01:gpadmin-[CRITICAL]:-gprecoverseg failed.
(Reason='invalid literal for int() with base 10: "#
--------------------------------------------------------------------\n
NOTICE from the Apache Cloudberry PPMC\n#
--------------------------------------------------------------------\n
This file u') exiting...
```
The error indicates that `gprecoverseg` attempted to parse a string
containing the Apache Cloudberry PPMC notice header as an integer, causing a
Python `ValueError`.
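
The failure mode can be reproduced in isolation. The following is a minimal
sketch, not code from `gprecoverseg`: the `contaminated` string is an
abbreviated stand-in for the file content (only the marker line is taken from
the error message), and `try_parse` is a hypothetical helper.

```python
# Abbreviated stand-in for contaminated content. Only the marker line comes
# from the error message; the rest of the notice is omitted.
contaminated = "# ----\n# NOTICE from the Apache Cloudberry PPMC\n# ----\n5432\n"


def try_parse(text):
    """Return (value, None) on success or (None, error message) on failure."""
    try:
        return int(text), None
    except ValueError as exc:
        return None, str(exc)


value, error = try_parse(contaminated)
clean_value, clean_error = try_parse("5432\n")

print(error)        # begins with "invalid literal for int() with base 10"
print(clean_value)  # int() tolerates surrounding whitespace
```

Note that `int()` accepts surrounding whitespace, so a trailing newline alone
would not trigger the error; the banner text is what breaks the parse.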
---
### Investigation Process
We conducted systematic testing to identify the root cause:
#### ✅ Test 1: SSH Command Output (CLEAN)
```bash
ssh sdw1 "echo 12345"
# Output: 12345 (no contamination)
ssh -t sdw1 "echo 12345"
# Output: 12345 (no contamination)
```
Result: SSH is not the contamination vector. The interactive check in
`.bashrc` prevents `greenplum_path.sh` from being sourced during
non-interactive SSH sessions.
#### ✅ Test 2: Configuration Files in Segment Data Directories (CLEAN)
```bash
for host in cdw sdw1 sdw2 sdw3 sdw4; do
  ssh "$host" "find /data* -type f \( -name '*.conf' -o -name '*.opts' \) | \
    xargs grep -l 'NOTICE from the Apache Cloudberry PPMC' 2>/dev/null"
done
```
Result: No contaminated files found in segment data directories on a clean
installation.
#### 📋 Analysis of .bashrc Protection
All segment hosts have this protection in `.bashrc`:
```bash
# If not running interactively, don't do anything
case $- in
    *i*) ;;
      *) return;;
esac
# ... later ...
if [ -f /usr/local/cloudberry/greenplum_path.sh ]; then
    source /usr/local/cloudberry/greenplum_path.sh
fi
```
Because the `case` statement returns early in non-interactive shells,
`greenplum_path.sh` is never sourced during non-interactive SSH, which is why
the bug cannot be reproduced on clean systems.
---
### Root Cause Analysis
Based on the error message and investigation, the bug occurs when:
1. `gprecoverseg` reads a file that should contain numeric data (PID, port,
   dbid, etc.)
2. That file has been contaminated with the PPMC notice header text
3. Python attempts `int(file_content)` and raises `ValueError`

The contamination likely occurred during one of these scenarios:
#### Scenario A: Version Upgrade with Script Output Redirection
```bash
# During cluster upgrade or reconfiguration
source /usr/local/cloudberry/greenplum_path.sh  # PPMC notice prints to stdout
some_command_that_generates_config > /data/primary/gpseg7/some_file
# The notice is written into the file if it is emitted inside the redirected
# command, or if the stdout of the whole script is redirected
```
#### Scenario B: Custom Wrapper Scripts
A wrapper script that sources `greenplum_path.sh` with stdout redirected:
```bash
#!/bin/bash
source /usr/local/cloudberry/greenplum_path.sh > recovery_config.txt
# PPMC notice becomes first lines of recovery_config.txt
```
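
The capture mechanism behind these scenarios can be demonstrated without a
Cloudberry installation. This is a rough sketch under stated assumptions: the
script bodies passed to `capture_port` are hypothetical stand-ins for
`greenplum_path.sh`, the port value `5432` is illustrative, and a POSIX `sh`
is assumed to be on the `PATH`.

```python
import os
import subprocess
import tempfile

# Hypothetical stand-ins for greenplum_path.sh (not the real script).
BANNER_TO_STDOUT = 'echo "# NOTICE from the Apache Cloudberry PPMC"\n'
BANNER_TO_STDERR = 'echo "# NOTICE from the Apache Cloudberry PPMC" >&2\n'


def capture_port(env_script_body):
    """Source a stand-in environment script, then print a port number.

    Returns what the shell wrote to stdout, as seen by any caller that
    captures or redirects stdout.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = os.path.join(tmp, "fake_path.sh")
        with open(script, "w") as fh:
            fh.write(env_script_body)
        result = subprocess.run(
            ["sh", "-c", '. "$1"; echo 5432', "sh", script],
            capture_output=True,
            text=True,
            check=True,
        )
        return result.stdout


noisy = capture_port(BANNER_TO_STDOUT)  # banner precedes the number
quiet = capture_port(BANNER_TO_STDERR)  # only the number reaches stdout
```

With the banner on stdout, the captured text starts with the notice line and
`int(noisy)` fails; with the banner on stderr, `int(quiet)` parses cleanly.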
#### Scenario C: Earlier greenplum_path.sh Versions
Earlier versions of `greenplum_path.sh` may have:
- Printed to stdout unconditionally (not checking for interactive shells)
- Printed before the interactive check in certain initialization scenarios
- Been sourced from `/etc/profile` or `/etc/bash.bashrc` (which run before the
  `.bashrc` protection)
---
### Why This Bug is Hard to Reproduce
On clean, newly-initialized clusters:
- ✅ `.bashrc` interactive check prevents PPMC notice during SSH
- ✅ No contaminated files exist in data directories
- ✅ `greenplum_path.sh` notice is properly isolated from command output

The bug only affects systems where:
- ❌ Cluster was initialized/upgraded during a transition period with a
  different `greenplum_path.sh` behavior
- ❌ Custom scripts source `greenplum_path.sh` with redirected stdout
- ❌ Files were written during an interactive session that captured the notice
---
### Potential Files That Could Be Contaminated
Based on the error context (`/u00/cbdb/primary/gpseg7`), candidates include:
1. Recovery metadata files: `gprecoverseg`-specific configuration
2. `postmaster.opts`: may be parsed for port numbers
3. `postgresql.auto.conf`: auto-generated configuration
4. `internal.auto.conf`: internal Cloudberry settings
5. Custom recovery tracking files: created during previous recovery attempts
---
### Recommendations
#### For Users Experiencing This Bug
1. Identify contaminated files:
   ```bash
   # On affected segment host
   grep -r "NOTICE from the Apache Cloudberry PPMC" /u00/cbdb/primary/gpseg7/
   ```
2. Clean the contaminated files:
   - Remove the PPMC notice header lines
   - Restore from backup if available
   - Reinitialize the segment if necessary
3. Prevent recurrence:
   - Audit custom scripts that source `greenplum_path.sh`
   - Ensure stdout redirection doesn't capture notice text
   - Check `/etc/profile` and `/etc/bash.bashrc` for unsafe sourcing
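
Where `grep -r` is impractical (for example, very large data files), the
identification step could be scripted. The sketch below is a read-only scan;
`find_contaminated` and the `max_bytes` limit are hypothetical, and the limit
assumes the notice sits near the start of a contaminated file, which has not
been confirmed on an affected system.

```python
import os

MARKER = b"NOTICE from the Apache Cloudberry PPMC"


def find_contaminated(root, max_bytes=1 << 20):
    """Yield regular files under root whose first max_bytes contain MARKER."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and special files (sockets, FIFOs).
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                with open(path, "rb") as fh:
                    head = fh.read(max_bytes)
            except OSError:
                continue  # unreadable file: skip rather than abort the scan
            if MARKER in head:
                yield path
```

Usage would look like
`for path in find_contaminated("/u00/cbdb/primary/gpseg7"): print(path)`.
The scan only reports paths; it does not modify any file.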
#### For Cloudberry Development Team
1. Fix `greenplum_path.sh` to NEVER print to stdout:
   ```bash
   # Print notice to stderr instead
   echo "NOTICE..." >&2
   # OR: Only show in interactive shells whose stdout is a terminal
   if [[ $- == *i* ]] && [[ -t 1 ]]; then
       echo "NOTICE..."
   fi
   ```
2. Add input validation in `gprecoverseg`:
   ```python
   # Before: int(file_content)
   # After: validate first so the error names the unexpected content
   content = file_content.strip()
   if not content.isdigit():
       raise ValueError(f"Expected numeric value, got: {content[:50]!r}")
   value = int(content)
   ```
3. Add file content checks during recovery:
   - Detect unexpected content in configuration files
   - Provide clear error messages about contamination
   - Auto-clean known contamination patterns
4. Document the issue:
   - Add to upgrade notes
   - Include in troubleshooting guide
   - Warn about stdout redirection with `greenplum_path.sh`
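
The validation and contamination-detection recommendations could be combined
in one helper. This is a sketch only: `parse_numeric_field`,
`ContaminatedValueError`, and the `source` parameter are hypothetical names,
not part of the `gprecoverseg` code base, and which file is actually being
parsed has not yet been identified.

```python
MARKER = "NOTICE from the Apache Cloudberry PPMC"


class ContaminatedValueError(ValueError):
    """Raised when a numeric field holds the notice banner, not a number."""


def parse_numeric_field(raw, source="<unknown>"):
    """Parse raw text as a non-negative integer with a descriptive error.

    source is a label (for example a file path) included in error messages.
    """
    content = raw.strip()
    if MARKER in content:
        raise ContaminatedValueError(
            f"{source}: contains the PPMC notice text; a script probably "
            "wrote greenplum_path.sh output into it"
        )
    # isascii() guards against Unicode digits that isdigit() accepts
    # but int() rejects.
    if not (content.isascii() and content.isdigit()):
        raise ValueError(
            f"{source}: expected a numeric value, got {content[:50]!r}"
        )
    return int(content)
```

Subclassing `ValueError` keeps existing `except ValueError` handlers working
while letting callers distinguish the contamination case and report it
clearly.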
---
### Verification Needed
To fully confirm the root cause, we need:
1. Access to the affected system to examine contaminated files
2. `gprecoverseg` source code review to identify exactly which file it's
   parsing
3. Version history of when the PPMC notice was added to `greenplum_path.sh`
---
### Related Issues
This is similar to issues seen in other database systems where:
- Shell scripts print banners/notices to stdout
- Tools parse command output expecting clean numeric values
- SSH wrappers or automation scripts get contaminated output

Best Practice: Shell scripts that are sourced should NEVER print to stdout
unless that's their explicit purpose.