edespino commented on issue #1404:
URL: https://github.com/apache/cloudberry/issues/1404#issuecomment-3419282843

   ## Investigation Summary: gprecoverseg PPMC Notice Parsing Error
   
   ### Bug Report
   User reported `gprecoverseg` failure with the following error:
   
   20251017:10:58:21:792595 gprecoverseg:cbdb01:gpadmin-[CRITICAL]:-gprecoverseg failed.
   (Reason='invalid literal for int() with base 10: "# --------------------------------------------------------------------\n
   NOTICE from the Apache Cloudberry PPMC\n# --------------------------------------------------------------------\n
   
   This file u') exiting...
   
   
   The error indicates that `gprecoverseg` attempted to parse a string 
containing the Apache Cloudberry PPMC notice header as an integer, causing a 
Python `ValueError`.
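
The failure is straightforward to reproduce in isolation. The following is a minimal sketch, not code taken from `gprecoverseg`: `try_parse` is a hypothetical helper, and the `contaminated` string is an illustrative stand-in for whatever the tool actually read.

```python
def try_parse(raw):
    """Return (value, error_message); exactly one of the two is None."""
    try:
        return int(raw), None
    except ValueError as exc:
        return None, str(exc)


# What the tool presumably expects: a bare integer, possibly with a newline.
clean = "12345\n"

# Hypothetical contaminated content: notice text ahead of the number.
contaminated = "# NOTICE from the Apache Cloudberry PPMC\n12345\n"

print(try_parse(clean))         # (12345, None)
print(try_parse(contaminated))  # value is None; the message quotes the notice
```

`int()` tolerates surrounding whitespace, so a trailing newline alone would not trigger the error; the notice text is what breaks the conversion.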
   
   ---
   
   ### Investigation Process
   
   We conducted systematic testing to identify the root cause:
   
   #### ✅ Test 1: SSH Command Output (CLEAN)
   ```bash
   ssh sdw1 "echo 12345"
   # Output: 12345 (no contamination)
   
   ssh -t sdw1 "echo 12345"
   # Output: 12345 (no contamination)
   ```
   
   Result: SSH is NOT the contamination vector. The .bashrc interactive check 
prevents greenplum_path.sh from executing during non-interactive SSH sessions.
   
   #### ✅ Test 2: Configuration Files in Segment Data Directories (CLEAN)
   
   ```bash
   for host in cdw sdw1 sdw2 sdw3 sdw4; do
       ssh $host "find /data* -type f \( -name '*.conf' -o -name '*.opts' \) | \
       xargs grep -l 'NOTICE from the Apache Cloudberry PPMC' 2>/dev/null"
   done
   ```
   
   Result: No contaminated files found in segment data directories on a clean 
installation.
   
   #### 📋 Analysis of .bashrc Protection
   
   All segment hosts have this protection in .bashrc:
   ```bash
   # If not running interactively, don't do anything
   case $- in
       *i*) ;;
         *) return;;
   esac
   
   # ... later ...
   if [ -f /usr/local/cloudberry/greenplum_path.sh ]; then
     source /usr/local/cloudberry/greenplum_path.sh
   fi
   ```
   
   This prevents greenplum_path.sh from executing during non-interactive SSH, 
which is why the bug cannot be reproduced on clean systems.
   
   ---
   ### Root Cause Analysis
   
   Based on the error message and investigation, the bug occurs when:
   
   1. gprecoverseg reads a file that should contain numeric data (PID, port, 
dbid, etc.)
   2. That file has been contaminated with the PPMC notice header text
   3. Python attempts int(file_content) and raises ValueError
   
   The contamination likely occurred during one of these scenarios:
   
   #### Scenario A: Version Upgrade with Script Output Redirection
   
   ```bash
   # During cluster upgrade or reconfiguration, with one redirection covering both steps
   {
       source /usr/local/cloudberry/greenplum_path.sh  # PPMC notice prints to stdout
       some_command_that_generates_config
   } > /data/primary/gpseg7/some_file
   # PPMC notice gets written into the file ahead of the command's output
   ```
   
   #### Scenario B: Custom Wrapper Scripts
   
   A wrapper script that sources greenplum_path.sh with stdout redirected:
   ```bash
   #!/bin/bash
   source /usr/local/cloudberry/greenplum_path.sh > recovery_config.txt
   # PPMC notice becomes first lines of recovery_config.txt
   ```
   
   #### Scenario C: Earlier greenplum_path.sh Versions
   
   Earlier versions of greenplum_path.sh may have:
   - Printed to stdout unconditionally (not checking for interactive shells)
   - Printed before the interactive check in certain initialization scenarios
   - Been sourced from /etc/profile or /etc/bash.bashrc (runs before .bashrc 
protection)
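
The mechanism shared by Scenarios A and B can be demonstrated without a cluster. In this sketch a child Python process stands in for a shell that sources `greenplum_path.sh` and then emits a value. The port number 5432 and the `run_child` helper are assumptions made purely for illustration.

```python
import subprocess
import sys

NOTICE = "NOTICE from the Apache Cloudberry PPMC"


def run_child(notice_stream):
    """Run a child that prints the notice, then a number, and capture stdout.

    notice_stream is "stdout" or "stderr" and selects where the child
    writes the notice. Only stdout is captured, as with a `>` redirection.
    """
    child_code = (
        "import sys\n"
        f"print({NOTICE!r}, file=sys.{notice_stream})\n"
        "print(5432)\n"  # hypothetical numeric payload
    )
    result = subprocess.run(
        [sys.executable, "-c", child_code],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        text=True,
        check=True,
    )
    return result.stdout


# Notice on stdout: the captured text starts with the notice, so int() fails.
captured_with_stdout_notice = run_child("stdout")

# Notice on stderr: the captured text is just the number.
captured_with_stderr_notice = run_child("stderr")
print(int(captured_with_stderr_notice))  # 5432
```

The only difference between the two runs is the stream the notice is written to, which is why moving the notice off stdout keeps captured output parseable.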
   
   ---
   ### Why This Bug is Hard to Reproduce
   
   On clean, newly-initialized clusters:
   - ✅ .bashrc interactive check prevents PPMC notice during SSH
   - ✅ No contaminated files exist in data directories
   - ✅ greenplum_path.sh notice is properly isolated from command output
   
   The bug only affects systems where:
   - ❌ Cluster was initialized/upgraded during a transition period with a 
different greenplum_path.sh behavior
   - ❌ Custom scripts source greenplum_path.sh with redirected stdout
   - ❌ Files were written during an interactive session that captured the notice
   
   ---
   ### Potential Files That Could Be Contaminated
   
   Based on the error context (/u00/cbdb/primary/gpseg7), candidates include:
   
   1. Recovery metadata files - gprecoverseg-specific configuration
   2. postmaster.opts - May be parsed for port numbers
   3. postgresql.auto.conf - Auto-generated configuration
   4. internal.auto.conf - Internal Cloudberry settings
   5. Custom recovery tracking files - Created during previous recovery attempts
   
   ---
   ### Recommendations
   
   #### For Users Experiencing This Bug
   
   1. Identify contaminated files:
      ```bash
      # On affected segment host
      grep -r "NOTICE from the Apache Cloudberry PPMC" /u00/cbdb/primary/gpseg7/
      ```
   2. Clean the contaminated files:
     - Remove the PPMC notice header lines
     - Restore from backup if available
     - Reinitialize the segment if necessary
   3. Prevent recurrence:
     - Audit custom scripts that source greenplum_path.sh
     - Ensure stdout redirection doesn't capture notice text
     - Check /etc/profile, /etc/bash.bashrc for unsafe sourcing
   
   #### For Cloudberry Development Team
   
   1. Fix greenplum_path.sh to NEVER print to stdout:
      ```bash
      # Print notice to stderr instead
      echo "NOTICE..." >&2
      
      # OR: only show it in interactive shells whose stdout is a terminal,
      # so the notice is skipped whenever stdout is redirected
      if [[ $- == *i* ]] && [[ -t 1 ]]; then
          echo "NOTICE..."
      fi
      ```
   2. Add input validation in gprecoverseg:
      ```python
      # Before: int(file_content)
      # After: validate first so the error names the unexpected content
      content = file_content.strip()
      if not content.isdigit():
          raise ValueError(f"Expected numeric value, got: {content[:50]!r}")
      value = int(content)
      ```
   3. Add file content checks during recovery:
     - Detect unexpected content in configuration files
     - Provide clear error messages about contamination
     - Auto-clean known contamination patterns
   4. Document the issue:
     - Add to upgrade notes
     - Include in troubleshooting guide
     - Warn about stdout redirection with greenplum_path.sh
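
Recommendations 2 and 3 could be combined into one small helper. The sketch below is an assumption about how such validation might look, not the actual `gprecoverseg` code: `parse_int_field`, `ContaminatedValueError`, and the `source` label are all hypothetical names. It uses an ASCII-only pattern instead of `str.isdigit()`, because `isdigit()` also accepts characters such as superscript digits that `int()` rejects.

```python
import re

MARKER = "NOTICE from the Apache Cloudberry PPMC"  # text seen in the report
_INTEGER = re.compile(r"[0-9]+")


class ContaminatedValueError(ValueError):
    """A value that should be a bare integer contains unexpected text."""


def parse_int_field(raw, source):
    """Parse raw as a non-negative integer or raise a descriptive error.

    source is a human-readable label (file path or command) that is used
    only to make the error message actionable.
    """
    text = raw.strip()
    if _INTEGER.fullmatch(text):
        return int(text)
    hint = ""
    if MARKER in raw:
        hint = " (contains the PPMC notice; was greenplum_path.sh output captured?)"
    raise ContaminatedValueError(
        f"{source}: expected an integer, got {text[:50]!r}{hint}"
    )
```

Because `ContaminatedValueError` subclasses `ValueError`, existing `except ValueError` handlers would keep working while the message gains the source label and a contamination hint.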
   
   ---
   ### Verification Needed
   
   To fully confirm the root cause, we need:
   
   1. Access to the affected system to examine contaminated files
   2. gprecoverseg source code review to identify exactly which file it's 
parsing
   3. Version history of when the PPMC notice was added to greenplum_path.sh
   
   ---
   ### Related Issues
   
   This is similar to issues seen in other database systems where:
   - Shell scripts print banners/notices to stdout
   - Tools parse command output expecting clean numeric values
   - SSH wrappers or automation scripts get contaminated output
   
   Best Practice: Shell scripts that are sourced should NEVER print to stdout 
unless that's their explicit purpose.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

