davidzollo opened a new issue, #10356:
URL: https://github.com/apache/seatunnel/issues/10356

   ## Background
   
   Salesforce is the world's leading Customer Relationship Management (CRM) 
platform with over 20% market share globally. It serves as the single source of 
truth for customer data, sales opportunities, service cases, and marketing 
campaigns across millions of enterprises.
   
   Currently, SeaTunnel lacks native support for Salesforce as a data source, 
preventing users from building data pipelines that integrate CRM data with 
their data warehouses and analytics platforms.
   
   ## Motivation
   
   - **Market Leader**: Salesforce dominates the enterprise CRM space with the 
largest user base globally
   - **API-Only Access**: Salesforce data is reachable only through its APIs 
(REST, SOAP, Bulk, Streaming); there is no native JDBC access
   - **Critical Business Data**: Organizations need to sync CRM data (accounts, 
contacts, opportunities, cases, etc.) to data warehouses for analytics
   - **Real-Time Integration**: Users need both batch extraction and change 
data capture (CDC) via streaming APIs
   
   ## Proposed Solution
   
   Implement a dedicated Salesforce Source connector built on the Salesforce 
REST API and Bulk API 2.0, with the Streaming API for CDC:
   
   ### Core Features
   
   1. **Multiple API Support**
      - REST API for real-time queries and small datasets
      - Bulk API 2.0 for large-scale data extraction (millions of records)
      - Streaming API for real-time change data capture (CDC)
      - Support for SOQL (Salesforce Object Query Language)
   
   2. **Object Support**
      - Standard objects (Account, Contact, Lead, Opportunity, Case, etc.)
      - Custom objects
      - Metadata discovery and schema inference
      - Relationship traversal (parent-child, lookup, master-detail)
   
   3. **Data Extraction Modes**
      - **Full Snapshot**: Extract complete object data
      - **Incremental**: Extract records modified after a specific timestamp
      - **CDC**: Real-time streaming of change events via PushTopic or Change 
Data Capture
   
   4. **Authentication**
      - OAuth 2.0 (Authorization Code, JWT Bearer, Client Credentials)
      - Username-Password flow (for development/testing)
      - Connected App integration
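
As a rough illustration of the REST path listed above: the REST query endpoint returns results in pages, each carrying a `done` flag and, when more data remains, a `nextRecordsUrl`. The sketch below is Python for brevity (the connector itself would be written in Java). `fetch_json` is a hypothetical callable standing in for an authenticated HTTP GET, and `iter_query_records` is an illustrative name, not an existing SeaTunnel API.

```python
from typing import Any, Callable, Dict, Iterator
from urllib.parse import quote


def iter_query_records(
    fetch_json: Callable[[str], Dict[str, Any]],
    soql: str,
    api_version: str = "v59.0",
) -> Iterator[Dict[str, Any]]:
    """Yield every record of a SOQL query, following nextRecordsUrl pages.

    fetch_json is a hypothetical helper: it takes a path relative to the
    instance URL and returns the decoded JSON response body.
    """
    path = f"/services/data/{api_version}/query?q={quote(soql)}"
    while path:
        page = fetch_json(path)
        yield from page.get("records", [])
        # When "done" is true there is no further page to request.
        path = None if page.get("done", True) else page.get("nextRecordsUrl")
```

Injecting the fetch function keeps the paging logic testable against a mock API server without network access.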
   
   ### Configuration Example
   
   ```hocon
   source {
     Salesforce {
       # Authentication
       auth_type = "oauth2_jwt"
       client_id = "your_connected_app_client_id"
       client_secret = "your_client_secret"
       username = "[email protected]"
       private_key_file = "/path/to/private-key.pem"
       
       # Instance configuration
       instance_url = "https://yourinstance.salesforce.com"
       api_version = "v59.0"
       
       # Data extraction
       object_name = "Account"
       extraction_mode = "incremental" # or "full", "cdc"
       
       # Query configuration
       soql_query = "SELECT Id, Name, Industry, AnnualRevenue FROM Account WHERE CreatedDate > LAST_N_DAYS:30"
       # or, instead of soql_query, select fields and a filter:
       # fields = ["Id", "Name", "Industry", "AnnualRevenue"]
       # filter = "CreatedDate > LAST_N_DAYS:30"
       
       # Incremental configuration
       incremental_field = "LastModifiedDate"
       start_date = "2024-01-01T00:00:00Z"
       
       # Performance tuning
       batch_size = 2000
       max_retries = 3
       request_timeout_ms = 60000
       
       # Schema options
       include_deleted = false
       flatten_relationships = true
     }
   }
   ```
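
To make the incremental options more concrete, here is a minimal sketch of how `object_name`, `fields`, `filter`, `incremental_field`, and `start_date` might be combined into one SOQL query. How the connector would actually merge these options is an assumption, and `build_incremental_soql` is an illustrative name. Python is used for brevity.

```python
from typing import Optional, Sequence


def build_incremental_soql(
    object_name: str,
    fields: Sequence[str],
    incremental_field: str,
    start_date: str,
    extra_filter: Optional[str] = None,
) -> str:
    """Build a SOQL query for records changed after start_date.

    start_date is an ISO 8601 timestamp; SOQL datetime literals are
    written without quotes.
    """
    conditions = [f"{incremental_field} > {start_date}"]
    if extra_filter:
        # Parenthesize the user filter so AND/OR precedence is preserved.
        conditions.append(f"({extra_filter})")
    return (
        f"SELECT {', '.join(fields)} FROM {object_name} "
        f"WHERE {' AND '.join(conditions)} "
        f"ORDER BY {incremental_field} ASC"
    )
```

Ordering by the incremental field would let a reader checkpoint the largest value seen and use it as the next `start_date`.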
   
   ### CDC Configuration Example
   
   ```hocon
   source {
     Salesforce {
       auth_type = "oauth2_jwt"
       # ... authentication config ...
       
       extraction_mode = "cdc"
       object_name = "Opportunity"
       
       # CDC options
       cdc_type = "change_data_capture" # or "push_topic"
       replay_id = -1 # -1 for new events, -2 for all retained events
       
       # For PushTopic
       push_topic_name = "/topic/OpportunityUpdates"
     }
   }
   ```
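
A small sketch of two details behind these CDC options, again in Python for brevity. It assumes the connector derives the Change Data Capture channel from `object_name` and resumes from a checkpointed replay ID when one exists. Both function names are hypothetical, and the resume policy is an assumption rather than settled design.

```python
from typing import Optional

REPLAY_NEW_EVENTS = -1    # only events published after subscribing
REPLAY_ALL_RETAINED = -2  # every event still inside the retention window


def change_event_channel(object_name: str) -> str:
    """Return the Change Data Capture channel for one object.

    Standard objects use <Object>ChangeEvent; custom objects replace the
    __c suffix with __ChangeEvent.
    """
    if object_name.endswith("__c"):
        return f"/data/{object_name[:-3]}__ChangeEvent"
    return f"/data/{object_name}ChangeEvent"


def resume_replay_id(configured: int, checkpointed: Optional[int]) -> int:
    """Prefer the checkpointed replay ID; fall back to the configured one."""
    return checkpointed if checkpointed is not None else configured
```

Subscribing with a stored replay ID delivers the events that follow it, so persisting the last processed ID in the checkpoint state is one way to avoid gaps after a restart.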
   
   ## Expected Benefits
   
   1. **Enterprise Integration**: Enable thousands of Salesforce customers to 
use SeaTunnel for data integration
   2. **Complete Data Access**: Support all Salesforce objects and relationship 
types
   3. **High Performance**: Bulk API 2.0 can extract millions of records 
efficiently
   4. **Real-Time Capabilities**: CDC support enables near-real-time data 
synchronization
   5. **Ecosystem Growth**: Position SeaTunnel as a viable alternative to 
commercial ETL tools such as Fivetran and Airbyte Cloud
   
   ## Technical Considerations
   
   - **Dependencies**: 
     - Salesforce REST API client library or custom HTTP client
     - OAuth 2.0 library for authentication
     - Jackson/Gson for JSON parsing
   
   - **Rate Limiting**: 
     - Implement exponential backoff for API limits
     - Support for concurrent API call tracking
     - Configurable request throttling
   
   - **Error Handling**:
     - Handle API errors (INVALID_SESSION_ID, REQUEST_LIMIT_EXCEEDED, etc.)
     - Retry logic with configurable strategies
     - Failed record tracking and logging
   
   - **Testing**:
     - Salesforce Developer Edition org for integration tests
     - Mock API server for unit tests
     - Support for Salesforce scratch orgs in CI/CD
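
The rate-limiting and retry items above could be sketched roughly as follows, in Python for brevity. The sketch uses exponential backoff with full jitter; the default values, `RetryableApiError`, and both function names are assumptions for illustration, not part of any existing SeaTunnel or Salesforce API.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class RetryableApiError(Exception):
    """Hypothetical error type for retryable failures (e.g. request limits)."""


def backoff_delay_ms(attempt: int, base_ms: int = 1000, cap_ms: int = 60000) -> int:
    """Upper bound of the wait before retry number `attempt` (0-based)."""
    return min(cap_ms, base_ms * (2 ** attempt))


def call_with_retries(
    call: Callable[[], T],
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> T:
    """Run call(), retrying RetryableApiError with full-jitter backoff."""
    rng = rng or random.Random()
    for attempt in range(max_retries + 1):
        try:
            return call()
        except RetryableApiError:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error
            # Full jitter: wait a random time up to the exponential bound.
            sleep(rng.uniform(0, backoff_delay_ms(attempt)) / 1000.0)
    raise AssertionError("unreachable")
```

Passing `sleep` and `rng` as parameters keeps the retry loop deterministic in unit tests.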
   
   ## Implementation Phases
   
   ### Phase 1: Basic Support (MVP)
   - OAuth 2.0 authentication
   - REST API for full snapshot extraction
   - Standard object support with SOQL queries
   - Basic schema inference
   
   ### Phase 2: Enterprise Features
   - Bulk API 2.0 for large-scale extraction
   - Incremental extraction by modified date
   - Custom object support
   - Advanced field mapping and transformations
   
   ### Phase 3: Real-Time CDC
   - Streaming API integration
   - Change Data Capture events
   - PushTopic support
   - Exactly-once semantics
   
   ## References
   
   - [Salesforce REST API Developer 
Guide](https://developer.salesforce.com/docs/atlas.en-us.api_rest.meta/api_rest/)
   - [Bulk API 2.0 
Documentation](https://developer.salesforce.com/docs/atlas.en-us.api_asynch.meta/api_asynch/bulk_api_2_0.htm)
   - [Change Data Capture Developer 
Guide](https://developer.salesforce.com/docs/atlas.en-us.change_data_capture.meta/change_data_capture/)
   - [SOQL and SOSL 
Reference](https://developer.salesforce.com/docs/atlas.en-us.soql_sosl.meta/soql_sosl/)
   
   ## Community Impact
   
   This connector will:
   - Make SeaTunnel competitive with commercial ETL tools in the CRM 
integration space
   - Enable data-driven decision making for sales, marketing, and customer 
service teams
   - Attract enterprise users who need reliable Salesforce integration
   
   ---
   
   **Priority**: High  
   **Estimated Effort**: Medium-High  
   **Target Release**: 2.3.14 or 3.0.0

