davidzollo opened a new issue, #10358:
URL: https://github.com/apache/seatunnel/issues/10358

   ## Background
   
   HubSpot is a leading marketing automation and CRM platform used by over 
200,000 customers worldwide, particularly popular among small to mid-sized 
businesses. It provides comprehensive tools for marketing, sales, customer 
service, and content management.
   
   Currently, SeaTunnel lacks native support for HubSpot as a data source, 
preventing users from integrating CRM and marketing data with their data 
warehouses and analytics platforms.
   
   ## Motivation
   
   - **SMB Market Leader**: HubSpot is the dominant choice for small and 
medium-sized businesses globally
   - **Marketing Automation**: Critical source for marketing campaign data, 
lead tracking, and conversion analytics
   - **API-Only Access**: HubSpot exposes its data only through REST APIs; 
there is no JDBC or SQL interface
   - **Data-Driven Marketing**: Organizations need to analyze marketing 
performance, customer journeys, and ROI
   
   ## Proposed Solution
   
   Implement a dedicated HubSpot Source connector using the HubSpot REST API v3:
   
   ### Core Features
   
   1. **CRM Objects Support**
      - **Standard Objects**: Contacts, Companies, Deals, Tickets, Products, 
Line Items
      - **Custom Objects**: User-defined objects created in HubSpot
      - **Activities**: Emails, Calls, Meetings, Tasks, Notes
      - **Engagement Data**: Email opens, clicks, form submissions, page views
   
   2. **Marketing Data**
      - **Campaigns**: Email campaigns, ad campaigns, social media campaigns
      - **Forms**: Form submissions and field values
      - **Landing Pages**: Page analytics and conversion data
      - **Lists**: Contact lists and segmentation
      - **Workflows**: Automation workflow execution data
   
   3. **Data Extraction Modes**
      - **Full Snapshot**: Complete object/entity extraction
      - **Incremental**: Based on `lastmodifieddate` or `createdate`
      - **Association-Based**: Extract related objects (e.g., Contacts with 
their Deals)
   
   4. **Authentication**
      - **Private App Access Token**: Recommended for server-to-server 
integration
      - **OAuth 2.0**: For user-context integrations
      - **API Key** (Legacy): HubSpot sunset API keys in November 2022, so 
this mode may not be worth supporting for new integrations
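
The authentication modes differ mainly in how the credential is attached to each request. A minimal sketch, assuming `auth_type` values that mirror the options proposed in this issue (`build_auth` is a hypothetical helper, not existing SeaTunnel code): private app tokens and OAuth 2.0 access tokens are both sent as a bearer token, while the legacy API key was passed as the `hapikey` query parameter.

```python
from typing import Dict, Tuple


def build_auth(auth_type: str, credential: str) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Return (headers, query_params) to attach to one request.

    The auth_type names are illustrative and follow the proposed config,
    not an existing SeaTunnel option set.
    """
    if auth_type in ("private_app", "oauth2"):
        # Both token kinds travel as a standard bearer token.
        return {"Authorization": f"Bearer {credential}"}, {}
    if auth_type == "api_key":
        # Legacy scheme: the key travels in the query string.
        return {}, {"hapikey": credential}
    raise ValueError(f"unsupported auth_type: {auth_type!r}")
```

Failing on an unknown `auth_type` at configuration time keeps a misconfiguration from showing up later as a 401 in the middle of a job.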
   
   ### Configuration Example
   
   ```hocon
   source {
     HubSpot {
       # Authentication
       auth_type = "private_app" # or "oauth2", "api_key"
       access_token = "pat-na1-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
       
       # Object configuration
       # One of "contacts", "companies", "deals", "tickets", "custom_objects"
       object_type = "contacts"
       
       # Only used when object_type = "custom_objects"
       custom_object_name = "my_custom_object"
       
       # Extraction mode
       extraction_mode = "incremental" # or "full"
       
       # Properties to fetch
       properties = [
         "firstname",
         "lastname",
         "email",
         "company",
         "lifecyclestage",
         "createdate",
         "lastmodifieddate"
       ]
       # Alternatively, fetch all properties instead of listing them above
       # fetch_all_properties = true
       
       # Incremental configuration
       incremental_field = "lastmodifieddate"
       start_date = "2024-01-01T00:00:00Z"
       
       # Associations (relationships)
       include_associations = true
       association_types = ["contacts_to_companies", "contacts_to_deals"]
       
       # Performance tuning
       batch_size = 100
       max_concurrent_requests = 5
       rate_limit_per_second = 10 # standard limit: 100 requests per 10 s
       request_timeout_ms = 30000
       
       # Filtering
       filter_groups = [
         {
           filters = [
             {
               property_name = "lifecyclestage"
               operator = "EQ"
               value = "customer"
             },
             {
               property_name = "createdate"
               operator = "GT"
               value = "2024-01-01"
             }
           ]
         }
       ]
     }
   }
   ```
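
For illustration, the `filter_groups` block above could be translated into a request body for the CRM search endpoint (`POST /crm/v3/objects/{objectType}/search`). This is a sketch rather than the connector's implementation: the snake_case option names come from the proposed config, and `build_search_body` is a hypothetical helper. In the search API, filters inside one group are combined with AND and separate groups with OR.

```python
import json
from typing import Any, Dict, List, Optional


def build_search_body(
    filter_groups: List[Dict[str, Any]],
    properties: List[str],
    limit: int = 100,
    after: Optional[str] = None,
) -> Dict[str, Any]:
    """Translate snake_case connector options into a camelCase search body."""
    body: Dict[str, Any] = {
        "filterGroups": [
            {
                "filters": [
                    {
                        "propertyName": f["property_name"],
                        "operator": f["operator"],
                        "value": f["value"],
                    }
                    for f in group["filters"]
                ]
            }
            for group in filter_groups
        ],
        "properties": list(properties),
        "limit": limit,
    }
    if after is not None:
        # Cursor returned by the previous page, if any.
        body["after"] = after
    return body


# The filter group from the configuration example, as parsed config data.
example_groups = [
    {
        "filters": [
            {"property_name": "lifecyclestage", "operator": "EQ", "value": "customer"},
            {"property_name": "createdate", "operator": "GT", "value": "2024-01-01"},
        ]
    }
]
example_body = build_search_body(example_groups, ["firstname", "email"])
print(json.dumps(example_body, indent=2))
```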
   
   ### Marketing Data Example
   
   ```hocon
   source {
     HubSpot {
       auth_type = "private_app"
       access_token = "pat-na1-xxxxxxxx"
       
       # Extract email campaign data
       object_type = "marketing_emails"
       
       properties = [
         "id",
         "name",
         "subject",
         "campaign_name",
         "created",
         "updated",
         "send_time"
       ]
       
       # Include campaign statistics
       include_statistics = true # clicks, opens, bounces, etc.
       
       extraction_mode = "incremental"
       incremental_field = "updated"
       start_date = "2024-01-01T00:00:00Z"
     }
   }
   ```
   
   ### Custom Object Example
   
   ```hocon
   source {
     HubSpot {
       auth_type = "private_app"
       access_token = "pat-na1-xxxxxxxx"
       
       object_type = "custom_objects"
       custom_object_name = "2-12345678" # Custom object schema ID
       
       fetch_all_properties = true
       
       # Include associations with standard objects
       include_associations = true
       association_types = [
         "custom_to_contacts",
         "custom_to_companies"
       ]
     }
   }
   ```
   
   ## Expected Benefits
   
   1. **SMB Market Access**: Enable thousands of HubSpot users to integrate 
their data with SeaTunnel
   2. **Marketing Analytics**: Unlock marketing ROI analysis, attribution 
modeling, and customer journey analytics
   3. **Unified Customer View**: Combine CRM, marketing, and transactional data 
in a single data warehouse
   4. **Competitive Positioning**: Compete with commercial ETL tools like 
Fivetran, Stitch, and Airbyte Cloud
   5. **Ecosystem Growth**: Attract marketing teams and growth hackers to 
SeaTunnel
   
   ## Technical Considerations
   
   ### Dependencies
   - **HTTP Client**: Use Apache HttpClient or OkHttp for REST API calls
   - **JSON Processing**: Jackson or Gson for JSON serialization/deserialization
   - **OAuth Library**: If supporting OAuth 2.0 authentication
   - **Rate Limiting**: Implement token bucket or sliding window algorithm
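
As a rough illustration of the token bucket option, the sketch below refills tokens continuously and lets a caller either try to take one or ask how long to wait. The class and method names are hypothetical, and the clock is injectable only so the behavior can be tested without sleeping.

```python
import time
from typing import Callable


class TokenBucket:
    """Minimal single-threaded token bucket (illustrative, not SeaTunnel code)."""

    def __init__(
        self,
        capacity: int,
        refill_per_second: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.capacity = float(capacity)
        self.refill_per_second = refill_per_second
        self._clock = clock
        self._tokens = float(capacity)  # start full
        self._last = clock()

    def _refill(self) -> None:
        # Add tokens for the time elapsed since the last call, up to capacity.
        now = self._clock()
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_per_second)
        self._last = now

    def try_acquire(self) -> bool:
        """Take one token if available; return False instead of blocking."""
        self._refill()
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False

    def seconds_until_available(self) -> float:
        """How long a caller should wait before the next token exists."""
        self._refill()
        if self._tokens >= 1.0:
            return 0.0
        return (1.0 - self._tokens) / self.refill_per_second
```

One design caveat: a full bucket allows a burst of `capacity` requests on top of the steady refill, so against a fixed limit such as 100 requests per 10 seconds, `capacity + 10 * refill_per_second` should stay at or below 100 (for example `TokenBucket(capacity=10, refill_per_second=9.0)`). A sliding window counter avoids that trade-off at the cost of tracking request timestamps.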
   
   ### API Characteristics
   - **Rate Limits**: 
     - Standard: 100 requests per 10 seconds
     - Professional/Enterprise: Higher limits (150-200 req/10s)
     - Need exponential backoff for 429 responses
   
   - **Pagination**: 
     - Cursor-based pagination (`after` parameter)
     - Maximum 100 records per page
     - Need to handle `paging.next.after` token
   
   - **Incremental Extraction**:
     - Use `lastmodifieddate` or `createdate` properties
     - Filter by date ranges in search API
     - Store last successful timestamp in checkpoint
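
The pagination and checkpoint notes above can be sketched as follows. `fetch_page` is a hypothetical callable standing in for one HTTP request, so the example runs without network access. The response shape (`results` plus an optional `paging.next.after`) follows the description above, and the checkpoint helper assumes timestamps are uniformly formatted UTC ISO 8601 strings, which compare correctly as plain strings.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, Optional

Page = Dict[str, Any]
Record = Dict[str, Any]


def iter_records(fetch_page: Callable[[Optional[str]], Page]) -> Iterator[Record]:
    """Yield every record, following paging.next.after until it is absent."""
    after: Optional[str] = None
    while True:
        page = fetch_page(after)
        yield from page.get("results", [])
        after = page.get("paging", {}).get("next", {}).get("after")
        if after is None:
            break


def latest_timestamp(
    records: Iterable[Record], field: str = "lastmodifieddate"
) -> Optional[str]:
    """Return the greatest timestamp seen, to be stored as the next checkpoint."""
    values = [
        r["properties"][field]
        for r in records
        if r.get("properties", {}).get(field)
    ]
    return max(values) if values else None


# Two fake pages keyed by the cursor that requests them (test data only).
FAKE_PAGES: Dict[Optional[str], Page] = {
    None: {
        "results": [
            {"id": "1", "properties": {"lastmodifieddate": "2024-01-02T00:00:00.000Z"}}
        ],
        "paging": {"next": {"after": "1"}},
    },
    "1": {
        "results": [
            {"id": "2", "properties": {"lastmodifieddate": "2024-01-05T00:00:00.000Z"}}
        ]
    },
}

all_records = list(iter_records(lambda after: FAKE_PAGES[after]))
checkpoint = latest_timestamp(all_records)
```

Persisting the checkpoint only after a batch has been handed off successfully keeps a restart from skipping records. The cost is that some records may be read twice, so downstream writes should tolerate duplicates.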
   
   ### Error Handling
   - **429 Too Many Requests**: Exponential backoff, honoring the 
`Retry-After` header when present
   - **401/403 Authentication**: Fail fast with clear error message
   - **400 Bad Request**: Validate property names and filter syntax
   - **5xx Server Errors**: Retry with exponential backoff
   - **Network Errors**: Configurable retry strategy
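
One way the retry rules above might be expressed is shown below. The function names, default base delay, and cap are assumptions chosen for illustration. Only the numeric (seconds) form of `Retry-After` is handled, and jitter is added so that parallel readers do not retry in lockstep.

```python
import random
from typing import Optional


def should_retry(status: int) -> bool:
    """Retry on rate limiting and server errors; fail fast on everything else."""
    return status == 429 or 500 <= status <= 599


def backoff_delay(
    attempt: int,
    retry_after: Optional[str] = None,
    base: float = 0.5,
    cap: float = 30.0,
    jitter: bool = True,
) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        try:
            # Trust the server's hint when it is a plain number of seconds.
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # HTTP-date form is not handled in this sketch
    delay = min(cap, base * (2 ** attempt))
    # "Full jitter": pick uniformly between 0 and the exponential delay.
    return random.uniform(0.0, delay) if jitter else delay
```

With these rules, 401, 403, and 400 responses are never retried, which matches the fail-fast behavior listed above.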
   
   ### Testing
   - **HubSpot Developer Account**: Free tier available for testing
   - **Test Sandbox**: HubSpot provides sandbox portals for enterprise customers
   - **Mock Server**: Create mock API server for unit tests
   - **Integration Tests**: Use real HubSpot account with test data
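
A mock API server for unit tests could be as small as the sketch below, which uses only the Python standard library. The accepted token, the canned contact record, and the helper names are test fixtures invented for this example. Only the `GET /crm/v3/objects/contacts` path is served.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Any, Dict, Tuple


class MockHubSpotHandler(BaseHTTPRequestHandler):
    """Serves one canned contacts page and rejects unknown tokens."""

    def do_GET(self) -> None:
        if self.headers.get("Authorization") != "Bearer test-token":
            self._send(401, {"status": "error", "message": "unauthorized"})
        elif self.path.startswith("/crm/v3/objects/contacts"):
            self._send(200, {"results": [{"id": "1", "properties": {"email": "a@example.com"}}]})
        else:
            self._send(404, {"status": "error", "message": "not found"})

    def _send(self, status: int, payload: Dict[str, Any]) -> None:
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format: str, *args: Any) -> None:
        pass  # keep test output quiet


def start_mock_server() -> HTTPServer:
    """Bind to an ephemeral local port and serve in a background thread."""
    server = HTTPServer(("127.0.0.1", 0), MockHubSpotHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def get_json(port: int, path: str, token: str) -> Tuple[int, Dict[str, Any]]:
    """Issue one GET against the mock server and decode the JSON body."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request("GET", path, headers={"Authorization": f"Bearer {token}"})
        response = conn.getresponse()
        return response.status, json.loads(response.read())
    finally:
        conn.close()
```

A test would call `start_mock_server()`, point the client at `server.server_address[1]`, and call `server.shutdown()` when finished.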
   
   ## Implementation Phases
   
   ### Phase 1: Core CRM Objects (MVP)
   - Private App authentication
   - Contacts, Companies, Deals objects
   - Full snapshot and incremental extraction
   - Basic property selection and filtering
   
   ### Phase 2: Marketing Data
   - Email campaigns and statistics
   - Forms and submissions
   - Landing pages and analytics
   - Lists and segmentation
   
   ### Phase 3: Advanced Features
   - Custom objects support
   - Associations/relationships
   - OAuth 2.0 authentication
   - Advanced filtering and search
   
   ### Phase 4: Enterprise Features
   - Batch property updates (if needed for sink)
   - Webhook-based CDC (using HubSpot webhooks)
   - Multi-portal support
   - Data quality validation
   
   ## References
   
   - [HubSpot API 
Documentation](https://developers.hubspot.com/docs/api/overview)
   - [CRM Objects 
API](https://developers.hubspot.com/docs/api/crm/understanding-the-crm)
   - [Search API](https://developers.hubspot.com/docs/api/crm/search)
   - [Associations 
API](https://developers.hubspot.com/docs/api/crm/associations)
   - [Marketing Events 
API](https://developers.hubspot.com/docs/api/marketing/marketing-events)
   - [API Usage 
Guidelines](https://developers.hubspot.com/docs/api/usage-details)
   
   ## Community Impact
   
   This connector will:
   - Make SeaTunnel accessible to the SMB market segment
   - Enable data-driven marketing and sales analytics
   - Provide an open-source alternative to expensive commercial ETL solutions
   - Attract marketing operations professionals to the Apache SeaTunnel 
community
   
   ---
   
   **Priority**: Medium-High  
   **Estimated Effort**: Medium  
   **Target Release**: 2.3.15 or 3.0.0

