zhenyue-xu opened a new pull request, #9660:
URL: https://github.com/apache/seatunnel/pull/9660

   ### Purpose of this pull request
   
   This pull request restores configurable CSV delimiter support in the file 
connector. Previously, in PR #9066, the CSV delimiter configuration was removed 
with the suggestion to "use text format if you want custom delimiter." However, 
this approach has significant limitations:
   
   1. **Text format cannot handle standard CSV features** as defined in [RFC 
4180](https://datatracker.ietf.org/doc/html/rfc4180#page-2):
      - Fields enclosed in double quotes
      - Escaped quotes (`""` for literal `"`)
      - Multi-line fields within quotes
      - Delimiters within quoted fields
   
   2. **Text format only supports simple delimited data** without any 
CSV-specific parsing logic
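
The difference between the two formats can be illustrated outside SeaTunnel. The following is a minimal sketch using Python's standard `csv` module as a stand-in for an RFC 4180-aware parser. It is not the connector's code, and the sample record is made up for illustration.

```python
import csv

# One record using ';' as the delimiter. The first field contains the
# delimiter itself; the second contains escaped double quotes.
line = '"a;b";"say ""hi""";3'

# Text-style parsing: split on every delimiter, quotes are not interpreted.
naive_fields = line.split(";")

# CSV-style parsing: quotes protect the delimiter, and "" becomes a literal ".
csv_fields = next(csv.reader([line], delimiter=";"))

print(naive_fields)
print(csv_fields)
```

Plain splitting produces four fragments with stray quote characters, while the CSV parser returns the three intended fields.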
   
   This PR adds back the `csv_field_delimiter` configuration option to properly 
support CSV files with custom delimiters (semicolon, tab, pipe, etc.) while 
maintaining full CSV standard compliance.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, this PR introduces a new configuration option for CSV file reading:
   
   **New configuration:**
   - `csv_field_delimiter`: Configurable CSV field delimiter (default: `,`)
   
   
   This change is backward compatible - existing configurations without the 
`csv_field_delimiter` option will continue to use comma as the default 
delimiter.
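
The backward-compatible default can be sketched as follows. This is an illustrative Python model, not the connector's implementation: `read_rows` and the `options` dictionary are hypothetical, and only the option name `csv_field_delimiter` and its default `,` come from this PR.

```python
import csv
import io

DEFAULT_DELIMITER = ","  # default stated in this PR


def read_rows(text, options):
    """Hypothetical helper: parse CSV text using the configured delimiter."""
    # Fall back to a comma when the option is absent, so older configs still work.
    delimiter = options.get("csv_field_delimiter", DEFAULT_DELIMITER)
    return list(csv.reader(io.StringIO(text, newline=""), delimiter=delimiter))


# Existing config without the option: comma is used.
print(read_rows("a,b\n", {}))
# New config with a custom delimiter.
print(read_rows("a|b\n", {"csv_field_delimiter": "|"}))
```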
   
   ### How was this patch tested?
   
   The patch was tested with:
   
   1. **Unit tests** for the new configuration option parsing
   2. **Integration tests** with various CSV formats:
      - Default comma delimiter (backward compatibility)
      - Semicolon delimiter with quoted fields
      - Tab delimiter with escaped quotes
      - Pipe delimiter with multi-line fields
   
   3. **Manual testing** with real CSV files:
      ```
      "1";"b
      a";"10"
      "2";"b";"100"
      ```
   
   The issue is that Markdown's CSV syntax highlighting renders multi-line fields incorrectly; using a plain code block (with no language specified) avoids this.
   
   All tests confirmed that the CSV format properly handles standard CSV 
features while respecting the custom delimiter, unlike the text format which 
would incorrectly parse the above example.
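
To show why the manual test sample trips up line-based parsing, here is a small sketch that runs the same three lines through Python's standard `csv` module and through a plain per-line split. Both are stand-ins for the CSV and text formats, not SeaTunnel code.

```python
import csv
import io

# The manual test sample: the second field of the first record spans two lines.
sample = '"1";"b\na";"10"\n"2";"b";"100"\n'

# Text-style parsing: every physical line is treated as a record.
line_based = [line.split(";") for line in sample.splitlines()]

# CSV-style parsing: the quoted newline stays inside the field.
csv_rows = list(csv.reader(io.StringIO(sample, newline=""), delimiter=";"))

print(len(line_based), line_based[0])
print(len(csv_rows), csv_rows[0])
```

The per-line split yields three broken records, whereas the CSV parser yields two records of three fields each, with the line break preserved inside the second field of the first record.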
   
   ### Check list
   
   * [ ] If your PR adds any new Jar binary package, please add a License 
Notice according to the
     [New License 
Guide](https://github.com/apache/seatunnel/blob/dev/docs/en/contribution/new-license.md)
   * [ ] If necessary, please update the documentation to describe the new 
feature. https://github.com/apache/seatunnel/tree/dev/docs
   * [ ] If you are contributing the connector code, please check that the 
following files are updated:
     1. Update 
[plugin-mapping.properties](https://github.com/apache/seatunnel/blob/dev/plugin-mapping.properties)
 and add new connector information in it
     2. Update the pom file of 
[seatunnel-dist](https://github.com/apache/seatunnel/blob/dev/seatunnel-dist/pom.xml)
     3. Add ci label in 
[label-scope-conf](https://github.com/apache/seatunnel/blob/dev/.github/workflows/labeler/label-scope-conf.yml)
     4. Add e2e testcase in 
[seatunnel-e2e](https://github.com/apache/seatunnel/tree/dev/seatunnel-e2e/seatunnel-connector-v2-e2e/)
     5. Update connector 
[plugin_config](https://github.com/apache/seatunnel/blob/dev/config/plugin_config)

