joonseolee opened a new pull request, #9898:
URL: https://github.com/apache/seatunnel/pull/9898

   ### Purpose of this pull request
   
   - Add and refine Word (.docx) reading via `WordReadStrategy`.
   - Output schema (10 fields): `element_id`, `element_type`, `text_content`,
     `font_style`, `underline_style`, `font_size`, `font_family`, `text_color`,
     `alignment`, `hyperlink_url`.
   - Process body elements (paragraphs and tables) in document order.
     Footnote text is included within the referencing paragraph’s `text_content`.
   - Due to Apache POI limitations, the minimal extractable unit is a
     paragraph. Run-level styles are aggregated at the paragraph level:
     - `font_style`: NORMAL/BOLD/ITALIC/BOLD_ITALIC
     - `underline_style`: null or concrete style (e.g., SINGLE)
     - `font_size`, `font_family`: first encountered values or null
     - `text_color`: defaults to "000000" when absent
     - `hyperlink_url`: all links in a paragraph concatenated with commas
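
The aggregation rules above can be sketched in plain Java. This is a minimal illustration, not the code in this PR: `RunStyle`, `firstNonNull`, and `aggregate` are hypothetical names, and `RunStyle` only stands in for whatever run-level properties the reader pulls from Apache POI. The sketch also assumes that a paragraph counts as bold or italic if any run is, that `underline_style` and `text_color` take the first value encountered, and that `hyperlink_url` is `null` when the paragraph has no links; the description does not spell these out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ParagraphStyleSketch {

    /** Hypothetical stand-in for the run-level properties read from the document. */
    static final class RunStyle {
        final boolean bold;
        final boolean italic;
        final String underline;  // e.g. "SINGLE"; null when the run is not underlined
        final Integer fontSize;  // null when not set on the run
        final String fontFamily; // null when not set on the run
        final String color;      // hex RGB; null when not set on the run
        final String hyperlink;  // null when the run is not part of a hyperlink

        RunStyle(boolean bold, boolean italic, String underline, Integer fontSize,
                 String fontFamily, String color, String hyperlink) {
            this.bold = bold;
            this.italic = italic;
            this.underline = underline;
            this.fontSize = fontSize;
            this.fontFamily = fontFamily;
            this.color = color;
            this.hyperlink = hyperlink;
        }
    }

    /** Keeps the value already found; otherwise takes the candidate (which may be null). */
    static <T> T firstNonNull(T current, T candidate) {
        return current != null ? current : candidate;
    }

    /**
     * Collapses run-level styles into one paragraph-level result, returned as
     * {font_style, underline_style, font_size, font_family, text_color, hyperlink_url}.
     */
    static Object[] aggregate(List<RunStyle> runs) {
        boolean bold = false;
        boolean italic = false;
        String underline = null;
        Integer fontSize = null;
        String fontFamily = null;
        String color = null;
        List<String> links = new ArrayList<>();

        for (RunStyle run : runs) {
            bold |= run.bold;     // assumption: one bold run marks the paragraph bold
            italic |= run.italic; // assumption: same rule for italic
            underline = firstNonNull(underline, run.underline);
            fontSize = firstNonNull(fontSize, run.fontSize);
            fontFamily = firstNonNull(fontFamily, run.fontFamily);
            color = firstNonNull(color, run.color);
            if (run.hyperlink != null) {
                links.add(run.hyperlink);
            }
        }

        String fontStyle = bold && italic ? "BOLD_ITALIC"
                : bold ? "BOLD"
                : italic ? "ITALIC"
                : "NORMAL";
        String textColor = color != null ? color : "000000"; // default when absent
        String hyperlinkUrl = links.isEmpty() ? null : String.join(",", links);

        return new Object[] {fontStyle, underline, fontSize, fontFamily, textColor, hyperlinkUrl};
    }

    public static void main(String[] args) {
        List<RunStyle> runs = Arrays.asList(
                new RunStyle(true, false, null, null, null, null, "https://a.example"),
                new RunStyle(false, true, "SINGLE", 12, "Arial", null, "https://b.example"));
        System.out.println(Arrays.toString(aggregate(runs)));
    }
}
```

Taking the first encountered value keeps the result deterministic for mixed-format paragraphs, at the cost of ignoring later runs that use a different font or color.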
   
   ### Does this PR introduce any user-facing change?
   
   Yes. The Word reader’s output schema is simplified to the 10 fields listed
   above. Formatting attributes that are not explicitly set now return `null`,
   except `text_color`, which defaults to `"000000"`. Elements are emitted in
   document order, and hyperlinks are aggregated per paragraph.
   
   ### How was this patch tested?
   
   - Added `WordReadStrategyTest` to validate all 10 fields against a sample
     `.docx`.
   - Verified:
     - Paragraph rows contain text and aggregated formatting/links.
     - Table rows produce a single text blob per table; formatting-related
       fields are `null`.
   
   ### Check list
   
   * [x] If your PR adds any new Jar binary package, please add a License
     Notice according to the
     [New License Guide](https://github.com/apache/seatunnel/blob/dev/docs/en/contribution/new-license.md)
   * [ ] If necessary, please update the documentation to describe the new
     feature. https://github.com/apache/seatunnel/tree/dev/docs
   * [ ] If you are contributing connector code, please check that the
     following files are updated:
     1. Update [plugin-mapping.properties](https://github.com/apache/seatunnel/blob/dev/plugin-mapping.properties) and add the new connector information to it
     2. Update the pom file of [seatunnel-dist](https://github.com/apache/seatunnel/blob/dev/seatunnel-dist/pom.xml)
     3. Add a CI label in [label-scope-conf](https://github.com/apache/seatunnel/blob/dev/.github/workflows/labeler/label-scope-conf.yml)
     4. Add an e2e test case in [seatunnel-e2e](https://github.com/apache/seatunnel/tree/dev/seatunnel-e2e/seatunnel-connector-v2-e2e/)
     5. Update the connector [plugin_config](https://github.com/apache/seatunnel/blob/dev/config/plugin_config)
   
   ### Related Issue
   
   #9715 

