Re: [PR] [Feature][Transform] Introduce tika transform [seatunnel]

via GitHub Sun, 19 Oct 2025 01:14:21 -0700


Hisoka-X commented on code in PR #9862:
URL: https://github.com/apache/seatunnel/pull/9862#discussion_r2442908870



##########
docs/zh/transform-v2/tikadocument.md:
##########
@@ -0,0 +1,307 @@
+# TikaDocument
+
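+> TikaDocument Transform Plugin
+
+## Description
+
+The `TikaDocument` transform plugin uses [Apache Tika](https://tika.apache.org/) to extract text content and metadata from various document formats, including PDF, Microsoft Office documents (Word, Excel, PowerPoint), plain text, HTML, XML, and many other file formats. This transform converts binary document data into structured text content and metadata fields.
+
+The plugin supports comprehensive error handling and content processing options, and can handle both binary data and Base64-encoded document content.
+
+For more information about Apache Tika and the supported formats, see the [Apache Tika documentation](https://tika.apache.org/2.9.2/index.html).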
+
+## Properties
+
+| name                                    | type   | required | default value  | description                                                              |
+|-----------------------------------------|--------|----------|----------------|--------------------------------------------------------------------------|
+| source_field                            | string | yes      | -              | Name of the source field containing the document data (binary or Base64) |
+| output_fields                           | map    | no       | auto-generated | Mapping from extracted content to output field names                     |
+| parse_options.extract_text              | bool   | no       | true           | Whether to extract text content from the document                        |
+| parse_options.extract_metadata          | bool   | no       | true           | Whether to extract document metadata                                     |
+| parse_options.max_string_length         | int    | no       | 10000          | Maximum length of the extracted text content                             |
+| content_processing.remove_empty_lines   | bool   | no       | false          | Whether to remove empty lines from the extracted text                    |
+| content_processing.trim_whitespace      | bool   | no       | false          | Whether to trim whitespace from the extracted text                       |
+| content_processing.normalize_whitespace | bool   | no       | false          | Whether to normalize multiple whitespace characters into a single space  |
+| content_processing.min_content_length   | int    | no       | 0              | Minimum content length threshold (shorter content is skipped)            |
+| error_handling.on_parse_error           | enum   | no       | skip           | How to handle document parsing errors: `fail`, `skip`, `null`            |
+| error_handling.on_unsupported_format    | enum   | no       | skip           | How to handle unsupported document formats: `fail`, `skip`, `null`       |
+| error_handling.log_errors               | bool   | no       | false          | Whether to log error messages                                            |
+| timeout_ms                              | long   | no       | 30000          | Document processing timeout in milliseconds                              |
+
+### common options [string]
+
+Common parameters of transform plugins; see [Transform Plugin](common-options.md) for details.
+
+### source_field [string]
+
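+The name of the input field containing the document data. The field should contain one of the following:
+- Binary document data (byte array)
+- Base64-encoded document data (string)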
+
+### output_fields [map]
+
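+A mapping that specifies which extracted fields should be output and their corresponding field names. If not specified, the plugin automatically generates output fields based on the parse options.
+
+**Default output fields:**
+```hocon
+output_fields {
+    content = "extracted_text"        # Extracted text content
+    content_type = "mime_type"        # MIME type of the document
+    title = "doc_title"               # Document title (if available)
+}
+```
+
+**Custom output fields:**
+```hocon
+output_fields {
+    content = "document_content"      # Document content
+    content_type = "file_type"        # File type
+    title = "document_title"          # Document title
+    author = "document_author"        # Document author
+    subject = "document_subject"      # Document subject
+    keywords = "document_keywords"    # Document keywords
+    language = "document_language"    # Document language
+    created_date = "creation_date"    # Creation date
+    modified_date = "modification_date" # Modification date
+    metadata = "all_metadata"         # All metadata
+}
+```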
+```
+
+### parse_options
+
+#### extract_text [bool]
+
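+Whether to extract text content from the document. When enabled, the plugin extracts the readable text from the document.
+
+#### extract_metadata [bool]
+
+Whether to extract document metadata such as title, author, and creation date.
+
+#### max_string_length [int]
+
+Maximum length of the extracted text content. Text exceeding this limit is truncated.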
+
+### content_processing
+
+#### remove_empty_lines [bool]
+
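+Whether to remove empty lines from the extracted text content.
+
+#### trim_whitespace [bool]
+
+Whether to trim leading and trailing whitespace from the extracted text.
+
+#### normalize_whitespace [bool]
+
+Whether to normalize multiple consecutive whitespace characters into a single space.
+
+#### min_content_length [int]
+
+Minimum length threshold for extracted content. Content shorter than this is treated as invalid and handled according to the error handling strategy.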
+
+### error_handling
+
+#### on_parse_error [enum]
+
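+Specifies how to handle document parsing errors:
+- `fail`: throw an exception and stop processing
+- `skip`: skip the current row and continue processing
+- `null`: fill the output fields with null values
+
+#### on_unsupported_format [enum]
+
+Specifies how to handle unsupported document formats:
+- `fail`: throw an exception and stop processing
+- `skip`: skip the current row and continue processing
+- `null`: fill the output fields with null values
+
+#### log_errors [bool]
+
+Whether to log detailed error messages when processing fails.
+
+### timeout_ms [long]
+
+Document processing timeout in milliseconds. If processing a document takes longer than this, it is terminated and handled according to the error handling strategy.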
+
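+## Supported Document Formats
+
+The TikaDocument transform supports a wide range of document formats through Apache Tika:
+
+- **Text formats**: TXT, RTF, CSV
+- **PDF documents**: PDF
+- **Microsoft Office**: DOC, DOCX, XLS, XLSX, PPT, PPTX
+- **OpenOffice/LibreOffice**: ODT, ODS, ODP
+- **Web formats**: HTML, XML, XHTML
+- **Archive formats**: ZIP, TAR, GZIP
+- **Image formats** (if OCR is supported): JPEG, PNG, TIFF, GIF
+- **Email formats**: MSG, EML, MBOX
+- **E-book formats**: EPUB, MOBI
+- **And many more**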
+
+## Examples
+
+### Basic Document Processing
+
+```hocon
+transform {
+  TikaDocument {
+    source_field = "document_data"
+    output_fields = {
+      content = "extracted_text"
+      content_type = "mime_type"
+    }
+  }
+}
+```
+
+### Advanced Configuration with Content Processing
+
+```hocon
+transform {
+  TikaDocument {
+    source_field = "file_content"
+    output_fields = {
+      content = "document_text"
+      content_type = "file_type"
+      title = "doc_title"
+      author = "doc_author"
+      metadata = "all_metadata"
+    }
+    parse_options = {
+      extract_text = true
+      extract_metadata = true
+      max_string_length = 50000
+    }
+    content_processing = {
+      remove_empty_lines = true
+      trim_whitespace = true
+      normalize_whitespace = true
+      min_content_length = 10
+    }
+    error_handling = {
+      on_parse_error = "skip"
+      on_unsupported_format = "null"
+      log_errors = true
+    }
+    timeout_ms = 60000
+  }
+}
+```
+
+### Multi-Table Processing
+
+```hocon
+transform {
+  TikaDocument {
+    source_field = "document_data"
+    output_fields = {
+      content = "extracted_content"
+      content_type = "document_type"
+    }
+    multi_tables = true

Review Comment:
   Why do we need the `multi_tables` field? Other transforms that support multi-table processing do not have this field. Please refer to
https://seatunnel.apache.org/docs/2.3.12/transform-v2/transform-multi-table
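
   For comparison, the multi-table documentation linked above selects and overrides tables with the common options `table_match_regex` and `table_transform` rather than a boolean switch. The fragment below is only a sketch of how `TikaDocument` might look if it followed that pattern; that the plugin honors these options is an assumption, and the table path `docs.reports` and the field `report_file` are hypothetical.

```hocon
transform {
  TikaDocument {
    # Assumption: only tables whose path matches this regex are transformed.
    table_match_regex = "docs.*"
    source_field = "document_data"
    output_fields = {
      content = "extracted_content"
      content_type = "document_type"
    }
    # Assumption: per-table override; the outer settings would not apply
    # to the table named here (hypothetical table and field names).
    table_transform = [{
      table_path = "docs.reports"
      source_field = "report_file"
    }]
  }
}
```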



##########
docs/en/transform-v2/tikadocument.md:
##########
@@ -0,0 +1,307 @@
+# TikaDocument
+
+> TikaDocument Transform Plugin
+
+## Description
+
+The `TikaDocument` transform plugin uses [Apache Tika](https://tika.apache.org/) to extract text content and metadata from various document formats including PDF, Microsoft Office documents (Word, Excel, PowerPoint), plain text, HTML, XML, and many other file formats. This transform converts binary document data into structured text content and metadata fields.

Review Comment:
   > This transform converts binary document
   
   Users can easily overlook this part. I think we should add a separate section describing which source configurations can be used with this transform.
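
   As an illustration of what such a section could show, here is a hedged sketch that pairs a file source reading in `binary` format with this transform. It assumes that the binary format exposes the file bytes in a field named `data` and that `TikaDocument` accepts that field as `source_field`; neither is confirmed in this thread, and the path and table names are hypothetical.

```hocon
source {
  LocalFile {
    path = "/data/documents"        # hypothetical directory
    file_format_type = "binary"     # read files as raw bytes
    plugin_output = "raw_files"     # hypothetical table name
  }
}

transform {
  TikaDocument {
    plugin_input = "raw_files"
    plugin_output = "parsed_files"  # hypothetical table name
    # Assumption: the binary format stores file bytes in a field named "data".
    source_field = "data"
    output_fields = {
      content = "extracted_text"
      content_type = "mime_type"
    }
  }
}
```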


