Re: [PR] feat(tools): support Parquet and Arrow import with schema/auto modes [tsfile]

via GitHub Thu, 30 Apr 2026 00:54:38 -0700


gengziyand commented on code in PR #793:
URL: https://github.com/apache/tsfile/pull/793#discussion_r3166426909



##########
java/tools/README.md:
##########
@@ -119,9 +104,108 @@ Time INT64,
 Temperature FLOAT,
 Emission DOUBLE,
 ```
-## Commands
 
+In this example:
+- `Group` is a virtual tag column (not in CSV) with default value `Datang`
+- `Region`, `FactoryNumber`, `DeviceNumber` are tag columns read from CSV
+- `Model` and `MaintenanceCycle` are skipped via `SKIP`
+- `Temperature` and `Emission` are automatically derived as FIELD columns
+
+For Parquet / Arrow in schema mode, `source_columns` matches by column **name** instead of position. Named SKIP is also supported:
+```
+source_columns
+Time INT64,
+unused_col SKIP,
+Temperature FLOAT,
+Emission DOUBLE,
+```
+
+## CLI Parameters
+
+| Parameter | Description | Required | Default |
+|-----------|------------|----------|---------|
+| -s, --source | Input file or directory | Yes | |
+| -t, --target | Output directory | Yes | |
+| --schema | Schema file path. Omit for auto mode. | No | |
+| --fail_dir | Directory for failed source files | No | failed |
+| --format | Source format: csv / parquet / arrow. Auto-detected by file extension if omitted. | No | auto-detect |
+| --table_name | Table name override (auto mode) | No | derived from filename |
+| --time_precision | Time precision override (auto mode): ms / us / ns / s | No | ms |
+| --separator | CSV delimiter (auto mode): , / tab / ; | No | , |
+| -b, --block_size | CSV chunk size (e.g. 256M, 1G) | No | 256M |
+| -tn, --thread_num | Thread count for parallel processing | No | 8 |
+
+## Modes
+
+### Schema Mode
+
+Provide a `--schema` file to explicitly define column mapping, types, tags, and time column.
+
+```sh
+# CSV
+csv2tsfile.sh --source ./data/csv --target ./output --fail_dir ./failed --schema ./schema/import.schema
+csv2tsfile.bat --source .\data\csv --target .\output --fail_dir .\failed --schema .\schema\import.schema
+
+# Parquet
+parquet2tsfile.sh --source ./data/parquet --target ./output --fail_dir ./failed --schema ./schema/import.schema
+parquet2tsfile.bat --source .\data\parquet --target .\output --fail_dir .\failed --schema .\schema\import.schema
+
+# Arrow
+arrow2tsfile.sh --source ./data/arrow --target ./output --fail_dir ./failed --schema ./schema/import.schema
+arrow2tsfile.bat --source .\data\arrow --target .\output --fail_dir .\failed --schema .\schema\import.schema
+```
+
+### Auto Mode
+
+Omit `--schema` to automatically infer column types and detect the time column.
+
+**Auto mode rules:**
+- Time column: must be named exactly `time` or `TIME` (case-sensitive, strict match)
+- All other columns become FIELD (no tag inference)
+- CSV type inference uses a 100-row sampling window with promotion chain: `BOOLEAN → INT64 → DOUBLE → STRING`

Review Comment:
   Fixed. I updated both README.md and README-zh.md so the documented CSV 
type-promotion behavior now matches the actual implementation.
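
For readers following this thread, the auto mode rules quoted in the hunk above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this PR: the class `TypePromotionSketch`, its methods, and the parsing rules (for example, which strings count as BOOLEAN) are hypothetical, and the chain used is the one shown in the quoted hunk, which may differ from the final documented behavior. The sketch picks the narrowest type in the chain that accepts every sampled value.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the documented auto mode rules; not the PR's implementation.
public class TypePromotionSketch {

    // Declaration order mirrors the chain quoted in the hunk: BOOLEAN, INT64, DOUBLE, STRING.
    enum InferredType { BOOLEAN, INT64, DOUBLE, STRING }

    // Sampling window size taken from the quoted README text.
    static final int SAMPLE_ROWS = 100;

    // Returns true if the cell can be read as the given type (assumed parsing rules).
    static boolean accepts(InferredType type, String cell) {
        String v = cell.trim();
        switch (type) {
            case BOOLEAN:
                return v.equalsIgnoreCase("true") || v.equalsIgnoreCase("false");
            case INT64:
                try {
                    Long.parseLong(v);
                    return true;
                } catch (NumberFormatException e) {
                    return false;
                }
            case DOUBLE:
                try {
                    Double.parseDouble(v);
                    return true;
                } catch (NumberFormatException e) {
                    return false;
                }
            default:
                return true; // STRING accepts anything
        }
    }

    // Picks the narrowest type in the chain that accepts every sampled cell.
    static InferredType infer(List<String> columnValues) {
        if (columnValues.isEmpty()) {
            return InferredType.STRING; // assumption: no evidence means STRING
        }
        int limit = Math.min(SAMPLE_ROWS, columnValues.size());
        List<String> sample = columnValues.subList(0, limit);
        for (InferredType candidate : InferredType.values()) {
            boolean allMatch = true;
            for (String cell : sample) {
                if (!accepts(candidate, cell)) {
                    allMatch = false;
                    break;
                }
            }
            if (allMatch) {
                return candidate;
            }
        }
        return InferredType.STRING;
    }

    // Strict, case-sensitive match on the two names listed in the auto mode rules.
    static boolean isTimeColumn(String name) {
        return name.equals("time") || name.equals("TIME");
    }

    public static void main(String[] args) {
        System.out.println(infer(Arrays.asList("1", "2", "3")));
        System.out.println(infer(Arrays.asList("1", "2.5")));
        System.out.println(isTimeColumn("Time"));
    }
}
```

In this reading, a column mixing `true` and `1` falls through to STRING because no narrower type accepts both values. Whether the actual importer treats such a column the same way is not stated in this thread.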



