Copilot commented on code in PR #10383:
URL: https://github.com/apache/seatunnel/pull/10383#discussion_r2719407967
##########
seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/CsvReadStrategy.java:
##########
@@ -117,22 +119,21 @@ public void readProcess(
if (enableSplitFile && split.getLength() > -1) {
actualInputStream = safeSlice(inputStream, split.getStart(),
split.getLength());
}
+ BOMInputStream bomIn = new BOMInputStream(actualInputStream);
+ Charset charset =
+ bomIn.getBOM() == null
+ ? Charset.forName(encoding)
+ : Charset.forName(bomIn.getBOM().getCharsetName());
Review Comment:
The charset detection logic assumes that if a BOM is detected, the BOM's
charset name will be valid. However, there's no null check on
bomIn.getBOM().getCharsetName() when bomIn.getBOM() is not null. If getBOM()
returns a non-null value but getCharsetName() returns null or an invalid
charset name, Charset.forName() will throw an exception. Consider adding
defensive checks or using a try-catch to handle potential
IllegalCharsetNameException or UnsupportedCharsetException.
##########
seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/CsvReadStrategy.java:
##########
@@ -117,22 +119,21 @@ public void readProcess(
if (enableSplitFile && split.getLength() > -1) {
actualInputStream = safeSlice(inputStream, split.getStart(),
split.getLength());
}
+ BOMInputStream bomIn = new BOMInputStream(actualInputStream);
+ Charset charset =
+ bomIn.getBOM() == null
+ ? Charset.forName(encoding)
+ : Charset.forName(bomIn.getBOM().getCharsetName());
+
Review Comment:
The BOMInputStream should be included in the try-with-resources block to
ensure proper resource management. Currently, if an exception occurs between
line 122 and line 135 (before the try block starts), the BOMInputStream will
not be closed properly. Additionally, the finally block at lines 201-207 is
redundant because when BufferedReader (which wraps bomIn) is closed in the
try-with-resources, it will automatically close the underlying BOMInputStream.
##########
seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/CsvReadStrategy.java:
##########
@@ -189,6 +198,12 @@ public void readProcess(
currentFileName);
throw new FileConnectorException(
FileConnectorErrorCode.DATA_DESERIALIZE_FAILED, errorMsg,
e);
+ } finally {
+ try {
+ bomIn.close();
+ } catch (IOException e) {
+ log.warn("Close BOMInputStream failed for file: {}",
currentFileName, e);
+ }
Review Comment:
The finally block that attempts to close BOMInputStream is unnecessary and
potentially problematic. When the BufferedReader is closed by the
try-with-resources statement, it automatically closes all underlying streams
including the BOMInputStream. This finally block creates redundant close
operations and could mask exceptions from the main try block. Additionally, if
an exception occurs before the try block starts (e.g., during BOM detection),
the variable 'bomIn' could be in an inconsistent state, though this is less
likely with the current implementation.
```suggestion
```
##########
seatunnel-connectors-v2/connector-file/connector-file-base/src/test/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/CsvReadStrategyTest.java:
##########
@@ -141,6 +141,35 @@ public void testSpecialQuoteCharForCsvRead() throws
Exception {
}
}
+ @Test
+ public void testUtf8BomCsvWithHeaderRead() throws Exception {
+ URL resource =
CsvReadStrategyTest.class.getResource("/csv/utf8_bom.csv");
+ String path = Paths.get(resource.toURI()).toString();
+ CsvReadStrategy csvReadStrategy = new CsvReadStrategy();
+ LocalConf localConf = new LocalConf(FS_DEFAULT_NAME_DEFAULT);
+ csvReadStrategy.init(localConf);
+ csvReadStrategy.getFileNamesByPath(path);
+
csvReadStrategy.setPluginConfig(ConfigFactory.parseMap(getOptionsWithHeader()));
+ csvReadStrategy.setCatalogTable(
+ CatalogTableUtil.getCatalogTable(
+ "test",
+ new SeaTunnelRowType(
+ new String[] {"id", "name", "age"},
+ new SeaTunnelDataType[] {
+ BasicType.INT_TYPE, BasicType.STRING_TYPE,
BasicType.INT_TYPE
+ })));
Review Comment:
The test schema only defines three columns (id, name, age) but the CSV file
has four columns (id, name, age, gender). While this is intentional to test
that the reader can handle a subset of columns, it would be more robust to also
test that all four columns can be read correctly to ensure the BOM handling
doesn't affect any column, not just the first three. Consider adding an
assertion that verifies the schema can include all columns from the BOM-encoded
CSV.
##########
seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/CsvReadStrategy.java:
##########
@@ -145,15 +146,23 @@ public void readProcess(
}
}
}
- // read lines
+ // read header lines
List<String> headers = getHeaders(csvParser);
+ // Clean up BOM characters (\ uFEFF) in the header to solve
occasional BOM residue
+ // issues
+ List<String> cleanedHeaders =
+ headers.stream()
+ .map(header -> header.replace("\uFEFF", "").trim())
+ .collect(Collectors.toList());
Review Comment:
The header cleaning logic (removing BOM and trimming) is applied to all
headers regardless of their source. When firstLineAsHeader is false, headers
come from the catalog table schema (see getHeaders method at line 224), not
from the file. In this case, cleaning BOM characters and trimming is
unnecessary and could modify headers incorrectly. The cleaning should only be
applied when headers are read from the CSV file (i.e., when firstLineAsHeader
is true). Consider wrapping this cleaning logic in a conditional check.
```suggestion
// Clean up BOM characters (\uFEFF) in the header to solve
occasional BOM residue
// issues. This should only be applied when headers come from
the CSV file
// (i.e., when firstLineAsHeader is true).
List<String> cleanedHeaders;
if (firstLineAsHeader) {
cleanedHeaders =
headers.stream()
.map(header -> header.replace("\uFEFF",
"").trim())
.collect(Collectors.toList());
} else {
cleanedHeaders = headers;
}
```
##########
seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/CsvReadStrategy.java:
##########
@@ -44,6 +44,7 @@
import org.apache.commons.csv.CSVFormat.Builder;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;
+import org.apache.commons.io.input.BOMInputStream;
Review Comment:
The code imports and uses BOMInputStream from commons-io, but commons-io is
not explicitly declared as a dependency in the pom.xml for this module. While
it may be available as a transitive dependency, it's a best practice to
explicitly declare direct dependencies. This ensures the dependency is
available and makes the build more maintainable and explicit about what the
module requires.
##########
seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/CsvReadStrategy.java:
##########
@@ -145,15 +146,23 @@ public void readProcess(
}
}
}
- // read lines
+ // read header lines
List<String> headers = getHeaders(csvParser);
+ // Clean up BOM characters (\ uFEFF) in the header to solve
occasional BOM residue
+ // issues
+ List<String> cleanedHeaders =
+ headers.stream()
+ .map(header -> header.replace("\uFEFF", "").trim())
Review Comment:
The header cleaning logic applies trim() to all headers, which could affect
headers that intentionally have leading or trailing whitespace. While cleaning
BOM characters (\uFEFF) is correct, the unconditional trim() might change the
intended behavior for CSV files where column names legitimately have
whitespace. Consider only removing the BOM character without trimming, or
document this behavior change if it's intentional.
```suggestion
// issues, but preserve any intentional leading/trailing
whitespace
List<String> cleanedHeaders =
headers.stream()
.map(header -> header.replace("\uFEFF", ""))
```
##########
seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/CsvReadStrategy.java:
##########
@@ -145,15 +146,23 @@ public void readProcess(
}
}
}
- // read lines
+ // read header lines
List<String> headers = getHeaders(csvParser);
+ // Clean up BOM characters (\ uFEFF) in the header to solve
occasional BOM residue
Review Comment:
The comment contains an escaped backslash that makes it harder to read. The
comment should be "Clean up BOM characters (\uFEFF)" without the extra
backslash before the 'u'.
```suggestion
// Clean up BOM characters (\uFEFF) in the header to solve
occasional BOM residue
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]