Copilot commented on code in PR #10383:
URL: https://github.com/apache/seatunnel/pull/10383#discussion_r2719407967


##########
seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/CsvReadStrategy.java:
##########
@@ -117,22 +119,21 @@ public void readProcess(
         if (enableSplitFile && split.getLength() > -1) {
             actualInputStream = safeSlice(inputStream, split.getStart(), 
split.getLength());
         }
+        BOMInputStream bomIn = new BOMInputStream(actualInputStream);
+        Charset charset =
+                bomIn.getBOM() == null
+                        ? Charset.forName(encoding)
+                        : Charset.forName(bomIn.getBOM().getCharsetName());

Review Comment:
   The charset detection logic assumes that if a BOM is detected, the BOM's 
charset name will be valid. However, there's no null check on 
bomIn.getBOM().getCharsetName() when bomIn.getBOM() is not null. If getBOM() 
returns a non-null value but getCharsetName() returns null or an invalid 
charset name, Charset.forName() will throw an exception. Consider adding 
defensive checks or using a try-catch to handle potential 
IllegalCharsetNameException or UnsupportedCharsetException.



##########
seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/CsvReadStrategy.java:
##########
@@ -117,22 +119,21 @@ public void readProcess(
         if (enableSplitFile && split.getLength() > -1) {
             actualInputStream = safeSlice(inputStream, split.getStart(), 
split.getLength());
         }
+        BOMInputStream bomIn = new BOMInputStream(actualInputStream);
+        Charset charset =
+                bomIn.getBOM() == null
+                        ? Charset.forName(encoding)
+                        : Charset.forName(bomIn.getBOM().getCharsetName());
+

Review Comment:
   The BOMInputStream should be included in the try-with-resources block to 
ensure proper resource management. Currently, if an exception occurs between 
line 122 and line 135 (before the try block starts), the BOMInputStream will 
not be closed properly. Additionally, the finally block at lines 201-207 is 
redundant because when BufferedReader (which wraps bomIn) is closed in the 
try-with-resources, it will automatically close the underlying BOMInputStream.



##########
seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/CsvReadStrategy.java:
##########
@@ -189,6 +198,12 @@ public void readProcess(
                             currentFileName);
             throw new FileConnectorException(
                     FileConnectorErrorCode.DATA_DESERIALIZE_FAILED, errorMsg, 
e);
+        } finally {
+            try {
+                bomIn.close();
+            } catch (IOException e) {
+                log.warn("Close BOMInputStream failed for file: {}", 
currentFileName, e);
+            }

Review Comment:
   The finally block that attempts to close BOMInputStream is unnecessary and 
potentially problematic. When the BufferedReader is closed by the 
try-with-resources statement, it automatically closes all underlying streams 
including the BOMInputStream. This finally block creates redundant close 
operations and could mask exceptions from the main try block. Additionally, if 
an exception occurs before the try block starts (e.g., during BOM detection), 
the variable 'bomIn' could be in an inconsistent state, though this is less 
likely with the current implementation.
   ```suggestion
   
   ```



##########
seatunnel-connectors-v2/connector-file/connector-file-base/src/test/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/CsvReadStrategyTest.java:
##########
@@ -141,6 +141,35 @@ public void testSpecialQuoteCharForCsvRead() throws 
Exception {
         }
     }
 
+    @Test
+    public void testUtf8BomCsvWithHeaderRead() throws Exception {
+        URL resource = 
CsvReadStrategyTest.class.getResource("/csv/utf8_bom.csv");
+        String path = Paths.get(resource.toURI()).toString();
+        CsvReadStrategy csvReadStrategy = new CsvReadStrategy();
+        LocalConf localConf = new LocalConf(FS_DEFAULT_NAME_DEFAULT);
+        csvReadStrategy.init(localConf);
+        csvReadStrategy.getFileNamesByPath(path);
+        
csvReadStrategy.setPluginConfig(ConfigFactory.parseMap(getOptionsWithHeader()));
+        csvReadStrategy.setCatalogTable(
+                CatalogTableUtil.getCatalogTable(
+                        "test",
+                        new SeaTunnelRowType(
+                                new String[] {"id", "name", "age"},
+                                new SeaTunnelDataType[] {
+                                    BasicType.INT_TYPE, BasicType.STRING_TYPE, 
BasicType.INT_TYPE
+                                })));

Review Comment:
   The test schema only defines three columns (id, name, age) but the CSV file 
has four columns (id, name, age, gender). While this is intentional to test 
that the reader can handle a subset of columns, it would be more robust to also 
test that all four columns can be read correctly to ensure the BOM handling 
doesn't affect any column, not just the first three. Consider adding an 
assertion that verifies the schema can include all columns from the BOM-encoded 
CSV.



##########
seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/CsvReadStrategy.java:
##########
@@ -145,15 +146,23 @@ public void readProcess(
                     }
                 }
             }
-            // read lines
+            // read header lines
             List<String> headers = getHeaders(csvParser);
+            // Clean up BOM characters (\ uFEFF) in the header to solve 
occasional BOM residue
+            // issues
+            List<String> cleanedHeaders =
+                    headers.stream()
+                            .map(header -> header.replace("\uFEFF", "").trim())
+                            .collect(Collectors.toList());

Review Comment:
   The header cleaning logic (removing BOM and trimming) is applied to all 
headers regardless of their source. When firstLineAsHeader is false, headers 
come from the catalog table schema (see getHeaders method at line 224), not 
from the file. In this case, cleaning BOM characters and trimming is 
unnecessary and could modify headers incorrectly. The cleaning should only be 
applied when headers are read from the CSV file (i.e., when firstLineAsHeader 
is true). Consider wrapping this cleaning logic in a conditional check.
   ```suggestion
               // Clean up BOM characters (\uFEFF) in the header to solve 
occasional BOM residue
               // issues. This should only be applied when headers come from 
the CSV file
               // (i.e., when firstLineAsHeader is true).
               List<String> cleanedHeaders;
               if (firstLineAsHeader) {
                   cleanedHeaders =
                           headers.stream()
                                   .map(header -> header.replace("\uFEFF", 
"").trim())
                                   .collect(Collectors.toList());
               } else {
                   cleanedHeaders = headers;
               }
   ```



##########
seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/CsvReadStrategy.java:
##########
@@ -44,6 +44,7 @@
 import org.apache.commons.csv.CSVFormat.Builder;
 import org.apache.commons.csv.CSVParser;
 import org.apache.commons.csv.CSVRecord;
+import org.apache.commons.io.input.BOMInputStream;

Review Comment:
   The code imports and uses BOMInputStream from commons-io, but commons-io is 
not explicitly declared as a dependency in the pom.xml for this module. While 
it may be available as a transitive dependency, it's a best practice to 
explicitly declare direct dependencies. This ensures the dependency is 
available and makes the build more maintainable and explicit about what the 
module requires.



##########
seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/CsvReadStrategy.java:
##########
@@ -145,15 +146,23 @@ public void readProcess(
                     }
                 }
             }
-            // read lines
+            // read header lines
             List<String> headers = getHeaders(csvParser);
+            // Clean up BOM characters (\ uFEFF) in the header to solve 
occasional BOM residue
+            // issues
+            List<String> cleanedHeaders =
+                    headers.stream()
+                            .map(header -> header.replace("\uFEFF", "").trim())

Review Comment:
   The header cleaning logic applies trim() to all headers, which could affect 
headers that intentionally have leading or trailing whitespace. While cleaning 
BOM characters (\uFEFF) is correct, the unconditional trim() might change the 
intended behavior for CSV files where column names legitimately have 
whitespace. Consider only removing the BOM character without trimming, or 
document this behavior change if it's intentional.
   ```suggestion
               // issues, but preserve any intentional leading/trailing 
whitespace
               List<String> cleanedHeaders =
                       headers.stream()
                               .map(header -> header.replace("\uFEFF", ""))
   ```



##########
seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/CsvReadStrategy.java:
##########
@@ -145,15 +146,23 @@ public void readProcess(
                     }
                 }
             }
-            // read lines
+            // read header lines
             List<String> headers = getHeaders(csvParser);
+            // Clean up BOM characters (\ uFEFF) in the header to solve 
occasional BOM residue

Review Comment:
   The comment contains an escaped backslash that makes it harder to read. The 
comment should be "Clean up BOM characters (\uFEFF)" without the extra 
backslash before the 'u'.
   ```suggestion
               // Clean up BOM characters (\uFEFF) in the header to solve 
occasional BOM residue
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to