[GitHub] XuQianJin-Stars commented on a change in pull request #6823: [FLINK-10134] UTF-16 support for TextInputFormat bug refixed
XuQianJin-Stars commented on a change in pull request #6823: [FLINK-10134] UTF-16 support for TextInputFormat bug refixed
URL: https://github.com/apache/flink/pull/6823#discussion_r227739035

## File path: flink-core/src/main/java/org/apache/flink/api/common/io/DelimitedInputFormat.java

@@ -472,6 +498,7 @@ public void open(FileInputSplit split) throws IOException {
 		this.offset = splitStart;
 		if (this.splitStart != 0) {
+			setBomFileCharset(split);

Review comment:
@fhueske Even if we add `readFileHeader()` to `FileInputFormat`, it still has to read the 4 bytes of the BOM header through the stream. I think it is fine to open the `stream` in `DelimitedInputFormat` and process it there. In addition, for a stream wrapped by `InputStreamFSInputWrapper`, I need to open the stream, read 4 bytes, and then close it again. But `fillBuffer(0)` also opens and closes the stream, so the stream would be opened twice. This is my question.

This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

With regards,
Apache Git Services
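For illustration, the open / read 4 bytes / close sequence described in this comment could look roughly like the sketch below. It is not code from the pull request: `HeaderReader`, `readHeader`, and the `Supplier`-based opener are hypothetical names, and a plain `java.io.InputStream` stands in for Flink's file stream.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.function.Supplier;

// Hypothetical helper, not part of Flink.
final class HeaderReader {

    // Opens a fresh stream, reads at most maxBytes from its start, and closes it again.
    static byte[] readHeader(Supplier<InputStream> opener, int maxBytes) {
        byte[] header = new byte[maxBytes];
        int filled = 0;
        try (InputStream in = opener.get()) {
            // A single read() may return fewer bytes than requested, so loop until full or EOF.
            while (filled < maxBytes) {
                int n = in.read(header, filled, maxBytes - filled);
                if (n < 0) {
                    break; // file is shorter than maxBytes
                }
                filled += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Return only the bytes that were actually read.
        return Arrays.copyOf(header, filled);
    }
}
```

The cost the comment worries about is visible here: the helper opens and closes its own stream, so any later read of the same file has to open it a second time.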
[GitHub] XuQianJin-Stars commented on a change in pull request #6823: [FLINK-10134] UTF-16 support for TextInputFormat bug refixed
XuQianJin-Stars commented on a change in pull request #6823: [FLINK-10134] UTF-16 support for TextInputFormat bug refixed
URL: https://github.com/apache/flink/pull/6823#discussion_r226598262

## File path: flink-core/src/main/java/org/apache/flink/api/common/io/DelimitedInputFormat.java

@@ -472,6 +498,7 @@ public void open(FileInputSplit split) throws IOException {
 		this.offset = splitStart;
 		if (this.splitStart != 0) {
+			setBomFileCharset(split);

Review comment:
`DelimitedInputFormat` fills its `readBuffer` when `fillBuffer(0)` is called, so we can copy the first 4 bytes from `readBuffer` instead of reading them separately. This should not have a big impact. For example:

`System.arraycopy(this.readBuffer, 0, bomBuffer, 0, 4);`
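As a sketch of what could be done with those 4 copied bytes, the class below maps a buffer prefix to a charset name using the standard Unicode BOM byte sequences. `BomSniffer` and `detect` are hypothetical names, not the pull request's `setBomFileCharset`; the `available` parameter is an assumption that guards against buffers holding fewer than 4 valid bytes.

```java
// Hypothetical helper, not part of Flink.
final class BomSniffer {

    private static boolean startsWith(byte[] buf, int available, int... bom) {
        if (available < bom.length || buf.length < bom.length) {
            return false;
        }
        for (int i = 0; i < bom.length; i++) {
            if ((buf[i] & 0xFF) != bom[i]) {
                return false;
            }
        }
        return true;
    }

    // Returns the charset name implied by a leading BOM, or null if none is present.
    static String detect(byte[] buf, int available) {
        // UTF-32 is checked first: the UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).
        if (startsWith(buf, available, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(buf, available, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(buf, available, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(buf, available, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(buf, available, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }
}
```

One caveat of this ordering: a UTF-16LE file whose first character after the BOM is U+0000 would be reported as UTF-32LE. That case is rare in text files, but it is a limitation of any detection based on the first 4 bytes alone.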
[GitHub] XuQianJin-Stars commented on a change in pull request #6823: [FLINK-10134] UTF-16 support for TextInputFormat bug refixed
XuQianJin-Stars commented on a change in pull request #6823: [FLINK-10134] UTF-16 support for TextInputFormat bug refixed
URL: https://github.com/apache/flink/pull/6823#discussion_r226516362

## File path: flink-core/src/main/java/org/apache/flink/api/common/io/DelimitedInputFormat.java

@@ -472,6 +498,7 @@ public void open(FileInputSplit split) throws IOException {
 		this.offset = splitStart;
 		if (this.splitStart != 0) {
+			setBomFileCharset(split);

Review comment:
I have two questions about this commit, as follows:

For the first suggestion, I feel that users often cannot know the encoding of a file accurately. For example, if the file is encoded as `UTF-16LE` with a BOM header and the user specifies `UTF-16BE`, reading it will report an error. I also believe that UTF files with a BOM will be the majority. So I think it is necessary to do the BOM-based encoding detection first, which is better for the user experience.

For the fourth recommendation, `GenericCsvInputFormat` cannot seek to position 0. It calls the `seek` method of `InputStreamFSInputWrapper`, and that method currently cannot seek to position 0.
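The seek limitation in the second point can be illustrated with a forward-only wrapper over a plain `java.io.InputStream`. This is a hypothetical sketch, not the source of `InputStreamFSInputWrapper`: it assumes the wrapped stream can only skip forward, so once the BOM bytes have been consumed, position 0 is no longer reachable without reopening the stream.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical forward-only wrapper, not Flink's InputStreamFSInputWrapper.
final class ForwardOnlyInput {
    private final InputStream in;
    private long pos;

    ForwardOnlyInput(InputStream in) {
        this.in = in;
    }

    int read() throws IOException {
        int b = in.read();
        if (b >= 0) {
            pos++;
        }
        return b;
    }

    long getPos() {
        return pos;
    }

    // Seeking is implemented by skipping, so it can only move forward.
    void seek(long desired) throws IOException {
        if (desired < pos) {
            throw new IllegalStateException(
                "cannot seek backwards from " + pos + " to " + desired);
        }
        while (pos < desired) {
            long skipped = in.skip(desired - pos);
            if (skipped > 0) {
                pos += skipped;
            } else if (in.read() >= 0) {
                pos++; // skip() made no progress; fall back to reading one byte
            } else {
                throw new EOFException("reached end of stream at " + pos);
            }
        }
    }
}
```

Under this assumption, reading the first bytes to look for a BOM and then calling `seek(0)` fails, which is why the header either has to be read through a separately opened stream or taken from a buffer that has already been filled.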