[jira] [Assigned] (FLINK-10134) UTF-16 support for TextInputFormat
[ https://issues.apache.org/jira/browse/FLINK-10134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

lihongli reassigned FLINK-10134:
--------------------------------

    Assignee:     (was: lihongli)

> UTF-16 support for TextInputFormat
> ----------------------------------
>
>                 Key: FLINK-10134
>                 URL: https://issues.apache.org/jira/browse/FLINK-10134
>             Project: Flink
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.4.2
>            Reporter: David Dreyfus
>            Priority: Blocker
>              Labels: pull-request-available
>
> It does not appear that Flink supports a charset encoding of "UTF-16". In
> particular, it doesn't appear that Flink consumes the Byte Order Mark (BOM)
> to establish whether a UTF-16 file is UTF-16LE or UTF-16BE.
>
> TextInputFormat.setCharset("UTF-16") calls DelimitedInputFormat.setCharset(),
> which sets TextInputFormat.charsetName and then modifies the previously set
> delimiterString to construct the proper byte string encoding of the
> delimiter. This same charsetName is also used in TextInputFormat.readRecord()
> to interpret the bytes read from the file.
>
> There are two problems that this implementation would seem to have when using
> UTF-16.
> # delimiterString.getBytes(getCharset()) in DelimitedInputFormat.java will
> return a Big Endian byte sequence including the Byte Order Mark (BOM). The
> actual text file will not contain a BOM at each line ending, so the delimiter
> will never be read. Moreover, if the actual byte encoding of the file is
> Little Endian, the bytes will be interpreted incorrectly.
> # TextInputFormat.readRecord() will not see a BOM each time it decodes a
> byte sequence with the String(bytes, offset, numBytes, charset) call.
> Therefore, it will assume Big Endian, which may not always be correct. [1]
>
> [1] https://github.com/apache/flink/blob/master/flink-java/src/main/java/org/apache/flink/api/java/io/TextInputFormat.java#L95
>
> While there are likely many solutions, I would think that all of them would
> have to start by reading the BOM from the file when a Split is opened, then
> using that BOM to change the specified encoding to a BOM-specific one when
> the caller doesn't specify one, and to override the caller's specification
> if the BOM conflicts with it. That is, if the BOM indicates Little Endian and
> the caller indicates UTF-16BE, Flink should rewrite the charsetName as
> UTF-16LE.
> I hope this makes sense and that I haven't been testing incorrectly or
> misreading the code.
>
> I've verified the problem on version 1.4.2. I believe the problem exists on
> all versions.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
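The two problems listed in the description can be reproduced with plain JDK calls, outside Flink. The following is a minimal sketch, not Flink code: the class name `Utf16DelimiterDemo` and its helpers are made up for illustration. It only shows how the JDK's generic "UTF-16" charset behaves when encoding a delimiter and when decoding bytes that carry no BOM.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it does not exist in Flink.
final class Utf16DelimiterDemo {

    /** Encodes a delimiter with String.getBytes(charset), the call named in the issue. */
    static byte[] encodeDelimiter(String delimiter, Charset charset) {
        return delimiter.getBytes(charset);
    }

    /** Renders bytes as space-separated upper-case hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Decodes a BOM-less byte range the way a per-record String constructor call would. */
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, 0, bytes.length, charset);
    }

    public static void main(String[] args) {
        // Problem 1: the generic UTF-16 encoder prepends a big-endian BOM (FE FF),
        // so the encoded delimiter is four bytes and never matches a line ending.
        System.out.println(hex(encodeDelimiter("\n", StandardCharsets.UTF_16)));
        // The endian-specific charsets emit only the two payload bytes.
        System.out.println(hex(encodeDelimiter("\n", StandardCharsets.UTF_16BE)));
        System.out.println(hex(encodeDelimiter("\n", StandardCharsets.UTF_16LE)));

        // Problem 2: without a BOM, the generic UTF-16 decoder assumes big-endian,
        // so little-endian record bytes do not round-trip.
        byte[] littleEndianRecord = "hi".getBytes(StandardCharsets.UTF_16LE);
        System.out.println("hi".equals(decode(littleEndianRecord, StandardCharsets.UTF_16)));
    }
}
```

With a standard JDK, the first line printed is the four-byte sequence with the BOM, which is why a delimiter encoded this way cannot be found inside the file body.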
[jira] [Assigned] (FLINK-10134) UTF-16 support for TextInputFormat
[ https://issues.apache.org/jira/browse/FLINK-10134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

lihongli reassigned FLINK-10134:
--------------------------------

    Assignee: lihongli

> UTF-16 support for TextInputFormat
> ----------------------------------
>
>                 Key: FLINK-10134
>                 URL: https://issues.apache.org/jira/browse/FLINK-10134
>             Project: Flink
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.4.2
>            Reporter: David Dreyfus
>            Assignee: lihongli
>            Priority: Blocker
>              Labels: pull-request-available
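The fix proposed in the FLINK-10134 description (read the BOM when a split is opened and let it decide the byte order, overriding the caller if the two conflict) could be sketched as follows. This is an illustration under assumptions, not the change that was merged into Flink: `Utf16BomResolver` and `Resolved` are hypothetical names, and the sketch ignores two practical concerns, namely that only the split starting at file offset 0 contains the BOM, and that the bytes FF FE can also begin a UTF-32LE BOM.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not part of Flink's API.
final class Utf16BomResolver {

    /** The charset to decode with, plus the number of leading BOM bytes to skip. */
    static final class Resolved {
        final Charset charset;
        final int bomLength;

        Resolved(Charset charset, int bomLength) {
            this.charset = charset;
            this.bomLength = bomLength;
        }
    }

    /**
     * Picks an endian-specific charset from the first bytes of a file.
     * A BOM wins over the caller's UTF-16 variant; without a BOM, the generic
     * "UTF-16" falls back to big-endian, matching the JDK decoder's default.
     */
    static Resolved resolve(byte[] head, Charset requested) {
        boolean utf16Family = requested.name().startsWith("UTF-16");
        if (utf16Family && head.length >= 2) {
            int b0 = head[0] & 0xFF;
            int b1 = head[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) {
                return new Resolved(StandardCharsets.UTF_16BE, 2);
            }
            if (b0 == 0xFF && b1 == 0xFE) {
                return new Resolved(StandardCharsets.UTF_16LE, 2);
            }
        }
        if (requested.equals(StandardCharsets.UTF_16)) {
            return new Resolved(StandardCharsets.UTF_16BE, 0);
        }
        return new Resolved(requested, 0);
    }

    public static void main(String[] args) {
        // FF FE, then "a", "\n", "b" in little-endian UTF-16.
        byte[] file = {(byte) 0xFF, (byte) 0xFE, 0x61, 0x00, 0x0A, 0x00, 0x62, 0x00};

        // The caller asked for UTF-16BE, but the BOM says little-endian.
        Resolved resolved = resolve(file, StandardCharsets.UTF_16BE);
        System.out.println(resolved.charset);

        // Encode the delimiter with the resolved charset so it carries no BOM.
        byte[] delimiter = "\n".getBytes(resolved.charset);
        System.out.println(delimiter.length);

        // Decode the payload after skipping the BOM.
        String text = new String(
                file, resolved.bomLength, file.length - resolved.bomLength, resolved.charset);
        System.out.println(text.split("\n").length);
    }
}
```

Resolving the charset once per file and reusing it for both the delimiter bytes and the record decoding addresses both numbered problems with a single decision point.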
[jira] [Commented] (FLINK-10355) The order of the column should start from 1.
[ https://issues.apache.org/jira/browse/FLINK-10355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16620643#comment-16620643 ]

lihongli commented on FLINK-10355:
----------------------------------

[~hequn8128] I inserted an invalid row into Oracle. It prints the whole row and indicates the wrong column with a * below it.

> The order of the column should start from 1.
> --------------------------------------------
>
>                 Key: FLINK-10355
>                 URL: https://issues.apache.org/jira/browse/FLINK-10355
>             Project: Flink
>          Issue Type: Improvement
>          Components: Table API & SQL
>    Affects Versions: 1.6.0
>            Reporter: lihongli
>            Priority: Major
>              Labels: easyfix, pull-request-available
>         Attachments: B0C32FD9-47FE-4F63-921F-A9E49C0CB5CD.png
>
> When I register an external table using a CsvTableSource, it throws an
> exception: "Parsing error for column 1". But I eventually found that the
> second column is the one causing the error. I think that the column
> numbering should start from 1.
[jira] [Commented] (FLINK-10355) The order of the column should start from 1.
[ https://issues.apache.org/jira/browse/FLINK-10355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16619147#comment-16619147 ]

lihongli commented on FLINK-10355:
----------------------------------

[~hequn8128] Yes, I agree with you. I also think that numbering the columns from 1 is more user-friendly.

> The order of the column should start from 1.
> --------------------------------------------
>
>                 Key: FLINK-10355
>                 URL: https://issues.apache.org/jira/browse/FLINK-10355
>             Project: Flink
>          Issue Type: Improvement
>          Components: Table API & SQL
>    Affects Versions: 1.6.0
>            Reporter: lihongli
>            Priority: Major
>              Labels: easyfix, pull-request-available
>         Attachments: B0C32FD9-47FE-4F63-921F-A9E49C0CB5CD.png
[jira] [Updated] (FLINK-10355) The order of the column should start from 1.
[ https://issues.apache.org/jira/browse/FLINK-10355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

lihongli updated FLINK-10355:
-----------------------------
    Attachment: B0C32FD9-47FE-4F63-921F-A9E49C0CB5CD.png

> The order of the column should start from 1.
> --------------------------------------------
>
>                 Key: FLINK-10355
>                 URL: https://issues.apache.org/jira/browse/FLINK-10355
>             Project: Flink
>          Issue Type: Improvement
>          Components: Java API
>    Affects Versions: 1.6.0
>            Reporter: lihongli
>            Priority: Major
>         Attachments: B0C32FD9-47FE-4F63-921F-A9E49C0CB5CD.png
[jira] [Updated] (FLINK-10355) The order of the column should start from 1.
[ https://issues.apache.org/jira/browse/FLINK-10355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

lihongli updated FLINK-10355:
-----------------------------
    Description: 
When I register an external table using a CsvTableSource, it throws an exception: "Parsing error for column 1". But I eventually found that the second column is the one causing the error. I think that the column numbering should start from 1.

  (was: When I register an external table using a CsvTableSource, it throws an exception: "Parsing error for column 1". But I eventually found that the second column is the one causing the error. I think that the column numbering should start from 1.

!image-2018-09-17-21-57-55-038.png!)

> The order of the column should start from 1.
> --------------------------------------------
>
>                 Key: FLINK-10355
>                 URL: https://issues.apache.org/jira/browse/FLINK-10355
>             Project: Flink
>          Issue Type: Improvement
>          Components: Java API
>    Affects Versions: 1.6.0
>            Reporter: lihongli
>            Priority: Major
[jira] [Created] (FLINK-10355) The order of the column should start from 1.
lihongli created FLINK-10355:
--------------------------------

             Summary: The order of the column should start from 1.
                 Key: FLINK-10355
                 URL: https://issues.apache.org/jira/browse/FLINK-10355
             Project: Flink
          Issue Type: Improvement
          Components: Java API
    Affects Versions: 1.6.0
            Reporter: lihongli


When I register an external table using a CsvTableSource, it throws an exception: "Parsing error for column 1". But I eventually found that the second column is the one causing the error. I think that the column numbering should start from 1.

!image-2018-09-17-21-57-55-038.png!
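The request in FLINK-10355 amounts to converting the parser's internal zero-based field index to a one-based position before it reaches the user. A minimal sketch of that idea follows. It is not Flink's CSV parser: the class `ColumnErrorDemo` and the method `parseIntRow` are hypothetical, and the message also includes the full row, along the lines of the Oracle behaviour mentioned in the comments.

```java
import java.util.regex.Pattern;

// Hypothetical example; Flink's actual CSV parsing code is not shown here.
final class ColumnErrorDemo {

    /**
     * Parses a delimited row of integers. On failure, reports the offending
     * column by its one-based position together with the whole row.
     */
    static int[] parseIntRow(String line, String delimiter) {
        // Pattern.quote treats the delimiter literally; limit -1 keeps trailing empty fields.
        String[] fields = line.split(Pattern.quote(delimiter), -1);
        int[] values = new int[fields.length];
        for (int i = 0; i < fields.length; i++) {
            try {
                values[i] = Integer.parseInt(fields[i].trim());
            } catch (NumberFormatException e) {
                // i is zero-based internally; users count columns from 1.
                throw new IllegalArgumentException(
                        "Parsing error for column " + (i + 1) + " of row '" + line + "'", e);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        try {
            parseIntRow("1,x,3", ",");
        } catch (IllegalArgumentException e) {
            // The second field is invalid, so the message names column 2.
            System.out.println(e.getMessage());
        }
    }
}
```

Keeping the zero-based index internal and converting only at the point where the message is built means the change touches the error text alone, not the parsing logic.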