[jira] [Assigned] (FLINK-10134) UTF-16 support for TextInputFormat

2018-09-28 Thread lihongli (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lihongli reassigned FLINK-10134:


Assignee: (was: lihongli)

> UTF-16 support for TextInputFormat
> --
>
> Key: FLINK-10134
> URL: https://issues.apache.org/jira/browse/FLINK-10134
> Project: Flink
>  Issue Type: Bug
>  Components: Core
>Affects Versions: 1.4.2
>Reporter: David Dreyfus
>Priority: Blocker
>  Labels: pull-request-available
>
> It does not appear that Flink supports a charset encoding of "UTF-16". It 
> particular, it doesn't appear that Flink consumes the Byte Order Mark (BOM) 
> to establish whether a UTF-16 file is UTF-16LE or UTF-16BE.
>  
> TextInputFormat.setCharset("UTF-16") calls DelimitedInputFormat.setCharset(), 
> which sets TextInputFormat.charsetName and then modifies the previously set 
> delimiterString to construct the proper byte string encoding of the the 
> delimiter. This same charsetName is also used in TextInputFormat.readRecord() 
> to interpret the bytes read from the file.
>  
> There are two problems that this implementation would seem to have when using 
> UTF-16.
>  # delimiterString.getBytes(getCharset()) in DelimitedInputFormat.java will 
> return a Big Endian byte sequence including the Byte Order Mark (BOM). The 
> actual text file will not contain a BOM at each line ending, so the delimiter 
> will never be read. Moreover, if the actual byte encoding of the file is 
> Little Endian, the bytes will be interpreted incorrectly.
>  # TextInputFormat.readRecord() will not see a BOM each time it decodes a 
> byte sequence with the String(bytes, offset, numBytes, charset) call. 
> Therefore, it will assume Big Endian, which may not always be correct. [1] 
> [https://github.com/apache/flink/blob/master/flink-java/src/main/java/org/apache/flink/api/java/io/TextInputFormat.java#L95]
>  
> While there are likely many solutions, I would think that all of them would 
> have to start by reading the BOM from the file when a Split is opened and 
> then using that BOM to modify the specified encoding to a BOM specific one 
> when the caller doesn't specify one, and to overwrite the caller's 
> specification if the BOM is in conflict with the caller's specification. That 
> is, if the BOM indicates Little Endian and the caller indicates UTF-16BE, 
> Flink should rewrite the charsetName as UTF-16LE.
>  I hope this makes sense and that I haven't been testing incorrectly or 
> misreading the code.
>  
> I've verified the problem on version 1.4.2. I believe the problem exists on 
> all versions. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (FLINK-10134) UTF-16 support for TextInputFormat

2018-09-28 Thread lihongli (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lihongli reassigned FLINK-10134:


Assignee: lihongli

> UTF-16 support for TextInputFormat
> --
>
> Key: FLINK-10134
> URL: https://issues.apache.org/jira/browse/FLINK-10134
> Project: Flink
>  Issue Type: Bug
>  Components: Core
>Affects Versions: 1.4.2
>Reporter: David Dreyfus
>Assignee: lihongli
>Priority: Blocker
>  Labels: pull-request-available
>
> It does not appear that Flink supports a charset encoding of "UTF-16". It 
> particular, it doesn't appear that Flink consumes the Byte Order Mark (BOM) 
> to establish whether a UTF-16 file is UTF-16LE or UTF-16BE.
>  
> TextInputFormat.setCharset("UTF-16") calls DelimitedInputFormat.setCharset(), 
> which sets TextInputFormat.charsetName and then modifies the previously set 
> delimiterString to construct the proper byte string encoding of the the 
> delimiter. This same charsetName is also used in TextInputFormat.readRecord() 
> to interpret the bytes read from the file.
>  
> There are two problems that this implementation would seem to have when using 
> UTF-16.
>  # delimiterString.getBytes(getCharset()) in DelimitedInputFormat.java will 
> return a Big Endian byte sequence including the Byte Order Mark (BOM). The 
> actual text file will not contain a BOM at each line ending, so the delimiter 
> will never be read. Moreover, if the actual byte encoding of the file is 
> Little Endian, the bytes will be interpreted incorrectly.
>  # TextInputFormat.readRecord() will not see a BOM each time it decodes a 
> byte sequence with the String(bytes, offset, numBytes, charset) call. 
> Therefore, it will assume Big Endian, which may not always be correct. [1] 
> [https://github.com/apache/flink/blob/master/flink-java/src/main/java/org/apache/flink/api/java/io/TextInputFormat.java#L95]
>  
> While there are likely many solutions, I would think that all of them would 
> have to start by reading the BOM from the file when a Split is opened and 
> then using that BOM to modify the specified encoding to a BOM specific one 
> when the caller doesn't specify one, and to overwrite the caller's 
> specification if the BOM is in conflict with the caller's specification. That 
> is, if the BOM indicates Little Endian and the caller indicates UTF-16BE, 
> Flink should rewrite the charsetName as UTF-16LE.
>  I hope this makes sense and that I haven't been testing incorrectly or 
> misreading the code.
>  
> I've verified the problem on version 1.4.2. I believe the problem exists on 
> all versions. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-10355) The order of the column should start from 1.

2018-09-19 Thread lihongli (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16620643#comment-16620643
 ] 

lihongli commented on FLINK-10355:
--

[~hequn8128] I insert a wrong row into oracle.It  prints the whole row.The 
oracle indicates the wrong column with a * below it.

> The order of the column should start from 1.
> 
>
> Key: FLINK-10355
> URL: https://issues.apache.org/jira/browse/FLINK-10355
> Project: Flink
>  Issue Type: Improvement
>  Components: Table API & SQL
>Affects Versions: 1.6.0
>Reporter: lihongli
>Priority: Major
>  Labels: easyfix, pull-request-available
> Attachments: B0C32FD9-47FE-4F63-921F-A9E49C0CB5CD.png
>
>
> When  I register an external Table using a CsvTableSource.It throws an 
> exception :"Parsing error for column 1".But I finally found that the second 
> column is the error column.I think that the order of the column should start 
> from 1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-10355) The order of the column should start from 1.

2018-09-18 Thread lihongli (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16619147#comment-16619147
 ] 

lihongli commented on FLINK-10355:
--

[~hequn8128] Yes,I agree with you.And I think that the columns start from 1 is 
more user-friendly.

> The order of the column should start from 1.
> 
>
> Key: FLINK-10355
> URL: https://issues.apache.org/jira/browse/FLINK-10355
> Project: Flink
>  Issue Type: Improvement
>  Components: Table API & SQL
>Affects Versions: 1.6.0
>Reporter: lihongli
>Priority: Major
>  Labels: easyfix, pull-request-available
> Attachments: B0C32FD9-47FE-4F63-921F-A9E49C0CB5CD.png
>
>
> When  I register an external Table using a CsvTableSource.It throws an 
> exception :"Parsing error for column 1".But I finally found that the second 
> column is the error column.I think that the order of the column should start 
> from 1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-10355) The order of the column should start from 1.

2018-09-17 Thread lihongli (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lihongli updated FLINK-10355:
-
Attachment: B0C32FD9-47FE-4F63-921F-A9E49C0CB5CD.png

> The order of the column should start from 1.
> 
>
> Key: FLINK-10355
> URL: https://issues.apache.org/jira/browse/FLINK-10355
> Project: Flink
>  Issue Type: Improvement
>  Components: Java API
>Affects Versions: 1.6.0
>Reporter: lihongli
>Priority: Major
> Attachments: B0C32FD9-47FE-4F63-921F-A9E49C0CB5CD.png
>
>
> When  I register an external Table using a CsvTableSource.It throws an 
> exception :"Parsing error for column 1".But I finally found that the second 
> column is the error column.I think that the order of the column should start 
> from 1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-10355) The order of the column should start from 1.

2018-09-17 Thread lihongli (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lihongli updated FLINK-10355:
-
Description: When  I register an external Table using a CsvTableSource.It 
throws an exception :"Parsing error for column 1".But I finally found that the 
second column is the error column.I think that the order of the column should 
start from 1.  (was: When  I register an external Table using a 
CsvTableSource.It throws an exception :"Parsing error for column 1".But I 
finally found that the second column is the error column.I think that the order 
of the column should start from 1.

!image-2018-09-17-21-57-55-038.png!)

> The order of the column should start from 1.
> 
>
> Key: FLINK-10355
> URL: https://issues.apache.org/jira/browse/FLINK-10355
> Project: Flink
>  Issue Type: Improvement
>  Components: Java API
>Affects Versions: 1.6.0
>Reporter: lihongli
>Priority: Major
>
> When  I register an external Table using a CsvTableSource.It throws an 
> exception :"Parsing error for column 1".But I finally found that the second 
> column is the error column.I think that the order of the column should start 
> from 1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-10355) The order of the column should start from 1.

2018-09-17 Thread lihongli (JIRA)
lihongli created FLINK-10355:


 Summary: The order of the column should start from 1.
 Key: FLINK-10355
 URL: https://issues.apache.org/jira/browse/FLINK-10355
 Project: Flink
  Issue Type: Improvement
  Components: Java API
Affects Versions: 1.6.0
Reporter: lihongli


When  I register an external Table using a CsvTableSource.It throws an 
exception :"Parsing error for column 1".But I finally found that the second 
column is the error column.I think that the order of the column should start 
from 1.

!image-2018-09-17-21-57-55-038.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)