[jira] Commented: (PIG-560) UTFDataFormatException (encoded string too long) is thrown when storing strings 65536 bytes (in UTF8 form) using BinStorage()
[ https://issues.apache.org/jira/browse/PIG-560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669097#action_12669097 ]

Laukik Chitnis commented on PIG-560:

In the current patch, when the length is < 65536, the string-to-UTF-8 conversion happens twice: once with String::getBytes() and once with DataOutput::writeUTF(). To avoid that, instead of writeUTF(), how about using writeShort() followed by writeBytes(), since we would already have the length and the UTF-8 bytes?

UTFDataFormatException (encoded string too long) is thrown when storing strings 65536 bytes (in UTF8 form) using BinStorage()
---
Key: PIG-560
URL: https://issues.apache.org/jira/browse/PIG-560
Project: Pig
Issue Type: Bug
Affects Versions: types_branch
Reporter: Pradeep Kamath
Fix For: types_branch
Attachments: PIG-560.patch, utf-limit-patch.diff

BinStorage() uses the DataOutput.writeUTF() and DataInput.readUTF() Java APIs to write Strings out as UTF-8 bytes and to read them back. From the Javadoc: "First, the total number of bytes needed to represent all the characters of s is calculated. If this number is larger than 65535, then a UTFDataFormatException is thrown." (This is because the writeUTF() API uses 2 bytes to represent the number of bytes.) A way to get around this would be to not use writeUTF()/readUTF() and instead hand-convert the string to the corresponding UTF-8 byte[] (using String.getBytes("UTF-8")) and then write the length of the byte array as an int. This allows sizes up to 2^31 - 1 bytes, the largest value of a signed Java int.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
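The workaround proposed in the issue description (an int length followed by the UTF-8 bytes) could look roughly like the sketch below. It is not the attached patch: the class LengthPrefixedStrings and all of its method names are hypothetical, and StandardCharsets.UTF_8 stands in for the "UTF-8" charset name mentioned in the description. The sketch also shows writeUTF() rejecting a string whose encoded form is 65536 bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration only; not Pig's BinStorage code.
class LengthPrefixedStrings {

    // Writes a 4-byte length followed by the string's standard UTF-8 bytes.
    static void writeString(DataOutput out, String s) throws IOException {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(utf8.length);
        out.write(utf8);
    }

    // Reads back a string produced by writeString().
    static String readString(DataInput in) throws IOException {
        int length = in.readInt();
        byte[] utf8 = new byte[length];
        in.readFully(utf8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    // Reports whether DataOutput.writeUTF() refuses the string as too long.
    static boolean writeUtfRejects(String s) {
        try {
            new DataOutputStream(new ByteArrayOutputStream()).writeUTF(s);
            return false;
        } catch (UTFDataFormatException e) {
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes the string to an in-memory buffer and reads it back.
    static String roundTrip(String s) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            writeString(new DataOutputStream(buffer), s);
            byte[] data = buffer.toByteArray();
            return readString(new DataInputStream(new ByteArrayInputStream(data)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Builds a string of `count` copies of `c`.
    static String repeat(char c, int count) {
        char[] chars = new char[count];
        Arrays.fill(chars, c);
        return new String(chars);
    }

    public static void main(String[] args) {
        String big = repeat('a', 65536); // 65536 bytes in UTF-8
        System.out.println("writeUTF rejects it: " + writeUtfRejects(big));
        System.out.println("round trip intact:   " + roundTrip(big).equals(big));
    }
}
```

Note that writeUTF() emits Java's modified UTF-8 while getBytes() emits standard UTF-8, so data written this way should be read with the matching readString() rather than with readUTF().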
[jira] Commented: (PIG-560) UTFDataFormatException (encoded string too long) is thrown when storing strings 65536 bytes (in UTF8 form) using BinStorage()
[ https://issues.apache.org/jira/browse/PIG-560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669099#action_12669099 ]

Laukik Chitnis commented on PIG-560:

In the current patch, when the length is < 65536, the string-to-UTF-8 conversion happens twice: once with String::getBytes() and once with DataOutput::writeUTF(). Instead of writeUTF(), how about using writeShort() followed by writeBytes(), since we would already have the length and the UTF-8 bytes?
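A minimal sketch of the single-conversion idea in this comment follows. The class SingleEncodeWriter and its methods are hypothetical, not code from PIG-560.patch. Two assumptions are worth flagging: the raw bytes are written with DataOutput.write(byte[]), because DataOutput.writeBytes() takes a String rather than a byte array, and the length is read back with readUnsignedShort(), since lengths above 32767 do not fit in a signed short. Strings that encode to more than 65535 bytes need a separate path, which this sketch leaves out.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration only; not taken from the patch.
class SingleEncodeWriter {
    static final int MAX_SHORT_LENGTH = 65535; // largest value a 2-byte length can hold

    // Encodes the string once, then writes a 2-byte length and the bytes.
    static void writeShortString(DataOutput out, String s) throws IOException {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8); // the only encoding pass
        if (utf8.length > MAX_SHORT_LENGTH) {
            throw new IllegalArgumentException(
                "string needs a long-string path: " + utf8.length + " bytes");
        }
        out.writeShort(utf8.length); // writes the low 16 bits of the length
        out.write(utf8);
    }

    // Reads a string written by writeShortString().
    static String readShortString(DataInput in) throws IOException {
        int length = in.readUnsignedShort(); // 0..65535
        byte[] utf8 = new byte[length];
        in.readFully(utf8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    // Serializes with writeShortString() into a byte array.
    static byte[] encode(String s) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            writeShortString(new DataOutputStream(buffer), s);
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Serializes with the standard writeUTF() for comparison.
    static byte[] encodeWithWriteUtf(String s) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            new DataOutputStream(buffer).writeUTF(s);
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Deserializes bytes produced by encode().
    static String decode(byte[] data) {
        try {
            return readShortString(new DataInputStream(new ByteArrayInputStream(data)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String sample = "h\u00e9llo";
        boolean same = Arrays.equals(encode(sample), encodeWithWriteUtf(sample));
        System.out.println("same bytes as writeUTF: " + same);
        System.out.println("round trip intact:      " + decode(encode(sample)).equals(sample));
    }
}
```

For text without U+0000 or supplementary characters the output matches writeUTF() byte for byte. For those two cases writeUTF() uses Java's modified UTF-8, so the bytes differ and readUTF() may not accept them; a real change would have to decide which encoding the on-disk format uses.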
[jira] Commented: (PIG-560) UTFDataFormatException (encoded string too long) is thrown when storing strings 65536 bytes (in UTF8 form) using BinStorage()
[ https://issues.apache.org/jira/browse/PIG-560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668682#action_12668682 ]

Alan Gates commented on PIG-560:

I'm concerned here that we're adding 2 bytes to every string we store for a case which should be quite rare (how often do people have strings longer than 64K?). Would it be better to have bin storage define a long string type that uses 4 bytes to encode its length, then test a string's length before writing it out, leaving things as they are now for most strings and using the new long string type for anything over 64K?

UTFDataFormatException (encoded string too long) is thrown when storing strings 65536 bytes (in UTF8 form) using BinStorage()
---
Key: PIG-560
URL: https://issues.apache.org/jira/browse/PIG-560
Project: Pig
Issue Type: Bug
Affects Versions: types_branch
Reporter: Pradeep Kamath
Fix For: types_branch
Attachments: utf-limit-patch.diff
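One way the two-type idea in this comment might be sketched is shown below. Everything here is an assumption made for illustration: the class TaggedStrings, the one-byte markers SHORT_STRING and LONG_STRING, and the premise that the format writes a type marker before each value. The length test uses the character count rather than the encoded size, because each Java char takes at most 3 bytes in writeUTF()'s encoding, so any string of at most 65535 / 3 = 21845 chars is certain to fit. Longer strings take the 4-byte path even if they would have fit, which costs 2 extra bytes but avoids encoding the string just to measure it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; the type markers below are not Pig's real constants.
class TaggedStrings {
    static final byte SHORT_STRING = 1;          // 2-byte length, via writeUTF()
    static final byte LONG_STRING = 2;           // 4-byte length, raw UTF-8 bytes
    static final int MAX_SAFE_CHARS = 65535 / 3; // 21845 chars always fit writeUTF()

    // Picks the short or long representation from the string's char count.
    static void writeString(DataOutput out, String s) throws IOException {
        if (s.length() <= MAX_SAFE_CHARS) {
            out.writeByte(SHORT_STRING);
            out.writeUTF(s);
        } else {
            byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
            out.writeByte(LONG_STRING);
            out.writeInt(utf8.length);
            out.write(utf8);
        }
    }

    // Reads either representation, dispatching on the type marker.
    static String readString(DataInput in) throws IOException {
        byte tag = in.readByte();
        switch (tag) {
            case SHORT_STRING:
                return in.readUTF();
            case LONG_STRING: {
                byte[] utf8 = new byte[in.readInt()];
                in.readFully(utf8);
                return new String(utf8, StandardCharsets.UTF_8);
            }
            default:
                throw new IOException("unknown string type marker: " + tag);
        }
    }

    // Serializes a string into a byte array.
    static byte[] encode(String s) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            writeString(new DataOutputStream(buffer), s);
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Deserializes bytes produced by encode().
    static String decode(byte[] data) {
        try {
            return readString(new DataInputStream(new ByteArrayInputStream(data)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        StringBuilder builder = new StringBuilder();
        for (int i = 0; i < 70000; i++) {
            builder.append('a');
        }
        String big = builder.toString();
        System.out.println("short form size: " + encode("abc").length);
        System.out.println("long form size:  " + encode(big).length);
        System.out.println("round trip ok:   " + decode(encode(big)).equals(big));
    }
}
```

With this layout, short strings keep their 2-byte length, and only strings on the long path pay for the 4-byte length.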