[ https://issues.apache.org/jira/browse/BEAM-7008?focusedWorklogId=223296&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-223296 ]
ASF GitHub Bot logged work on BEAM-7008:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 04/Apr/19 22:14
            Start Date: 04/Apr/19 22:14
    Worklog Time Spent: 10m
      Work Description: robertwb commented on pull request #8228: [BEAM-7008] standardize UTF-8 string coder encodings
URL: https://github.com/apache/beam/pull/8228#discussion_r272387474

 ##########
 File path: sdks/python/apache_beam/coders/coder_impl.py
 ##########
 @@ -433,6 +433,18 @@ def decode(self, encoded):
     return encoded
 
 
+class StrUtf8CoderImpl(StreamCoderImpl):
+  """For internal use only; no backwards-compatibility guarantees."""
+  def encode_to_stream(self, value, out, nested):
+    byte_value = value.encode('utf-8')
+    out.write_var_int64(len(byte_value))
+    out.write(byte_value)
+
+  def decode_from_stream(self, in_stream, nested):
+    byte_length = in_stream.read_var_int64()

 Review comment:
   Similarly. Otherwise read to the end of the stream. This can be done with `stream.read_all(nested)`.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Issue Time Tracking
-------------------

    Worklog Id:     (was: 223296)
    Time Spent: 20m  (was: 10m)

> standardize UTF-8 string coder encodings
> ----------------------------------------
>
>                 Key: BEAM-7008
>                 URL: https://issues.apache.org/jira/browse/BEAM-7008
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-java-core, sdk-py-core
>            Reporter: Heejong Lee
>            Assignee: Heejong Lee
>            Priority: Major
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> It looks like the UTF-8 string coders in the Java and Python SDKs use
> different encoding schemes. StringUtf8Coder in the Java SDK puts the varint
> length of the input string before the actual data bytes, whereas
> StrUtf8Coder in the Python SDK encodes the input string directly to a bytes
> value. We should unify the encoding schemes of UTF-8 strings across the
> different SDKs and make it a standard coder.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
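
The two schemes contrasted in the issue description, and the nested/non-nested distinction raised in the review comment, can be sketched in plain Python. This is only an illustration under stated assumptions: `write_var_int`, `read_var_int`, `encode_utf8` and `decode_utf8` are hypothetical helper names, not Beam APIs, and the varint shown is a generic base-128 encoding for non-negative integers that is not claimed to be byte-for-byte what either SDK emits.

```python
import io


def write_var_int(out, n):
    """Write a non-negative int as a base-128 varint, low 7 bits first.

    Hypothetical helper for illustration; not part of any SDK.
    """
    if n < 0:
        raise ValueError("this sketch only handles non-negative lengths")
    while True:
        bits = n & 0x7F
        n >>= 7
        if n:
            out.write(bytes([bits | 0x80]))  # high bit set: more bytes follow
        else:
            out.write(bytes([bits]))
            return


def read_var_int(stream):
    """Read a base-128 varint written by write_var_int."""
    result = 0
    shift = 0
    while True:
        raw = stream.read(1)
        if not raw:
            raise EOFError("stream ended inside a varint")
        byte = raw[0]
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result
        shift += 7


def encode_utf8(value, nested):
    """Encode a string; prefix the byte length only when nested (assumption)."""
    data = value.encode("utf-8")
    out = io.BytesIO()
    if nested:
        write_var_int(out, len(data))
    out.write(data)
    return out.getvalue()


def decode_utf8(encoded, nested):
    """Inverse of encode_utf8; when not nested, read to the end of the stream."""
    stream = io.BytesIO(encoded)
    if nested:
        length = read_var_int(stream)
        data = stream.read(length)
        if len(data) != length:
            raise EOFError("stream ended before the declared length")
    else:
        data = stream.read()
    return data.decode("utf-8")
```

With `nested=True` the output carries the length prefix described for the Java coder; with `nested=False` it is the bare UTF-8 bytes described for the Python coder, so a decoder has to know which form it is reading.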