Mike Sukmanowsky created PIG-4240: ------------------------------------- Summary: Python UDF output serialization is invalid Key: PIG-4240 URL: https://issues.apache.org/jira/browse/PIG-4240 Project: Pig Issue Type: Bug Affects Versions: 0.12.0, 0.12.1, 0.13.0, 0.12.2, 0.13.1 Reporter: Mike Sukmanowsky
The [serialize_output function|https://github.com/apache/pig/blob/trunk/src/python/streaming/controller.py#L306-L339] handles {{str}} and {{unicode}} return types from UDFs inappropriately. Namely, a function could return a valid UTF-8 string, but the call to unicode will trigger a {{decode}} using the default Python encoding, ascii which can cause {{UnicodeDecodeError}}s. Example: {code:none} #!/usr/bin/env python # -*- coding: utf-8 -*- # pig_test.py s = "¼½¾" # valid utf-8 string print unicode(s, 'utf-8') # works print unicode(s) # triggers UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128) {code} I propose making a few changes to {{serialize_output}}: * Since ASCII (Python 2.x's default encoding) is a proper subset of UTF-8, there's no need to check for {{str}} being UTF-8 encoded, so we should remove checking where {{output_type == str}} * Add a check for {{output_type == unicode}} that ensures utf-8 via {{output.encode('utf-8')}} Patch forthcoming. -- This message was sent by Atlassian JIRA (v6.3.4#6332)