[ https://issues.apache.org/jira/browse/PIG-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14172658#comment-14172658 ]
Cheolsoo Park commented on PIG-4227:
------------------------------------

{quote}
Otherwise we break python udf which do insert tuples.
{quote}
True, but I rarely see UDFs that insert tuples, because in Jython you never had to do that. Since I deployed streaming UDFs in prod, a few users have inserted tuples, but only because they had to. I have now deployed my patch to prod and haven't heard any complaints.

But I do agree that if a UDF returns a list of tuples, there will be an extra layer of tuples. That's a valid corner case, indeed.

> Streaming Python UDF handles bag outputs incorrectly
> ----------------------------------------------------
>
>                 Key: PIG-4227
>                 URL: https://issues.apache.org/jira/browse/PIG-4227
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Cheolsoo Park
>            Assignee: Cheolsoo Park
>             Fix For: 0.14.0
>
>         Attachments: PIG-4227-1.patch
>
>
> I have a UDF that generates different outputs when running as Jython and
> as streaming Python.
> {code:title=jython}
> {([[BBC Worldwide]])}
> {code}
> {code:title=streaming python}
> {(BC Worldwid)}
> {code}
> The problem is that streaming Python encodes a bag output incorrectly. For
> this particular example, it serializes the output string as follows:
> {code}
> |{_[[BBC Worldwide]]|}_
> {code}
> where '|' and '_' wrap the bag delimiters '{' and '}', i.e. '{' => '|{_'
> and '}' => '|}_'.
> But this is wrong because a bag must contain tuples, not chararrays, i.e. the
> correct encoding is as follows:
> {code}
> |{_|(_[[BBC Worldwide]]|)_|}_
> {code}
> where '|' and '_' wrap the tuple delimiters '(' and ')' as well as the bag
> delimiters.
> This results in truncated outputs.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
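
To make the two encodings in the issue description concrete, here is a minimal Python sketch that reproduces both strings for a bag holding a single chararray. The function names (wrap, encode_single_item_bag) and the as_tuple flag are hypothetical and are not taken from Pig's streaming code or from the attached patch; only the '|' and '_' wrapper characters and the two expected strings come from the description.

```python
# Illustrative sketch only: these helpers are hypothetical and do not
# mirror Pig's actual streaming UDF serializer.

def wrap(delimiter):
    """Wrap a structural delimiter with '|' and '_', e.g. '{' -> '|{_'."""
    return "|" + delimiter + "_"


def encode_single_item_bag(value, as_tuple):
    """Encode a bag that holds one chararray value.

    as_tuple=False puts the raw value directly inside the bag delimiters
    (the encoding the issue reports as wrong); as_tuple=True wraps the
    value in tuple delimiters first (the encoding the issue gives as correct).
    """
    body = wrap("(") + value + wrap(")") if as_tuple else value
    return wrap("{") + body + wrap("}")


buggy = encode_single_item_bag("[[BBC Worldwide]]", as_tuple=False)
fixed = encode_single_item_bag("[[BBC Worldwide]]", as_tuple=True)

print(buggy)  # |{_[[BBC Worldwide]]|}_
print(fixed)  # |{_|(_[[BBC Worldwide]]|)_|}_
```

The sketch deliberately covers only the single-item case from the example; how multiple tuples or fields are separated inside a bag is not stated in the description, so it is left out.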