Bruno Pusztahazi created PIG-5404:
-------------------------------------
Summary: FLATTEN infers wrong datatype
Key: PIG-5404
URL: https://issues.apache.org/jira/browse/PIG-5404
Project: Pig
Issue Type: Bug
Components: piggybank
Affects Versions: 0.17.0
Reporter: Bruno Pusztahazi
In version 0.12 (checked out branch-0.12), the script below works as expected
with the following input file test.csv:
{code:java}
John_5,18,4.0F
Mary_6,19,3.8F
Bill_7,20,3.9F
Joe_8,18,3.8F{code}
{code:java}
A = LOAD 'test.csv' USING PigStorage(',') AS (name:chararray, age:int, gpr:float);
B = FOREACH A GENERATE FLATTEN(STRSPLIT(name, '_')) AS (name1:chararray, name2:chararray), age, gpr;
DESCRIBE B;{code}
and produces the following output:
{code:java}
B: {name1: chararray,name2: chararray,age: int,gpr: float}
{code}
This is the expected output, because the AS clause declares both fields
produced by FLATTEN as chararray.
When using version 0.17 (checked out branch-0.17), the same code produces:
{code:java}
B: {name1: bytearray,name2: bytearray,age: int,gpr: float}
{code}
This shows that FLATTEN somehow infers the wrong data types (bytearray instead
of chararray).
Using explicit casting as a workaround on 0.17:
{code:java}
B1 = FOREACH B GENERATE (chararray)name1,(chararray)name2,age,gpr;
DESCRIBE B1;{code}
produces
{code:java}
B1: {name1: chararray,name2: chararray,age: int,gpr: float}
{code}
This time the data types are as expected.
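For convenience, the steps above can be combined into a single script. This is only a sketch assembled from the snippets in this report. It assumes test.csv sits in the current working directory and that the script is run in local mode (for example with pig -x local). The results noted in the comments are the ones observed on branch-0.17.
{code:java}
-- Load the sample data with an explicit schema.
A = LOAD 'test.csv' USING PigStorage(',') AS (name:chararray, age:int, gpr:float);

-- Split name on '_' and flatten the resulting tuple; the AS clause
-- declares both flattened fields as chararray.
B = FOREACH A GENERATE FLATTEN(STRSPLIT(name, '_')) AS (name1:chararray, name2:chararray), age, gpr;
DESCRIBE B;   -- reported on 0.17: name1 and name2 come back as bytearray

-- Workaround from this report: cast the flattened fields explicitly.
B1 = FOREACH B GENERATE (chararray)name1, (chararray)name2, age, gpr;
DESCRIBE B1;  -- reported on 0.17: name1 and name2 are chararray
{code}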
The EXPLAIN plan shows some strange Cast operators that do not appear to take
effect (or at least the resulting data types are wrong):
{code:java}
#-----------------------------------------------
# New Logical Plan:
#-----------------------------------------------
B: (Name: LOStore Schema: name1#121:chararray,name2#122:chararray,age#105:int,gpr#106:float)
|
|---B: (Name: LOForEach Schema: name1#121:chararray,name2#122:chararray,age#105:int,gpr#106:float)
| |
| (Name: LOGenerate[false,false,false,false] Schema: name1#121:chararray,name2#122:chararray,age#105:int,gpr#106:float)ColumnPrune:OutputUids=[121, 105, 122, 106]ColumnPrune:InputUids=[121, 105, 122, 106]
| | |
| | (Name: Cast Type: chararray Uid: 121)
| | |
| | |---name1:(Name: Project Type: bytearray Uid: 121 Input: 0 Column: 0)
| | |
| | (Name: Cast Type: chararray Uid: 122)
| | |
| | |---name2:(Name: Project Type: bytearray Uid: 122 Input: 1 Column: 0)
| | |
| | age:(Name: Project Type: int Uid: 105 Input: 2 Column: 0)
| | |
| | gpr:(Name: Project Type: float Uid: 106 Input: 3 Column: 0)
| |
| |---(Name: LOInnerLoad[0] Schema: name1#121:bytearray)
| |
| |---(Name: LOInnerLoad[1] Schema: name2#122:bytearray)
| |
| |---(Name: LOInnerLoad[2] Schema: age#105:int)
| |
| |---(Name: LOInnerLoad[3] Schema: gpr#106:float)
|
|---B: (Name: LOForEach Schema: name1#135:bytearray,name2#136:bytearray,age#105:int,gpr#106:float)
| |
| (Name: LOGenerate[true,false,false] Schema: name1#135:bytearray,name2#136:bytearray,age#105:int,gpr#106:float)
| | |
| | (Name: UserFunc(org.apache.pig.builtin.STRSPLIT) Type: tuple Uid: 132)
| | |
| | |---(Name: Cast Type: chararray Uid: 104)
| | | |
| | | |---name:(Name: Project Type: bytearray Uid: 104 Input: 0 Column: (*))
| | |
| | |---(Name: Constant Type: chararray Uid: 131)
| | |
| | (Name: Cast Type: int Uid: 105)
| | |
| | |---age:(Name: Project Type: bytearray Uid: 105 Input: 1 Column: (*))
| | |
| | (Name: Cast Type: float Uid: 106)
| | |
| | |---gpr:(Name: Project Type: bytearray Uid: 106 Input: 2 Column: (*))
| |
| |---(Name: LOInnerLoad[0] Schema: name#104:bytearray)
| |
| |---(Name: LOInnerLoad[1] Schema: age#105:bytearray)
| |
| |---(Name: LOInnerLoad[2] Schema: gpr#106:bytearray)
|
|---A: (Name: LOLoad Schema: name#104:bytearray,age#105:bytearray,gpr#106:bytearray)RequiredFields:null
{code}