PIG regression (in BinStorage?) between 0.8.1 and 0.9.x
-------------------------------------------------------
Key: PIG-2271
URL: https://issues.apache.org/jira/browse/PIG-2271
Project: Pig
Issue Type: Bug
Affects Versions: 0.9.0
Reporter: Vincent BARAT
I'm using the 0.9.x branch (tested at 2011-09-07).
I have a UDF that takes a bag as input:
{code}
public DataBag exec(Tuple input) throws IOException
{
    /* Get the activity bag */
    DataBag activityBag = (DataBag) input.get(0);
    ...
{code}
My input data is read from a text file 'activity' (the same issue occurs when
it is read from HBase):
{code}
00,1239698069000, <- this is the line that is not correctly handled
01,1239698505000,b
01,1239698369000,a
02,1239698413000,b
02,1239698553000,c
02,1239698313000,a
03,1239698316000,a
03,1239698516000,c
03,1239698416000,b
03,1239698621000,d
04,1239698417000,c
{code}
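For anyone trying to reproduce this, here is a minimal sketch that writes the sample rows above to a local file. The class name ActivityFixture and the default output path 'activity' are assumptions made for illustration, not part of the original setup. The inline "<-" annotation is left out, so the first row ends with an empty third field.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Pig): writes the sample input used in this report.
final class ActivityFixture {

    // Rows copied from the report; the first one has an empty 'name' field.
    static final List<String> ROWS = Arrays.asList(
        "00,1239698069000,",
        "01,1239698505000,b",
        "01,1239698369000,a",
        "02,1239698413000,b",
        "02,1239698553000,c",
        "02,1239698313000,a",
        "03,1239698316000,a",
        "03,1239698516000,c",
        "03,1239698416000,b",
        "03,1239698621000,d",
        "04,1239698417000,c"
    );

    // Writes one row per line, UTF-8 encoded, and returns the target path.
    static Path write(Path target) throws IOException {
        return Files.write(target, ROWS, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Default file name matches the LOAD 'activity' statement in the scripts.
        Path target = Paths.get(args.length > 0 ? args[0] : "activity");
        write(target);
        System.out.println("wrote " + ROWS.size() + " rows to " + target);
    }
}
```

The resulting file can be used as-is in local mode or copied to HDFS before running the scripts below.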
My first script is working correctly:
{code}
activities = LOAD 'activity' USING PigStorage(',') AS (sid:chararray,
timestamp:long, name:chararray);
activities = GROUP activities BY sid;
activities = FOREACH activities GENERATE group, MyUDF(activities.(timestamp,
name));
store activities;
{code}
N.B. the name of the first activity is correctly set to null in my UDF function.
The issue occurs when I store my data in a binary file and reload it before
processing (I do this to improve the computation time, since HDFS is much
faster than HBase).
The second script triggers an error (this script works correctly with Pig
0.8.1):
{code}
activities = LOAD 'activity' USING PigStorage(',') AS (sid:chararray,
timestamp:long, name:chararray);
activities = GROUP activities BY sid;
activities = FOREACH activities GENERATE group, activities.(timestamp, name);
STORE activities INTO 'activities' USING BinStorage;
activities = LOAD 'activities' USING BinStorage AS (sid:chararray,
activities:bag { activity: (timestamp:long, name:chararray) });
activities = FOREACH activities GENERATE sid, MyUDF(activities);
store activities;
{code}
In this script, when MyUDF is called, activityBag is null, and a warning is
issued:
{code}
2011-09-07 15:24:05,365 | WARN | Thread-30 | PigHadoopLogger |
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast:
Unable to interpret value {(1239698069000,)} in field being converted to type
bag, caught ParseException <Cannot convert (1239698069000,) to
null:(timestamp:long,name:chararray)> field discarded
{code}
I suspect the regression is in BinStorage.