[
https://issues.apache.org/jira/browse/PIG-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13101714#comment-13101714
]
Daniel Dai commented on PIG-2271:
---------------------------------
Can you try the following:
1. Get the output schema for MyUDF (describe activities).
2. Use a different constructor for BinStorage, passing the caster explicitly:
BinStorage('org.apache.pig.builtin.Utf8StorageConverter')
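For reference, the two steps might look like this when applied to the reload part of the second script. This is only a sketch: it reuses the relation and field names from the report and has not been run against the 0.9.x branch. Note that Pig Latin takes the caster class name in single quotes.

{code}
-- Step 2: reload with an explicit caster passed to BinStorage
activities = LOAD 'activities'
    USING BinStorage('org.apache.pig.builtin.Utf8StorageConverter')
    AS (sid:chararray,
        activities:bag { activity: (timestamp:long, name:chararray) });
activities = FOREACH activities GENERATE sid, MyUDF(activities);

-- Step 1: print the schema that Pig infers for the MyUDF output
DESCRIBE activities;
{code}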
> PIG regression (in BinStorage?) between 0.8.1 and 0.9.x
> -------------------------------------------------------
>
> Key: PIG-2271
> URL: https://issues.apache.org/jira/browse/PIG-2271
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.9.0
> Reporter: Vincent BARAT
>
> I'm using the 0.9.x branch (tested at 2011-09-07).
> I have a UDF that takes a bag as input:
> {code}
> public DataBag exec(Tuple input) throws IOException
> {
> /* Get the activity bag */
> DataBag activityBag = (DataBag) input.get(0);
> ...
> {code}
> My input data are read from a text file 'activity' (same issue when they are
> read from HBase):
> {code}
> 00,1239698069000, <- this is the line that is not correctly handled
> 01,1239698505000,b
> 01,1239698369000,a
> 02,1239698413000,b
> 02,1239698553000,c
> 02,1239698313000,a
> 03,1239698316000,a
> 03,1239698516000,c
> 03,1239698416000,b
> 03,1239698621000,d
> 04,1239698417000,c
> {code}
> My first script is working correctly:
> {code}
> activities = LOAD 'activity' USING PigStorage(',') AS (sid:chararray,
> timestamp:long, name:chararray);
> activities = GROUP activities BY sid;
> activities = FOREACH activities GENERATE group, MyUDF(activities.(timestamp,
> name));
> store activities;
> {code}
> N.B. the name of the first activity is correctly set to null in my UDF
> function.
> The issue occurs when I store my data into a binary file and reload them
> before processing (I do this to improve the computation time, since HDFS is
> much faster than HBase).
> The second script triggers an error (this script works correctly with PIG
> 0.8.1):
> {code}
> activities = LOAD 'activity' USING PigStorage(',') AS (sid:chararray,
> timestamp:long, name:chararray);
> activities = GROUP activities BY sid;
> activities = FOREACH activities GENERATE group, activities.(timestamp, name);
> STORE activities INTO 'activities' USING BinStorage;
> activities = LOAD 'activities' USING BinStorage AS (sid:chararray,
> activities:bag { activity: (timestamp:long, name:chararray) });
> activities = FOREACH activities GENERATE sid, MyUDF(activities);
> store activities;
> {code}
> In this script, when MyUDF is called, activityBag is null, and a warning is
> issued:
> {code}
> 2011-09-07 15:24:05,365 | WARN | Thread-30 | PigHadoopLogger |
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast:
> Unable to interpret value {(1239698069000,)} in field being converted to
> type bag, caught ParseException <Cannot convert (1239698069000,) to
> null:(timestamp:long,name:chararray)> field discarded
> {code}
> I guess that the regression is located in BinStorage...