[
https://issues.apache.org/jira/browse/PIG-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vincent BARAT updated PIG-2271:
-------------------------------
Description:
I'm using the 0.9.1 official release.
My input data are read from a text file 'activity' (provided as attachment):
{code}
00,1239698069000, <- this is the line that is not correctly handled
01,1239698505000,b
01,1239698369000,a
02,1239698413000,b
02,1239698553000,c
02,1239698313000,a
03,1239698316000,a
03,1239698516000,c
03,1239698416000,b
03,1239698621000,d
04,1239698417000,c
{code}
My first script is working correctly:
{code}
-- load input data
activities = LOAD 'activity' USING PigStorage(',') AS (sid:chararray,
timestamp:long, name:chararray);
-- group input data
activities = GROUP activities BY sid;
activities = FOREACH activities GENERATE group, activities.(timestamp, name);
-- store grouped activities in a temporary file
STORE activities INTO 'tmp1' USING PigStorage();
-- reload grouped activities from the temporary file
activities = LOAD 'tmp1' USING PigStorage() AS (sid:chararray, acts:bag {
act:tuple (timestamp:long, name:chararray) });
-- store grouped activities again in another temporary file
STORE activities INTO 'tmp2' USING PigStorage();
{code}
The issue occurs when I use BinStorage() or PigStorage(',') instead of
PigStorage() to store / reload my temporary files.
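One possible contributor to the PigStorage(',') case, offered as an assumption rather than a diagnosis: the text form of a bag, such as {(1239698069000,)}, itself contains commas, so a loader that splits each line on every ',' would cut the bag into pieces. The sketch below is plain Java with no Pig dependency. The class DelimiterCollisionDemo and its naive split are hypothetical; they do not reproduce PigStorage's actual parsing, and they say nothing about the BinStorage case.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only: not Pig code.
final class DelimiterCollisionDemo {

    /** Splits a stored line on every occurrence of the delimiter, keeping empty fields. */
    static List<String> naiveSplit(String line, char delimiter) {
        // Pattern.quote makes the delimiter literal; limit -1 keeps trailing empties.
        String regex = Pattern.quote(String.valueOf(delimiter));
        return Arrays.asList(line.split(regex, -1));
    }

    public static void main(String[] args) {
        String bag = "{(1239698069000,)}";
        // Tab-delimited line: the bag stays in one piece (2 fields).
        System.out.println(naiveSplit("00\t" + bag, '\t').size());
        // Comma-delimited line: the comma inside the bag splits it (3 fields).
        System.out.println(naiveSplit("00," + bag, ',').size());
    }
}
```

Under this assumption, a tab-delimited temporary file keeps sid and the bag as two fields, while a comma-delimited one yields three fragments, none of which is a well-formed bag.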
was:
I'm using the 0.9.1 official release.
I have a UDF that takes a bag as input:
{code}
public DataBag exec(Tuple input) throws IOException
{
/* Get the activity bag */
DataBag activityBag = (DataBag) input.get(0);
...
{code}
My input data are read from a text file 'activity' (same issue when they are
read from HBase):
{code}
00,1239698069000, <- this is the line that is not correctly handled
01,1239698505000,b
01,1239698369000,a
02,1239698413000,b
02,1239698553000,c
02,1239698313000,a
03,1239698316000,a
03,1239698516000,c
03,1239698416000,b
03,1239698621000,d
04,1239698417000,c
{code}
My first script is working correctly:
{code}
activities = LOAD 'activity' USING PigStorage(',') AS (sid:chararray,
timestamp:long, name:chararray);
activities = GROUP activities BY sid;
activities = FOREACH activities GENERATE group, MyUDF(activities.(timestamp,
name));
store activities into 'output';
{code}
N.B. the name of the first activity is correctly set to null in my UDF function.
The issue occurs when I store my data in a binary file and reload them before
processing (I do this to improve the computation time, since HDFS is much
faster than HBase).
Second script, which triggers an error (this script works correctly with Pig
0.8.1):
{code}
activities = LOAD 'activity' USING PigStorage(',') AS (sid:chararray,
timestamp:long, name:chararray);
activities = GROUP activities BY sid;
activities = FOREACH activities GENERATE group, activities.(timestamp, name);
STORE activities INTO 'activities' USING BinStorage;
activities = LOAD 'activities' USING BinStorage AS (sid:chararray,
activities:bag { activity: (timestamp:long, name:chararray) });
activities = FOREACH activities GENERATE sid, MyUDF(activities);
store activities into 'output';
{code}
In this script, when MyUDF is called, activityBag is null, and a warning is
issued:
{code}
2011-09-07 15:24:05,365 | WARN | Thread-30 | PigHadoopLogger |
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast:
Unable to interpret value {(1239698069000,)} in field being converted to type
bag, caught ParseException <Cannot convert (1239698069000,) to
null:(timestamp:long,name:chararray)> field discarded
{code}
I guess that the regression is located in BinStorage...
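The tuple in the warning, (1239698069000,), has an empty last field, which the declared schema expects to become a null chararray. As a sketch of why such a value is easy to mishandle (plain Java, not Pig's converter code; TrailingFieldDemo and its methods are hypothetical names), note that Java's String.split drops trailing empty strings by default, so a parser built on it would see one field where the schema declares two.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: not Pig code.
final class TrailingFieldDemo {

    /** Strips the surrounding parentheses from a tuple's text form. */
    static String body(String tupleText) {
        return tupleText.substring(1, tupleText.length() - 1);
    }

    /** Counts fields with the default split, which discards trailing empty strings. */
    static int fieldCountDefault(String tupleText) {
        return body(tupleText).split(",").length;
    }

    /** Splits with limit -1 so trailing empties survive, then maps empty fields to null. */
    static List<String> fieldsWithNulls(String tupleText) {
        List<String> fields = new ArrayList<>();
        for (String field : body(tupleText).split(",", -1)) {
            fields.add(field.isEmpty() ? null : field);
        }
        return fields;
    }

    public static void main(String[] args) {
        String tuple = "(1239698069000,)";
        System.out.println(fieldCountDefault(tuple)); // the empty name field is lost
        System.out.println(fieldsWithNulls(tuple));   // the empty name field becomes null
    }
}
```

Whether the 0.9.x conversion path fails for this reason is not established here; the sketch only shows that the arity of (1239698069000,) depends on how trailing empty fields are treated.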
Priority: Critical (was: Major)
Summary: PIG regression BinStorage/PigStorage between 0.8.1 and 0.9.x
(was: PIG regression (in BinStorage?) between 0.8.1 and 0.9.x)
> PIG regression BinStorage/PigStorage between 0.8.1 and 0.9.x
> ------------------------------------------------------------
>
> Key: PIG-2271
> URL: https://issues.apache.org/jira/browse/PIG-2271
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.9.1
> Reporter: Vincent BARAT
> Priority: Critical
> Attachments: activity