Issue reported:

https://issues.apache.org/jira/browse/PIG-2271

On 07/09/11 20:52, Kevin Burton wrote:
I believe that everything is a byte array at first, but I may be wrong… at
least this has been the situation in my experiments.

It is best to always specify a schema, though, unless you're using Zebra, which
stores the schema directly (which is very handy, btw).

You could also try InterStorage (which you can use directly via its full
class name), as it is more efficient, if I recall correctly.
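
Roughly, a sketch of what I mean (untested, and I'm going from memory on the
full class name; 'mydata_inter' is just a placeholder path), reusing the
aliases from your message below:

-- store with InterStorage (referenced by its full class name), then reload with a schema
STORE referrer_stats_by_site INTO 'mydata_inter'
    USING org.apache.pig.impl.io.InterStorage();
referrers = LOAD 'mydata_inter'
    USING org.apache.pig.impl.io.InterStorage()
    AS (site:chararray,
        referrerdomainlist:bag{t:tuple(referrerdomain:chararray, lcnt:long, tcnt:long,
            referrerurllist:bag{t1:tuple(referrerurl:chararray, lcnt:long, tcnt:long)})});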

While it would probably be nice for you to submit a bug, and of course you
can wait until it is fixed, it's probably faster for you to just work around
it…

Kevin

On Wed, Sep 7, 2011 at 11:47 AM, Corbin Hoenes <cor...@tynt.com> wrote:

Hi there,

I think we might be seeing something related to this problem and can
confirm
it's in BinStorage for us.

We stored referrer_stats_by_site using BinStorage.  Here is a describe of the alias:

referrer_stats_by_site: {site: chararray, {(referrerdomain: chararray, lcnt: long,
    tcnt: long, {(referrer: chararray, lcnt: long, tcnt: long)})}}

Now we try to load that data:
referrers = LOAD 'mydata' USING BinStorage()
    AS (site:chararray,
        referrerdomainlist:bag{t:tuple(referrerdomain:chararray, lcnt:long, tcnt:long,
            referrerurllist:bag{t1:tuple(referrerurl:chararray, lcnt:long, tcnt:long)})});

but when we do, we cannot find a certain 'site'.

When we don't provide the schema:

referrers = LOAD 'mydata' USING BinStorage();

it will load, but referrerdomain is a bytearray instead of a chararray. Is Pig
supposed to automatically cast this to a chararray for me? Is there any
reason why this data won't load unless we change the type to bytearray?
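
For now we just load without a schema; something like the following (a sketch
only, the positional references and the explicit cast are untested assumptions
on our side) might be a way to get site back as a chararray while leaving the
bag untyped:

-- load without a schema, then cast only the scalar field
referrers = LOAD 'mydata' USING BinStorage();
referrers = FOREACH referrers GENERATE (chararray) $0 AS site, $1 AS referrerdomainlist;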


On Wed, Sep 7, 2011 at 9:15 AM, Ashutosh Chauhan <hashut...@apache.org> wrote:
Vincent,

Thanks for your hard work in isolating the bug. It's a perfect bug report.
Seems like it's a regression. Can you please open a JIRA with test data and
a script (which works in 0.8.1 and fails in 0.9)?

Ashutosh

On Wed, Sep 7, 2011 at 07:17, Vincent Barat <vincent.ba...@gmail.com> wrote:

Hi,

I really need your help on this one! I've worked hard to isolate the
regression.
I'm using the 0.9.x branch (tested as of 2011-09-07).

I have a UDF function that takes a bag as input:

public DataBag exec(Tuple input) throws IOException
{
    /* Get the activity bag */
    DataBag activityBag = (DataBag) input.get(2);
    …

My input data are read from a text file 'activity' (the same issue occurs when
they are read from HBase):
00,1239698069000,    <- this is the line that is not correctly handled
01,1239698505000,b
01,1239698369000,a
02,1239698413000,b
02,1239698553000,c
02,1239698313000,a
03,1239698316000,a
03,1239698516000,c
03,1239698416000,b
03,1239698621000,d
04,1239698417000,c

My first script is working correctly:

activities = LOAD 'activity' USING PigStorage(',')
    AS (sid:chararray, timestamp:long, name:chararray);
activities = GROUP activities BY sid;
activities = FOREACH activities GENERATE group, MyUDF(activities.(timestamp, name));
store activities;

N.B. the name of the first activity is correctly set to null in my UDF
function.

The issue occurs when I store my data into a binary file and reload it
before processing (I do this to improve the computation time, since HDFS is
much faster than HBase).

Second script, which triggers an error (this script works correctly with
PIG 0.8.1):

activities = LOAD 'activity' USING PigStorage(',')
    AS (sid:chararray, timestamp:long, name:chararray);
activities = GROUP activities BY sid;
activities = FOREACH activities GENERATE group, activities.(timestamp, name);
STORE activities INTO 'activities' USING BinStorage;
activities = LOAD 'activities' USING BinStorage
    AS (sid:chararray, activities:bag { activity: (timestamp:long, name:chararray) });
activities = FOREACH activities GENERATE sid, MyUDF(activities);
store activities;

In this script, when MyUDF is called, activityBag is null, and a warning
is issued:

2011-09-07 15:24:05,365 | WARN | Thread-30 | PigHadoopLogger |
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast:
Unable to interpret value {(1239698069000,)} in field being converted to
type bag, caught ParseException <Cannot convert (1239698069000,) to
null:(timestamp:long,name:chararray)> field discarded

I guess that the regression is located in BinStorage.
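
In the meantime, one workaround sketch (untested) could be to drop the AS
clause on the reload, so that no cast to a bag is attempted, and reference
the fields by position instead:

-- reload without a schema; BinStorage keeps the stored types, so $1 should still be a bag
activities = LOAD 'activities' USING BinStorage;
activities = FOREACH activities GENERATE $0 AS sid, MyUDF($1);
store activities;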

On 30/08/11 19:13, Daniel Dai wrote:

Interesting, the log message seems to be clear, "Cannot convert
(1239698069000,) to null:(timestamp:long,name:chararray)", but I cannot
find an explanation for it. I verified that such a conversion should be
valid on 0.9. Can you show me the script?

Daniel

On Tue, Aug 30, 2011 at 5:14 AM, Vincent Barat <vincent.ba...@gmail.com> wrote:

Hi,

I have experienced the same issue by loading the data from raw text files
(using the PIG server in local mode and the regular PIG loader) and from
HBaseStorage.
The issue is exactly the same in both cases: each time a NULL string is
encountered, the cast to a data bag cannot be done.

On 29/08/11 19:12, Dmitriy Ryaboy wrote:

How are you loading this data?

D

On Mon, Aug 29, 2011 at 8:05 AM, Vincent Barat <vincent.ba...@gmail.com> wrote:

  I'm currently testing the PIG 0.9.x branch.
Several of my jobs that used to work correctly with PIG 0.8.1 now fail
due to a cast error returning a null pointer in one of my UDF functions.

Apparently, PIG seems to be unable to convert some data to a bag
when
some
of the tuple fields are null:

2011-08-29 15:50:48,720 | WARN | Thread-270 | PigHadoopLogger |
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast:
Unable to interpret value {(1239698069000,)} in field being converted to
type bag, caught ParseException <Cannot convert (1239698069000,) to
null:(timestamp:long,name:chararray)> field discarded
2011-08-29 15:50:48,729 | WARN | Thread-270 | LocalJobRunner |
job_local_0019

My UDF function is:

  /**
   * ...
   * @param start start of the session (in milliseconds since epoch)
   * @param end end of the session (in milliseconds since epoch)
   * @param activities a bag containing a set of activities in the form of a
   *          set of (timestamp:long, name:chararray) tuples
   * ...
   */
  public DataBag exec(Tuple input) throws IOException
  {
    /* Get session's start/end timestamps */
    long startSession = (Long) input.get(0);
    long endSession = (Long) input.get(1);

    /* Get the activity bag */
    DataBag activityBag = (DataBag) input.get(2);
                                    ^  here


Is that a regression? Any idea how to fix this?

Thanks a lot, I really need to jump to PIG 0.9.1 and 0.10.0 :-)



