I believe everything is a bytearray at first, but I may be wrong… at
least that has been the case in my experiments.

It is best to always specify the schema, though, unless you're using Zebra,
which stores the schema directly (which is very handy, btw).

You could also try InterStorage (which you can use directly via the full
classname) as it is more efficient if I recall correctly.
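
A rough sketch of what I mean, reusing Vincent's script from further down the
thread. I'm assuming the class is org.apache.pig.impl.io.InterStorage and the
aliases (grouped, bags, reloaded, acts) are just made up for illustration. I
have not checked that this avoids the cast problem on 0.9, so treat it as
something to try rather than a confirmed fix:

```pig
-- Sketch only: assumes InterStorage lives at org.apache.pig.impl.io.InterStorage.
-- It is Pig's internal intermediate format, so it may change between releases.
activities = LOAD 'activity' USING PigStorage(',')
             AS (sid:chararray, timestamp:long, name:chararray);
grouped = GROUP activities BY sid;
bags = FOREACH grouped GENERATE group AS sid, activities.(timestamp, name) AS acts;
STORE bags INTO 'activities' USING org.apache.pig.impl.io.InterStorage();

-- In a later script: reload and inspect what comes back before relying on it.
reloaded = LOAD 'activities' USING org.apache.pig.impl.io.InterStorage();
DESCRIBE reloaded;
```

If the reload looks right, you can then add the AS (...) schema back and see
whether the UDF still receives a null bag.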

While it probably would be nice for you to submit a bug and of course you
can wait until it is fixed, it's probably faster for you to just work around
it…

Kevin

On Wed, Sep 7, 2011 at 11:47 AM, Corbin Hoenes <cor...@tynt.com> wrote:

> Hi there,
>
> I think we might be seeing something related to this problem and can confirm
> it's in BinStorage for us.
>
> We stored referrer_stats_by_site using BinStorage.  Here is a describe of
> the alias:
> referrer_stats_by_site: {site: chararray,{(referrerdomain: chararray,lcnt:
> long,tcnt: long,{(referrer: chararray,lcnt: long,tcnt: long)})}}
>
> Now we try to load that data:
> referrers = LOAD 'mydata' USING BinStorage() AS (site:chararray,
> referrerdomainlist:bag{t:tuple(referrerdomain:chararray, lcnt:long, tcnt:long,
> referrerurllist:bag{t1:tuple(referrerurl:chararray, lcnt:long, tcnt:long)})});
>
> but when we do we cannot find a certain 'site'.
>
> When we don't provide the schema:
> referrers = LOAD 'mydata' USING BinStorage();
>
> It will load but referrerdomain is a bytearray instead of chararray.  Is Pig
> supposed to automatically cast this to a chararray for me?  Is there any
> reason why this data won't load unless we change the type to bytearray?
>
>
> On Wed, Sep 7, 2011 at 9:15 AM, Ashutosh Chauhan <hashut...@apache.org>
> wrote:
>
> > Vincent,
> >
> > Thanks for your hard work in isolating the bug. It's a perfect bug report.
> > Seems like it's a regression. Can you please open a JIRA with test data and
> > script (which works in 0.8.1 and fails in 0.9)?
> >
> > Ashutosh
> >
> > On Wed, Sep 7, 2011 at 07:17, Vincent Barat <vincent.ba...@gmail.com>
> > wrote:
> >
> > > Hi,
> > >
> > > I really need your help on this one! I've worked hard to isolate the
> > > regression.
> > > I'm using the 0.9.x branch (tested at 2011-09-07).
> > >
> > > > I have a UDF that takes a bag as input:
> > >
> > > public DataBag exec(Tuple input) throws IOException
> > > {
> > > /* Get the activity bag */
> > > DataBag activityBag = (DataBag) input.get(2);
> > > …
> > >
> > > > My input data are read from a text file 'activity' (same issue when they
> > > > are read from HBase):
> > > 00,1239698069000, <- this is the line that is not correctly handled
> > > 01,1239698505000,b
> > > 01,1239698369000,a
> > > 02,1239698413000,b
> > > 02,1239698553000,c
> > > 02,1239698313000,a
> > > 03,1239698316000,a
> > > 03,1239698516000,c
> > > 03,1239698416000,b
> > > 03,1239698621000,d
> > > 04,1239698417000,c
> > >
> > > My first script is working correctly:
> > >
> > > activities = LOAD 'activity' USING PigStorage(',') AS (sid:chararray,
> > > timestamp:long, name:chararray);
> > > activities = GROUP activities BY sid;
> > > activities = FOREACH activities GENERATE group,
> > > MyUDF(activities.(timestamp, name));
> > > store activities;
> > >
> > > N.B. the name of the first activity is correctly set to null in my UDF
> > > function.
> > >
> > > > The issue occurs when I store my data into a binary file and reload them
> > > > before processing (I do this to improve the computation time, since HDFS
> > > > is much faster than HBase).
> > >
> > > > Second script that triggers an error (this script works correctly with
> > > > PIG 0.8.1):
> > >
> > > activities = LOAD 'activity' USING PigStorage(',') AS (sid:chararray,
> > > timestamp:long, name:chararray);
> > > activities = GROUP activities BY sid;
> > > activities = FOREACH activities GENERATE group, activities.(timestamp,
> > > name);
> > > STORE activities INTO 'activities' USING BinStorage;
> > > activities = LOAD 'activities' USING BinStorage AS (sid:chararray,
> > > activities:bag { activity: (timestamp:long, name:chararray) });
> > > activities = FOREACH activities GENERATE sid, MyUDF(activities);
> > > store activities;
> > >
> > > > In this script, when MyUDF is called, activityBag is null, and a warning
> > > > is issued:
> > >
> > > 2011-09-07 15:24:05,365 | WARN | Thread-30 | PigHadoopLogger |
> > >
> >
> org.apache.pig.backend.hadoop.**executionengine.physicalLayer.**expressionOperators.POCast:
> > > Unable to interpret value {(1239698069000,)} in field being converted
> to
> > > type bag, caught ParseException <Cannot convert (1239698069000,) to
> > > null:(timestamp:long,name:**chararray)> field discarded
> > >
> > > > I guess that the regression is located in BinStorage.
> > >
> > > Le 30/08/11 19:13, Daniel Dai a écrit :
> > >
> > >> Interesting, the log message seems to be clear, "Cannot convert
> > > >> (1239698069000,) to null:(timestamp:long,name:chararray)", but I
> > >> cannot find an explanation to that. I verified such conversion should
> > >> be valid on 0.9. Can you show me the script?
> > >>
> > >> Daniel
> > >>
> > > >> On Tue, Aug 30, 2011 at 5:14 AM, Vincent Barat <vincent.ba...@gmail.com>
> > > >> wrote:
> > >>
> > >>> Hi,
> > >>>
> > > >>> I have experienced the same issue by loading the data from raw text
> > > >>> files (using PIG server in local mode and the regular PIG loader) and
> > > >>> from HBaseStorage.
> > > >>> The issue is exactly the same in both cases: each time a NULL string is
> > > >>> encountered, the cast to a data bag cannot be done.
> > >>>
> > >>> Le 29/08/11 19:12, Dmitriy Ryaboy a écrit :
> > >>>
> > >>>> How are you loading this data?
> > >>>>
> > >>>> D
> > >>>>
> > > >>>> On Mon, Aug 29, 2011 at 8:05 AM, Vincent Barat <vincent.ba...@gmail.com>
> > > >>>> wrote:
> > >>>>
> > > >>>>> I'm currently testing the PIG 0.9.x branch.
> > > >>>>> Several of my jobs that used to work correctly with PIG 0.8.1 now fail
> > > >>>>> due to a cast error returning a null pointer in one of my UDFs.
> > > >>>>>
> > > >>>>> Apparently, PIG seems to be unable to convert some data to a bag when
> > > >>>>> some of the tuple fields are null:
> > >>>>>
> > >>>>> 2011-08-29 15:50:48,720 | WARN | Thread-270 | PigHadoopLogger |
> > >>>>>
> > >>>>>
> > org.apache.pig.backend.hadoop.****executionengine.**physicalLayer.****
> > >>>>> expressionOperators.POCast:
> > >>>>> Unable to interpret value {(1239698069000,)} in field being
> converted
> > >>>>> to
> > >>>>> type bag, caught ParseException<Cannot convert (1239698069000,) to
> > >>>>> null:(timestamp:long,name:****chararray)>   field discarded
> > >>>>> 2011-08-29 15:50:48,729 | WARN | Thread-270 | LocalJobRunner |
> > >>>>> job_local_0019
> > >>>>>
> > > >>>>> My UDF function is:
> > > >>>>>
> > > >>>>>  /**
> > > >>>>>   *...
> > > >>>>>   * @param start start of the session (in milliseconds since epoch)
> > > >>>>>   * @param end end of the session (in milliseconds since epoch)
> > > >>>>>   * @param activities a bag containing a set of activities in the
> > > >>>>>   *          form of a set of (timestamp:long, name:chararray) tuples
> > > >>>>>   * ...
> > > >>>>>   */
> > >>>>>  public DataBag exec(Tuple input) throws IOException
> > >>>>>  {
> > >>>>>    /* Get session's start/end timestamps */
> > >>>>>    long startSession = (Long) input.get(0);
> > >>>>>    long endSession = (Long) input.get(1);
> > >>>>>
> > >>>>>    /* Get the activity bag */
> > >>>>>    DataBag activityBag = (DataBag) input.get(2);
> > >>>>>
> > >>>>>                                     ^  here
> > >>>>>
> > >>>>>
> > > >>>>> Is that a regression? Any idea how to fix this?
> > >>>>>
> > >>>>> Thanks a lot, I really need to jump to PIG 0.9.1 and 0.10.0 :-)
> > >>>>>
> > >>>>>
> >
>



-- 

Founder/CEO Spinn3r.com

Location: *San Francisco, CA*
Skype: *burtonator*

Skype-in: *(415) 871-0687*
