Re: Schema for STRSPLIT output

Andrew Musselman Wed, 25 Jun 2014 20:39:06 -0700

I think you could specify a comma as the delimiter in your load statement:

x = load 'file.txt' using PigStorage(',');


If needed, you could also specify the schema at load time by following the 
PigStorage call with "as (a:chararray, b:chararray, ..., n:chararray)".
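
As a rough sketch of what that might look like, assuming no line ever has 
more than five fields (the width of five and the field names a through e are 
my assumption, taken from the sample rows). As far as I know, Pig fills the 
missing trailing fields of shorter rows with nulls when a schema is declared, 
but please verify that on your version. TRIM is there only because the sample 
data has a space after each comma.

```pig
-- Assumption: at most five comma-separated fields per line.
-- Shorter rows should come back with nulls in the trailing fields.
x = LOAD 'file.txt' USING PigStorage(',')
    AS (a:chararray, b:chararray, c:chararray, d:chararray, e:chararray);

-- Strip the leading space left over from the ", " separators.
y = FOREACH x GENERATE TRIM(a) AS a, TRIM(b) AS b, TRIM(c) AS c,
                       TRIM(d) AS d, TRIM(e) AS e;

-- Print the resulting schema to confirm every field is a chararray.
DESCRIBE y;
```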

But if you don't know the maximum number of fields, that would be tough, since 
the schema has to name each one.  This is an unusual input format from what 
I've seen.

> On Jun 25, 2014, at 7:58 PM, Ashish Jain <[email protected]> wrote:
> 
> Hello,
> 
> Two lines from my data look as follows -
> a, b, c, d, e
> a, b, c
> 
> I load the contents of my file and split each line based on ','.
> x = LOAD 'file.txt' as content:chararray;
> y = FOREACH x GENERATE STRSPLIT(content, ',') as tuple();
> 
> So my question is -
> 1) Is there any way I can specify the schema of y to be a tuple of a
> variable number of chararrays? Something along the lines of y = FOREACH x
> GENERATE STRSPLIT(content, ',') as tuple(chararray(*))
> 2) If I try to do the above in a UDF, how do I create an output schema that
> depends on the input? From my experiments, outputSchema() is called before
> exec() so I can't specify the number of fields in my output schema.
> 
> The reason I am trying to do this is that, once I get 'y', I want to write
> it to elasticsearch. The hadoop-elasticsearch plugin has a direct mapping
> from chararray(pig)<->string(elasticsearch).
> 
> Thanks
> Ashish
