Re: Execution of outputSchema

Jonathan Coveney Fri, 13 Apr 2012 08:59:50 -0700

Raj,

If you serialize the inputSchema, getting the outputSchema is as easy as
just running outputSchema on it. Either way, I looked into it and
getInputSchema is a feature in trunk; it's not even in 0.10. So yeah, if this
is something you need to do, you're going to have to cook up another solution.
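
As a rough illustration of the "serialize the input, rerun outputSchema" idea, here is a plain-Java sketch with no Pig classes. The list of aliases stands in for the input Schema and the file name stands in for the resulting output schema; the class and method names are made up for the example.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java model only: aliases stand in for Pig's Schema,
// and the returned file name stands in for the output schema.
class RecomputeOutputSchema {
    // A deterministic function of the input, like outputSchema(Schema input).
    static String schemaFileFor(List<String> aliases) {
        if (aliases.contains("sales")) return "sales.xml";
        if (aliases.contains("others")) return "others.xml";
        return null; // no known alias in the input
    }

    // Front end: flatten the input aliases into a string that can travel with the job.
    static String serialize(List<String> aliases) {
        return String.join(",", aliases);
    }

    // Back end: rebuild the aliases from the shipped string.
    static List<String> deserialize(String shipped) {
        return Arrays.asList(shipped.split(","));
    }

    public static void main(String[] args) {
        String shipped = serialize(Arrays.asList("id", "sales"));
        // Rerunning the same function on the deserialized input gives the same answer.
        System.out.println(schemaFileFor(deserialize(shipped)));
    }
}
```

The point is only that the output is a function of the input, so shipping the input is enough to recompute it wherever it is needed.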

First, I would see if this patch can be backported to 0.10:
https://issues.apache.org/jira/browse/PIG-2337
If not, then you can leverage the work they did to make a unique signature.

Be wary of using the UDFContext... the name is misleading. It is actually
shared between UDFs, and isn't a safe place to put things (without jumping
through hoops). Another issue that you have to contend with is multiple
instances of your UDF, especially multiple instances with different input.
Even if you push the data to Hadoop or the distributed cache or anywhere,
if you have 3 instances of the same UDF with different input schemas (and
thus potentially different output Schemas), how do you know which instances
of the UDF on the backend should grab which xml files?
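
A minimal sketch of why a unique per-instance signature helps, again in plain Java and not Pig's actual API: the shared store below is a stand-in for a job-wide context, and the signature strings are hypothetical. Each instance reads and writes only the bucket under its own signature, so three instances of the same UDF class do not overwrite each other.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Plain-Java stand-in for a shared, job-wide context; NOT a Pig class.
class SignatureKeyedContext {
    private final Map<String, Properties> bySignature = new HashMap<>();

    // Each UDF instance gets its own Properties bucket, keyed by its signature.
    Properties propertiesFor(String signature) {
        return bySignature.computeIfAbsent(signature, k -> new Properties());
    }

    public static void main(String[] args) {
        SignatureKeyedContext ctx = new SignatureKeyedContext();
        // Hypothetical signatures for two instances of the same UDF class.
        ctx.propertiesFor("udf-instance-1").setProperty("schema.file", "sales.xml");
        ctx.propertiesFor("udf-instance-2").setProperty("schema.file", "others.xml");
        // Each instance looks up only its own entry.
        System.out.println(ctx.propertiesFor("udf-instance-1").getProperty("schema.file"));
        System.out.println(ctx.propertiesFor("udf-instance-2").getProperty("schema.file"));
    }
}
```

Keying by the UDF class alone would collapse both entries into one, which is exactly the collision described above.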

Lastly, why do you need this information on the backend? There may be
another way to do what you're trying to do.

2012/4/13 Rajgopal Vaithiyanathan <raja.f...@gmail.com>

> Thanks Jonathan,
>
>
> But the question is not about serializing the input schema. However, I'm using
> 0.9.2 and I don't see getInputSchema in EvalFunc. Please tell me how to use
> it. Right now, I'm serializing it using UDFContext.
>
> The question was:
> I've implemented the outputSchema this way:
>
>
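>    @Override
>    public Schema outputSchema(Schema input) {
>
>        // Pick the XML definition based on the aliases in the input schema.
>        if (input.getAliases().contains("sales")) {
>            return generateOutputSchemaFrom("sales.xml");
>        }
>
>        else if (input.getAliases().contains("others")) {
>            return generateOutputSchemaFrom("others.xml");
>        }
>
>        // No matching alias: every path must return, so report no schema.
>        return null;
>    }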
>
> The question was: where should I place sales.xml and others.xml?
>
>
>
> On Fri, Apr 13, 2012 at 2:08 PM, Jonathan Coveney <jcove...@gmail.com> wrote:
>
> > Raj,
> >
> > The outputSchema is executed on the front end[1] (and beware: it can be
> > called many times, and beyond that, UDFs are instantiated many times on
> > the front end).
> >
> > What is your goal with serializing the output schema to XML? What are you
> > trying to do? I should also mention that EvalFunc now has
> > "getInputSchema()," as it serializes the input schema for you... but
> > yeah, some context around what you want to do is key.
> >
> > [1] front end meaning the client side where the script is parsed and the
> > job jar created
> >
> > 2012/4/13 Rajgopal Vaithiyanathan <raja.f...@gmail.com>
> >
> > > Where will the outputSchema be executed? In the client or as a
> > > MapReduce job?
> > >
> > > I've planned to keep the output schema as an XML file and let the
> > > outputSchema method read it and generate the Schema object from the XML.
> > >
> > > Where should I place this XML file? Client or HDFS?
> > >
> > > :)
> > > Raj
> > >
> >
>
>
>
> --
> Thanks and Regards,
> Rajgopal Vaithiyanathan.
>
