Hi,
I've created a Jira for this issue: https://issues.apache.org/jira/browse/PIG-1717

In my case getPartitionKeys always returns the keys that are defined as partitions in the path, so if the user loads from /log/type1/daydate=2010-11-01 the partition key returned is always "daydate".

Currently the following code does not cause the loader to be notified of the partition filter:

A = load 'input' using MyLoader() as (q, p, daydate);
F = FILTER A BY daydate == '2010-11-01';

If Pig could somehow call getPartitionKeys and then be aware that daydate is a partition, all would work well.

Cheers,

From: Thejas M Nair [mailto:[email protected]]
Sent: Thursday, November 11, 2010 3:18 PM
To: [email protected]; Gerrit van Vuuren
Subject: Re: pig LoadMetaData find schema in AS clause from Loader.

Yes, setPartitionFilter can be called only if Pig knows the partition columns. Without knowing the partition columns, the partition filter cannot be extracted.

If a user specifies a schema in the load statement, Pig finds the partition columns by locating the columns returned by getPartitionKeys() in the user-defined schema, based on the mapping from the schema returned by getSchema() to the user-specified schema. That is, Pig assumes that the columns returned by getPartitionKeys() are columns in the schema returned by getSchema().

In your case, does getPartitionKeys return columns that are specified in the user-defined schema?

Yes, please open a Jira, and let's discuss it there. I think at least the javadoc might need to be updated.

-Thejas

On 11/11/10 1:30 AM, "Gerrit Jansen van Vuuren" <[email protected]> wrote:

Hi,

I guess it should only call setPartitionFilter when LoadMetadata:getPartitionKeys returns a non-null value. Currently getPartitionKeys is only called if the Loader returns a schema.

Should I create a Jira and try proposing a fix for this?
Cheers,
Gerrit

-----Original Message-----
From: Alan Gates [mailto:[email protected]]
Sent: Wednesday, November 10, 2010 9:56 PM
To: [email protected]
Subject: Re: pig LoadMetaData find schema in AS clause from Loader.

To answer your direct question: no, there is currently no provision in the interface for Pig to provide the user-defined schema to the load function.

But it seems like the real solution to your problem is that LoadMetadata:setPartitionFilter ought to be called regardless of whether the loader returns a schema. Is there a technical reason we don't do that?

Alan.

On Nov 5, 2010, at 8:13 AM, Gerrit Jansen van Vuuren wrote:

> Hi,
>
> Is there any way in Pig for a LoadFunc to retrieve the schema definition
> entered by the user in the AS clause?
>
> e.g. A = LOAD '$INPUT' USING MyLoader() AS (a:int, b:int);
>
> My question comes from the problem below that I'm facing:
>
> I'm writing a Loader that adds partition fields to the schema, e.g.
> daydate, day, year, month, etc.
> These partitions are used to filter out entire folders in the storage
> location.
> I want to use the FILTER statement to filter by these keys.
>
> If I create a Loader that returns its own schema, the following works and
> the LoadMetadata:setPartitionFilter method gets called correctly by Pig.
>
> e.g.
>
> -- the loader will parse this and also add the partition folder daydate
> A = LOAD '$INPUT' using MyLoader('a:int, b:int');
> F = FILTER A BY daydate == '2010-11-01';
> STORE F INTO '$OUTPUT';
>
> But if the Loader does not return a schema and the schema is defined by
> the user in the AS clause, Pig never calls LoadMetadata:setPartitionFilter
> at all and the partition filtering never happens.
>
> e.g.
>
> A = LOAD '$INPUT' AS (a:int, b:int, daydate:chararray);
> F = FILTER A BY daydate == '2010-11-01';
> STORE F INTO '$OUTPUT';
>
> Any suggestions?
>
> Thanks,
> Gerrit
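
To make the behaviour requested in this thread concrete, here is a minimal sketch in plain Java with no Pig dependency. It assumes the partition keys reported by the loader are matched by name against the fields of the user's AS clause; that matching rule is an assumption for illustration and is not how Pig resolves partition columns according to the reply above. The class name PartitionKeySketch and the method partitionColumnPositions are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch only; this is not Pig's actual implementation. */
public class PartitionKeySketch {

    /**
     * Returns the position of each partition key within the user-defined
     * schema, matching by field name (an assumption made for this sketch).
     * Returns an empty list when the loader reports no keys, or when any key
     * is missing from the schema, since no partition filter could be
     * extracted in either case.
     */
    public static List<Integer> partitionColumnPositions(List<String> userSchemaFields,
                                                         String[] partitionKeys) {
        List<Integer> positions = new ArrayList<>();
        if (partitionKeys == null) {
            return positions; // loader has no partition keys
        }
        for (String key : partitionKeys) {
            int index = userSchemaFields.indexOf(key);
            if (index < 0) {
                return new ArrayList<>(); // key not present in the AS clause
            }
            positions.add(index);
        }
        return positions;
    }

    public static void main(String[] args) {
        // Fields from: A = load 'input' using MyLoader() as (q, p, daydate);
        List<String> asClause = Arrays.asList("q", "p", "daydate");
        // What the loader's getPartitionKeys would report in this thread's example.
        String[] keys = {"daydate"};
        System.out.println(partitionColumnPositions(asClause, keys));
    }
}
```

With the example from the thread, daydate is found at position 2 of the AS clause, so a filter on daydate could in principle be handed to the loader even though the loader itself returned no schema.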
