The issue you are going to have is that you will only want one of your
reducers, the one writing part-0000, to put out the header line.
Otherwise it will get repeated at various points through your data.
I'm not sure whether the necessary info is available to the store
function to decide whether it is writing part-0000 or not. Unless
your data is quite large, you can use PigStorageSchema as is and add
the "cat" line Dmitriy suggests as the last line of your Pig Latin
script. That will cause an extra read and write of the data, but it
will produce one file with the header in the right place without
requiring a new store function.
Alan.
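
As a rough illustration of the "cat" approach, here is a minimal shell sketch. It assumes the job's output directory has already been copied to local disk (for example with hadoop fs -get), that .pig_header ends with a newline, and that the part files sort in the order you want. The function name merge_with_header and the demo file contents are made up for the example; against HDFS directly, the hadoop fs -cat line from Dmitriy's message plays the same role.

```shell
#!/bin/sh
# Sketch only: put the header on top of the data exactly once.
# Assumes a LOCAL copy of the Pig output directory and a .pig_header
# file that ends with a newline (both are assumptions, not guarantees).

merge_with_header() {
    # $1 = local copy of the output directory
    # $2 = destination file for the merged result
    # The shell expands part* in sorted order, so part-00000 comes first.
    cat "$1/.pig_header" "$1"/part* > "$2"
}

# Demo with made-up data standing in for real job output.
demo_dir=$(mktemp -d)
printf 'AdvertiserID\tImpressions\n' > "$demo_dir/.pig_header"
printf '1\t100\n' > "$demo_dir/part-00000"
printf '2\t200\n' > "$demo_dir/part-00001"

merge_with_header "$demo_dir" "$demo_dir/merged.tsv"
cat "$demo_dir/merged.tsv"
```

Because the header is written once, ahead of all part files, it cannot end up repeated in the middle of the data, which is the problem with having every reducer emit it.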
On May 25, 2011, at 1:22 PM, Dmitriy Ryaboy wrote:
Still not clear on how you expect a UDF to help. Normally when we say
UDFs, we mean functions that work on individual tuples. They don't have
anything to do with how you store data.
You probably mean StoreFunc; since in this case you want a StoreFunc
that messes with the file format, as opposed to writing a side file
like PigStorageSchema does, you'll need to go pretty deep -- write a
whole StoreFunc + OutputFormat + RecordWriter stack.
On Wed, May 25, 2011 at 12:51 PM, Subhramanian, Deepak
<[email protected]> wrote:
Thanks for the inputs. I am looking for a UDF which I can use to store
the headers in the pig output file.
On 25 May 2011 18:30, Dmitriy Ryaboy <[email protected]> wrote:
Can you explain what UDF you are looking for?
The intended usage for the .pig_header file is to cat it:
hadoop fs -cat myresults/.pig_header myresults/part*
(which drops the header right on top of your data).
We don't want to put the header inside the data files because that can
break subsequent processing.
As for names of the fields, that's a pig feature, it's there for
disambiguation. If you don't like it, you can rename the fields:
FLATTEN(aggregated) as (advertiserId, Advertiser, OrderId, ....)
D
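
If renaming in the script is not convenient, the prefixes can also be stripped from the header after the fact. The following is a minimal sketch, not part of Pig or the piggybank: the function name strip_prefixes is hypothetical, and it assumes field names are separated by whitespace (such as tabs) and contain none themselves.

```shell
#!/bin/sh
# Sketch only: remove Pig's disambiguation prefixes from a header line.
# For each whitespace-separated field name, everything up to and
# including the last "::" is dropped.

strip_prefixes() {
    # Reads header line(s) on stdin, writes cleaned line(s) to stdout.
    sed 's/[^[:space:]]*:://g'
}

# Demo using a few of the field names from the .pig_header shown in this thread.
header=$(printf 'aggregated::group::AdvertiserID\tnull::Advertiser\taggregated::Impressions')
cleaned=$(printf '%s\n' "$header" | strip_prefixes)
printf '%s\n' "$cleaned"
```

Note that the prefixes exist for disambiguation, so stripping them blindly can leave two columns with the same name; renaming in the FLATTEN clause avoids that.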
On Wed, May 25, 2011 at 9:00 AM, Subhramanian, Deepak
<[email protected]> wrote:
Hi, I just realized that it is creating a .pig_header file in the same
output directory. I guess I need to create a new UDF. Also, when I am
grouping, it is prepending the tag aggregated::group:: to the header
columns. Isn't FLATTEN supposed to remove the group?
cat .pig_header
aggregated::group::AdvertiserID null::Advertiser
aggregated::group::OrderID aggregated::group::AdID
aggregated::group::CreativeID aggregated::group::CreativeVersion
aggregated::group::CreativeSizeID aggregated::group::SiteID
aggregated::group::PageID aggregated::group::Keyword
aggregated::Impressions
On 25 May 2011 16:48, Subhramanian, Deepak <
[email protected]> wrote:
I tried PigStorageSchema. For some reason it doesn't create the headers.
Is it because I am loading the data using another UDF?
This is the command I used in the pig script:
STORE out INTO '$OUTPUT' USING
org.apache.pig.piggybank.storage.PigStorageSchema();
Thanks, Deepak
On 25 May 2011 16:13, Dmitriy Ryaboy <[email protected]> wrote:
You can try PigStorageSchema from the piggybank.
-----Original Message-----
From: "Subhramanian, Deepak" <[email protected]>
To: [email protected]
Sent: 5/25/2011 5:28 AM
Subject: Storing Headers in Pig Output File
Is there a way to store the headers (the title of each column) using the
STORE command in a Pig script (STORE out3 INTO '$OUTPUT' USING
PigStorage();)? Right now it stores only the data. I read somewhere that
Pig 0.8 stores the header with the map reduce option. Do we have to
supply extra parameters?
Thanks, Deepak
--
"Please consider the environment before printing this e-mail"
The Newspaper Marketing Agency: Opening Up Newspapers: www.nmauk.co.uk
This e-mail and any attachments are confidential, may be legally
privileged and are the property of News International Limited (which is
the holding company for the News International group, is registered in
England under number 81701 and whose registered office is 3 Thomas More
Square, London E98 1XY, VAT number GB 243 8054 69), on whose systems
they were generated.
If you have received this e-mail in error, please notify the sender
immediately and do not use, distribute, store or copy it in any way.
Statements or o
[truncated by sender]
--
Deepak Subhramanian
Data & Analytics
News International, Digital Technology
Email: [email protected]