Try hadoop fs -getmerge.

D
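The -getmerge suggestion pulls every file in an HDFS directory down into a single local file, which is one way to stitch a header onto the part files. A minimal local sketch of the same idea (all paths are illustrative; plain cat on local files stands in for the HDFS commands, since the merge is just an ordered concatenation — whether a given Hadoop version's -getmerge picks up dot-files like .pig_header varies, so prepending the header explicitly is the safe route):

```shell
# On a real cluster, the equivalent commands would look like:
#   hadoop fs -getmerge pigdbck/output/top10advperimpfileh5 /tmp/top10adv.csv
#   hadoop fs -cat $OUTPUT/.pig_header $OUTPUT/part* > /tmp/top10adv.csv
# Local stand-in: build a mock Pig output directory, then concatenate
# the header file followed by the part files, in order.
mkdir -p /tmp/pig_demo
printf 'AdvertiserID\tImpressions\n' > /tmp/pig_demo/.pig_header
printf '101\t5\n' > /tmp/pig_demo/part-r-00000
printf '102\t7\n' > /tmp/pig_demo/part-r-00001
cat /tmp/pig_demo/.pig_header /tmp/pig_demo/part-r-* > /tmp/pig_demo/top10adv.csv
cat /tmp/pig_demo/top10adv.csv
```

Note that the shell redirection writes to a local path; redirecting into an HDFS path does not work, so the merged file has to be copied back with hadoop fs -put if it is needed on the cluster.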
On Thu, May 26, 2011 at 11:20 AM, Subhramanian, Deepak <[email protected]> wrote:
> I tried using the fs -cat and sh -cat functions to combine the header and
> output file into a new file, but it is not working. Does Hadoop give an
> option to combine two files into a new file in a Pig script?
>
> This is the command I used at the end of the Pig script:
>
> STORE out3 INTO '$OUTPUT' USING
> org.apache.pig.piggybank.storage.PigStorageSchema();
>
> sh -cat $OUTPUT/.pig_header $OUTPUT/part* > $OUTPUT/top10adv.csv
>
> hadoop fs -ls pigdbck/output/top10advperimpfileh5
> Found 4 items
> -rw-r--r--   1 root supergroup   30 2011-05-26 17:52 /user/root/pigdbck/output/top10advperimpfileh5/.pig_header
> -rw-r--r--   1 root supergroup  361 2011-05-26 17:52 /user/root/pigdbck/output/top10advperimpfileh5/.pig_schema
> drwxr-xr-x   - root supergroup    0 2011-05-26 17:51 /user/root/pigdbck/output/top10advperimpfileh5/_logs
> -rw-r--r--   1 root supergroup  117 2011-05-26 17:52 /user/root/pigdbck/output/top10advperimpfileh5/part-r-00000
>
> On 26 May 2011 12:02, Subhramanian, Deepak <[email protected]> wrote:
>> I thought any Java class extension was a UDF. Thanks, Dmitriy, for
>> clarifying. Yes, I meant extending StoreFunc. I guess I will use
>> PigStorageSchema for the time being, as I am tight on my deadlines, and use
>> cat to concatenate the header. I hadn't realized that we can use cat
>> directly in the Pig script, which is why I thought of extending StoreFunc.
>> Thanks, Alan, for your inputs.
>>
>> I will have to read more on how the output part files are created on HDFS,
>> so that I can combine all the part files at the end of the Pig script into a
>> final output if the file size is very big.
>>
>> On 25 May 2011 21:22, Dmitriy Ryaboy <[email protected]> wrote:
>>> Still not clear on how you expect a UDF to help. Normally when we say
>>> UDFs, we mean functions that work on individual tuples. They don't have
>>> anything to do with how you store data.
>>> You probably mean StoreFunc; since in this case you want a StoreFunc
>>> that messes with the file format, as opposed to writing a side file
>>> like PigStorageSchema does, you'll need to go pretty deep -- write a
>>> whole StoreFunc + OutputFormat + RecordWriter stack.
>>>
>>> On Wed, May 25, 2011 at 12:51 PM, Subhramanian, Deepak
>>> <[email protected]> wrote:
>>>> Thanks for the inputs. I am looking for a UDF which I can use to store
>>>> the headers in the Pig output file.
>>>>
>>>> On 25 May 2011 18:30, Dmitriy Ryaboy <[email protected]> wrote:
>>>>> Can you explain what UDF you are looking for?
>>>>> The intended usage for the .pig_header file is to cat it:
>>>>>
>>>>> hadoop fs -cat myresults/.pig_header myresults/part*
>>>>>
>>>>> (which drops the header right on top of your data).
>>>>>
>>>>> We don't want to put the header inside the data files because that can
>>>>> break subsequent processing.
>>>>>
>>>>> As for the names of the fields, that's a Pig feature; it's there for
>>>>> disambiguation. If you don't like it, you can rename the fields:
>>>>> FLATTEN(aggregated) AS (advertiserId, Advertiser, OrderId, ....)
>>>>>
>>>>> D
>>>>>
>>>>> On Wed, May 25, 2011 at 9:00 AM, Subhramanian, Deepak
>>>>> <[email protected]> wrote:
>>>>>> Hi, I just realized that it is creating a .pig_header file in the same
>>>>>> output directory. I guess I need to create a new UDF. Also, if I am
>>>>>> grouping, it is appending the tag aggregated::group:: to the header
>>>>>> column. Is FLATTEN not supposed to remove the group?
>>>>>> cat .pig_header
>>>>>> aggregated::group::AdvertiserID null::Advertiser
>>>>>> aggregated::group::OrderID aggregated::group::AdID
>>>>>> aggregated::group::CreativeID aggregated::group::CreativeVersion
>>>>>> aggregated::group::CreativeSizeID aggregated::group::SiteID
>>>>>> aggregated::group::PageID aggregated::group::Keyword
>>>>>> aggregated::Impressions
>>>>>>
>>>>>> On 25 May 2011 16:48, Subhramanian, Deepak
>>>>>> <[email protected]> wrote:
>>>>>>> I tried PigStorageSchema. For some reason it doesn't create the
>>>>>>> headers. Is it because I am loading the data using another UDF?
>>>>>>>
>>>>>>> This is the command I used in the Pig script:
>>>>>>>
>>>>>>> STORE out INTO '$OUTPUT' USING
>>>>>>> org.apache.pig.piggybank.storage.PigStorageSchema();
>>>>>>>
>>>>>>> Thanks, Deepak
>>>>>>>
>>>>>>> On 25 May 2011 16:13, Dmitriy Ryaboy <[email protected]> wrote:
>>>>>>>> You can try PigStorageSchema from the piggybank.
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: "Subhramanian, Deepak" <[email protected]>
>>>>>>>> To: [email protected]
>>>>>>>> Sent: 5/25/2011 5:28 AM
>>>>>>>> Subject: Storing Headers in Pig Output File
>>>>>>>>
>>>>>>>> Is there a way to store the headers (titles of each column) using the
>>>>>>>> STORE command in a Pig script (STORE out3 INTO '$OUTPUT' USING
>>>>>>>> PigStorage();)? Right now it stores only the data. Somewhere I read
>>>>>>>> that in Pig 0.8 it stores the header with a map-reduce option. Do we
>>>>>>>> have to supply extra parameters?
>>>>>>>> Thanks, Deepak
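A minimal Pig Latin sketch tying the two suggestions in the thread together: rename the fields produced by FLATTEN so the aggregated::group:: prefixes disappear from the header, then store with PigStorageSchema so the generated .pig_header side file carries the clean names. The relation names (grouped, out3) and the field list here are illustrative, not from a tested script:

```pig
-- Hypothetical relations; the AS clause renames the flattened group
-- fields so the header is not prefixed with aggregated::group::
out3 = FOREACH grouped GENERATE
           FLATTEN(group) AS (AdvertiserID, OrderID, AdID),
           COUNT(records) AS Impressions;

-- PigStorageSchema (from the piggybank) writes a .pig_header side
-- file with these field names next to the part files
STORE out3 INTO '$OUTPUT'
    USING org.apache.pig.piggybank.storage.PigStorageSchema();
```

The header can then be dropped on top of the data with hadoop fs -cat $OUTPUT/.pig_header $OUTPUT/part*, as suggested earlier in the thread.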
