[jira] [Commented] (CRUNCH-91) Enable custom output file naming

Gabriel Reid (JIRA) Tue, 09 Oct 2012 00:30:10 -0700

    [ https://issues.apache.org/jira/browse/CRUNCH-91?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13472196#comment-13472196 ]


Gabriel Reid commented on CRUNCH-91:
------------------------------------

Thanks for taking a look at it, Josh. I'm still feeling a bit torn on this one 
-- on the one hand, the ability to give output files meaningful names is 
definitely a use case that I need in my day-to-day work. On the other hand, 
I'm a bit concerned about this being a step towards putting too many bells 
and whistles into Crunch, as we could alternatively just have a config option 
that allows you to keep the default output names provided by Hadoop, and leave 
file renaming operations up to the developer.

The really cool feature (well, I think it's cool) that I can see us being able 
to provide if we do go for this is to be able to have an API something like 
this:

// Some kind of aggregation per product
PTable<Product, PurchaseSummary> productsAndPurchaseSummaries = ...;

// Writes out the products and purchase summaries, with one file per product
// manufacturer; each file name is the name of the product manufacturer,
// which is extracted from the Product value
pipeline.write(productsAndPurchaseSummaries,
    At.fanOut(outputDir, new ManufacturerExtractionFn()));

Does that sway you (or anyone else) any more in one direction or the other? 
Obviously I want to try to do something that is useful for general use cases, 
and not just mine (which is currently mostly based around processing 
geographical data and outputting it into named files).
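
To make the fan-out idea above a bit more concrete, here is a minimal 
plain-Java sketch of the intended behaviour: records are grouped by a file 
name derived from each record. It does not use Crunch at all; FanOutSketch, 
its Product class and the fanOut method are hypothetical stand-ins for 
illustration, not the proposed implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

final class FanOutSketch {

    // Hypothetical stand-in for the Product type in the example above.
    static final class Product {
        final String name;
        final String manufacturer;

        Product(String name, String manufacturer) {
            this.name = name;
            this.manufacturer = manufacturer;
        }
    }

    // Groups records by the file name that nameFn derives from each record,
    // i.e. "one output file per extracted name". The map key plays the role
    // of the output file name; the list holds that file's records.
    static <T> Map<String, List<T>> fanOut(List<T> records, Function<T, String> nameFn) {
        Map<String, List<T>> files = new TreeMap<>();
        for (T record : records) {
            files.computeIfAbsent(nameFn.apply(record), k -> new ArrayList<>()).add(record);
        }
        return files;
    }

    public static void main(String[] args) {
        List<Product> products = List.of(
                new Product("Widget", "Acme"),
                new Product("Gadget", "Globex"),
                new Product("Sprocket", "Acme"));

        // The lambda plays the role of ManufacturerExtractionFn.
        Map<String, List<Product>> files = fanOut(products, p -> p.manufacturer);
        files.forEach((fileName, contents) ->
                System.out.println(fileName + ": " + contents.size() + " record(s)"));
    }
}
```

In a real pipeline the grouping would of course be done by the job itself 
rather than in memory; the sketch only shows how an extraction function 
could decide the file name.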
                
> Enable custom output file naming
> --------------------------------
>
>                 Key: CRUNCH-91
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-91
>             Project: Crunch
>          Issue Type: Improvement
>            Reporter: Gabriel Reid
>         Attachments: CRUNCH-91.patch
>
>
> The current output file naming behavior in Crunch is to use the classic 
> Hadoop-style file naming (i.e. part-m-00001, part-r-00002), with the 
> numerical part of the filename being set based on the number of existing 
> files in the output directory to avoid naming collisions.
> The intention of this issue is to allow developers to define their own output 
> file names for Crunch output files.
> The original underlying motivation for this issue is having a custom 
> partitioner in a job which routes records to a specific partition (and 
> therefore reducer) based on content of the record, and then needing to 
> perform file renaming operations on the output files to allow their names to 
> include specific information about the partition they contain. The partition 
> number of files currently gets discarded by Crunch, making this renaming 
> impossible. The approach proposed here (custom file naming within Crunch) 
> goes one step further, giving developers a hook to actually define their own 
> output file naming scheme.
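
As a rough illustration of the renaming step described above, the following 
plain-Java sketch derives a new name from a Hadoop-style part file name. It 
assumes the partition number is still present in the file name (which, as 
noted, Crunch currently does not guarantee); PartFileNames, renamed and the 
labels list are hypothetical names used only for this example.

```java
import java.util.List;
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PartFileNames {

    // Matches classic Hadoop-style names such as part-m-00001 or
    // part-r-00002, optionally followed by an extension.
    private static final Pattern PART_FILE =
            Pattern.compile("part-([mr])-(\\d+)(\\..+)?");

    // Returns the partition number encoded in the file name, if any.
    static OptionalInt partitionOf(String fileName) {
        Matcher m = PART_FILE.matcher(fileName);
        return m.matches()
                ? OptionalInt.of(Integer.parseInt(m.group(2)))
                : OptionalInt.empty();
    }

    // Hypothetical naming scheme: replace the part-x-nnnnn prefix with the
    // label of the partition, keeping any extension. Names that are not
    // part files, or that have no label, are returned unchanged.
    static String renamed(String fileName, List<String> labels) {
        Matcher m = PART_FILE.matcher(fileName);
        if (!m.matches()) {
            return fileName;
        }
        int partition = Integer.parseInt(m.group(2));
        if (partition >= labels.size()) {
            return fileName;
        }
        String extension = m.group(3) == null ? "" : m.group(3);
        return labels.get(partition) + extension;
    }
}
```

The labels list has to agree with whatever the custom partitioner does, 
which is exactly the coupling that a naming hook inside Crunch would avoid.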

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
