What we do in production is a combination of two approaches:

1) Delimited text files, TSV in spirit (well, Ctrl-A / \u0001 delimited,
actually, to avoid comma and tab escaping complexities); see the sketch
after this list. Some but not all of these get bulk-loaded into our
reporting database in a post-processing step. We don't insert directly
into the db from Pig, both to avoid taking the database down and to make
it easy to reload data if needed without rerunning the Pig script.

2) Protocol Buffers or Thrift files that serve as inputs to other jobs.
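
Roughly, the text output is just PigStorage with a custom delimiter. A
minimal sketch with made-up relation and path names (how the '\u0001'
escape is interpreted can vary by Pig version, so some people pass the
literal Ctrl-A character instead):

  -- Ctrl-A (\u0001) delimited text; names here are just placeholders
  STORE daily_report INTO '/reports/daily' USING PigStorage('\u0001');

  -- plain tab-delimited output is what PigStorage gives you by default
  STORE daily_report INTO '/reports/daily_tsv' USING PigStorage();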

It's mostly TSVs. We use protobufs and Thrift when the schema is well known
and settled upon, the data needs to be kept and read back in Hadoop, and we
are likely to revisit it often, or when we are creating data sets to be
ingested into other services. We like the binary formats for the space
savings and explicit schemas, but there's something to be said for easily
human-readable files with no need to predefine the schemas.
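
For the binary formats, the Pig side is just a different StoreFunc. The
class and message names below are made up -- which storer you actually use
depends on the protobuf/Thrift integration library and version you have:

  -- hypothetical protobuf StoreFunc; substitute whatever storer your
  -- protobuf/Thrift library provides
  STORE user_events INTO '/data/user_events_pb'
      USING com.example.pig.ProtobufStorage('com.example.proto.UserEvent');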

D

On Wed, Mar 23, 2011 at 11:12 AM, Jonathan Holloway <
jonathan.hollo...@gmail.com> wrote:

> I've got a general question about the output of various Pig scripts:
> where are people storing that data, and in what kind of format?
>
> I read Dmitriy's article on Apache log processing and noticed that the
> output of the scripts was in a format more suitable for reporting and
> graphing: TSV files.
>
> At present the results from my Pig scripts end up in HDFS in Pig bag/tuple
> format and I just wondered whether
> that was the best practice for large amounts of data in terms of
> organisation.  Is anybody using Hive to store the
> intermediate Pig data and reporting off that instead?  Or, are people
> generating graphs and analyses based off the
> raw Pig data in HDFS?
>
> Many thanks,
> Jon.
>
