This is a bad way to do it, but I wonder what would happen in that case.
Assuming we have the following file:

<element>
...
</element>
<element>
...
</element>

and we use TextInputFormat to get single lines as records, then check: if we find an element, then buffer all the
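For reference, piggybank ships an XMLLoader that does exactly this kind of buffering for you, returning each <element>...</element> block as one record (jar location and input path below are placeholders):

```pig
-- Load whole <element>...</element> blocks as single records using
-- piggybank's XMLLoader, instead of buffering lines by hand on top
-- of TextInputFormat.
REGISTER piggybank.jar;
raw = LOAD 'input.xml'
      USING org.apache.pig.piggybank.storage.XMLLoader('element')
      AS (doc:chararray);
```

Each tuple in `raw` then holds one complete element as a chararray, which you can parse further with a UDF or regex.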
2) should be the right approach. More concisely, you can say:
processed = foreach flatTuple generate com.myUDF.UDF(*)
There seems to be something wrong with your schema propagation. How do you
generate inputBag? If it is generated by a UDF, make sure your UDF declares a
proper output schema. You can
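A quick way to check where the schema is getting lost (relation names taken from the thread; the UDF path is a placeholder from the earlier snippet):

```pig
flatTuple = FOREACH inputBag GENERATE FLATTEN($0);
-- If the schema propagated, this shows named, typed fields;
-- if the upstream UDF did not declare outputSchema(), it will
-- typically show just an untyped bytearray.
DESCRIBE flatTuple;

processed = FOREACH flatTuple GENERATE com.myUDF.UDF(*);
DESCRIBE processed;
```

Running DESCRIBE after each step usually pinpoints the relation where the declared schema disappears.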
I've got a general question surrounding the output of various Pig scripts:
generally, where are people storing that data, and in what kind of format?
I read Dmitriy's article on Apache log processing and noticed that the
output of the scripts was in a format more suitable for reporting and
graphing.
On 23 March 2011 18:12, Jonathan Holloway <jonathan.hollo...@gmail.com> wrote:
> I've got a general question surrounding the output of various Pig scripts
> and generally where people are
> storing that data and in what kind of format?
> ...
> At present the results from my Pig scripts end up in HDFS in
Hey Jon,
I think a common approach is to use Pig (and MR/Hadoop in general) as purely
the heavy lifter, doing all the merge-downs, aggregations and such of the
data. At Nokia we tend to output a lot of data from Pig/MR as TSV or CSV
(using PigStorage) and then use Sqoop to push that into a MySQL
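A sketch of that pipeline (the relation name, paths, table, and connection string below are all made up for illustration):

```pig
-- Write aggregated results as TSV; tab is PigStorage's default delimiter.
STORE daily_agg INTO '/reports/daily_agg' USING PigStorage();

-- Then, in a post-processing step outside Pig, push the TSV into MySQL,
-- along the lines of:
--   sqoop export --connect jdbc:mysql://dbhost/reports \
--                --table daily_agg --export-dir /reports/daily_agg
```

Keeping the database load as a separate step means a failed or rerun Pig job never leaves the reporting database half-written.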
Hi,
We've seen a strange problem where some Pig jobs run fewer mappers
concurrently than the mapper capacity. Specifically, we have a 10-node
cluster and each node is configured with 12 map slots, so normally we have
120 mappers running. But some Pig jobs will only run 10 mappers
What we do in production is a combination of two approaches:
1) TSV-delimited files (well, \u0001 actually, to avoid comma and tab
escaping complexities). Some but not all of these get bulk-loaded into our
reporting database in a post-processing step. We don't insert directly into
the db from Pig.
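For example (relation and path are hypothetical; PigStorage accepts a \uXXXX escape for the delimiter, though very old versions may require the literal control character):

```pig
-- Ctrl-A (\u0001) delimited output: the delimiter never appears in
-- normal text fields, so commas and tabs need no escaping.
STORE results INTO '/data/out/ctrl_a_delimited' USING PigStorage('\\u0001');
```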
What version of Pig are you using? Starting in 0.8, Pig will combine small
input blocks into a single map. This prevents jobs that are actually reading
small amounts of data from taking up a lot of slots on the cluster. You can
turn this off by adding -Dpig.noSplitCombination=true to your command line.
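Equivalently, you can set it inside the script itself with Pig's SET command, which passes the property through to the job configuration:

```pig
-- Disable split combination so each small block gets its own map task.
SET pig.noSplitCombination true;
```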