RE: Custom Slicer

2011-03-23 Thread Lai Will
This is a bad way to do it, but I wonder what would happen in that case. Assuming we have the following file: <element> ... </element> <element> ... </element> and we use TextInputFormat to get single lines as records, then check whether we've found a <element> and buffer all the
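For what it's worth, piggybank's XMLLoader already does this kind of tag-bounded record splitting, which avoids hand-buffering lines yourself. A minimal sketch, assuming a piggybank jar is available (jar and input paths are made up):

    -- XMLLoader returns everything between <element> and </element>
    -- as one chararray record
    REGISTER /path/to/piggybank.jar;  -- jar location is an assumption
    docs = LOAD 'input.xml'
           USING org.apache.pig.piggybank.storage.XMLLoader('element')
           AS (doc:chararray);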

Re: Unable to access Map within a tuple.

2011-03-23 Thread Daniel Dai
2) should be the right approach; more concisely, you can say: processed = foreach flatTuple generate com.myUDF.UDF(*) There seems to be something wrong with your schema propagation. How do you generate inputBag? If it is generated by a UDF, make sure your UDF declares a proper output schema. You can
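A minimal sketch of that check (how flatTuple is produced from inputBag is an assumption here); DESCRIBE will show whether the schema actually made it through:

    -- assumed: inputBag holds one bag per record that we flatten first
    flatTuple = FOREACH inputBag GENERATE FLATTEN($0);
    -- pass the entire tuple to the UDF
    processed = FOREACH flatTuple GENERATE com.myUDF.UDF(*);
    -- if the UDF declares its output schema, it shows up here;
    -- "Schema for processed unknown" means propagation is broken
    DESCRIBE processed;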

Storing and reporting off Pig data

2011-03-23 Thread Jonathan Holloway
I've got a general question surrounding the output of various Pig scripts: where are people storing that data, and in what kind of format? I read Dmitriy's article on Apache log processing and noticed that the output of the scripts was in a format more suitable for reporting and graphing

Re: Storing and reporting off Pig data

2011-03-23 Thread Alex McLintock
On 23 March 2011 18:12, Jonathan Holloway jonathan.hollo...@gmail.com wrote: I've got a general question surrounding the output of various Pig scripts: where are people storing that data, and in what kind of format? ... At present the results from my Pig scripts end up in HDFS in

Re: Storing and reporting off Pig data

2011-03-23 Thread Josh Devins
Hey Jon, I think a common approach is to use Pig (and MR/Hadoop in general) as purely the heavy lifter, doing all the merge-downs, aggregations and such of the data. At Nokia we tend to output a lot of data from Pig/MR as TSV or CSV (using PigStorage) and then use Sqoop to push that into a MySQL
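As a rough sketch of that pipeline (relation name, paths, and database details are all made up), the Pig half is just a delimited store, and Sqoop is then run against the output directory:

    -- write aggregated results as tab-separated text for Sqoop to pick up
    STORE daily_aggregates INTO '/reports/daily' USING PigStorage('\t');
    -- then, outside Pig, something along the lines of:
    --   sqoop export --connect jdbc:mysql://dbhost/reports \
    --     --table daily_aggregates --export-dir /reports/daily \
    --     --input-fields-terminated-by '\t'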

possibly Pig throttles the number of mappers

2011-03-23 Thread Dexin Wang
Hi, We've seen a strange problem where some Pig jobs run fewer mappers concurrently than the mapper capacity allows. Specifically, we have a 10-node cluster and each node is configured with 12 mapper slots, so normally we have 120 mappers running. But for some Pig jobs it will only have 10 mappers

Re: Storing and reporting off Pig data

2011-03-23 Thread Dmitriy Ryaboy
What we do in production is a combination of two approaches: 1) TSV-style delimited files (well, \u0001-delimited actually, to avoid comma and tab escaping complexities). Some but not all of these get bulk-loaded into our reporting database in a post-processing step. We don't insert directly into the db from Pig
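A minimal sketch of the delimiter trick (relation and output path are assumptions); PigStorage accepts the delimiter as a Unicode escape:

    -- ctrl-A (0x01) as field delimiter: it never appears in normal text,
    -- so commas and tabs in the data need no escaping
    STORE results INTO '/output/report_data' USING PigStorage('\u0001');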

Re: possibly Pig throttles the number of mappers

2011-03-23 Thread Alan Gates
What version of Pig are you using? Starting in 0.8, Pig will combine small input splits into a single map. This prevents jobs that are actually reading small amounts of data from taking up a lot of slots on the cluster. You can turn this off by adding -Dpig.noSplitCombination=true to your command
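If split combination is what's capping the job at 10 mappers, the property goes on the pig invocation before the script name (script name here is made up):

    pig -Dpig.noSplitCombination=true myscript.pig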