On Jul 15, 2014, at 10:42 AM, Rodrigo Ferreira <[email protected]> wrote:

> 1) I'm creating my "input" data using Pig itself. It means that the actual
> input is a small file with a few rows (few means not Big Data). And for
> each of these rows I create lots of data (my real input).
> 
> Well, in order to do that, considering that the creation of the real input
> is CPU bounded, I decided to create a separated file for each row and LOAD
> them separately, this way allowing Pig to fire a different Map process for
> each of them and hopefully obtaining some parallelization. Is it OK?
Seems totally reasonable to me, albeit laborious. Be sure to set 
pig.splitCombination to false. Alternatively, you could try the approach here 
and write your own simple InputFormat: 
http://thedatachef.blogspot.com/2013/08/using-hadoop-to-explore-chaos.html 
The idea is similar in that the "input" is actually just a very small file and 
numerous simulations are run in parallel using Pig.
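
Roughly, a minimal sketch of what I mean (the /seeds/ path, myudfs.jar, and the 
GenerateData UDF are just placeholders for your own):

    -- keep one mapper per small input file
    SET pig.splitCombination false;

    REGISTER myudfs.jar;

    -- one tiny file per row, each picked up by its own map task
    seeds    = LOAD '/seeds/*' USING PigStorage() AS (seed:chararray);

    -- the CPU-bound expansion happens inside the UDF, in parallel per seed file
    expanded = FOREACH seeds GENERATE FLATTEN(myudfs.GenerateData(seed));

    STORE expanded INTO '/data/expanded';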

> 
> 2) I have a UDF that I call in a projection relation. This UDF communicates
> with my S3 bucket and the relation that is produced in this projection is
> never used. Well, it seems that Pig optimizer simply discards this UDF.
> What I did was to make this UDF return a boolean value and I store it on S3
> (a lightweight file). This way it gets executed. Any thoughts on this?
Can you explain further? It's not clear to me what you're trying to do or what 
isn't working.
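
If you mean that Pig prunes the FOREACH because nothing downstream consumes it, 
that's expected: relations that never reach a STORE (or DUMP) are dropped from 
the plan, so forcing a STORE like you describe is a reasonable workaround. 
Something along these lines, if I'm reading you right (names are placeholders):

    data    = LOAD 's3://my-bucket/input' AS (line:chararray);

    -- the UDF talks to S3 as a side effect and returns a boolean
    touched = FOREACH data GENERATE myudfs.PushToS3(line) AS ok:boolean;

    -- without this STORE, the plan drops 'touched' and the UDF never runs
    STORE touched INTO 's3://my-bucket/flags';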

--jacob
@thedatachef
