Hi everyone, I'm developing a fairly complex system using Pig and I'd like to confirm some ideas if possible. They are not really questions, more like thoughts.

1) I'm creating my "input" data using Pig itself. The actual input is a small file with a few rows (few means not Big Data), and for each of these rows I generate lots of data (my real input). Since generating the real input is CPU-bound, I decided to create a separate file for each row and LOAD each file separately. This should allow Pig to start a different map task for each of them, and hopefully I get some parallelism. Is this OK?

2) I have a UDF that I call in a projection. This UDF communicates with my S3 bucket, and the relation produced by this projection is never used. It seems that the Pig optimizer simply discards the UDF call. What I did was make the UDF return a boolean value, which I store on S3 (a lightweight file). This way it gets executed. Any thoughts on this?

Thank you. I'll come back later with other ideas. I hope this reasoning may help someone :)

Rodrigo Ferreira
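
P.S. To make point 1 concrete, here is a minimal sketch of the "one file per row" step, written as a standalone Java helper rather than as part of the Pig script. It assumes the small input is plain text with one row per line. The class name RowSplitter, the method splitRows and the row-NNNNN.txt file naming are hypothetical, chosen only for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: writes each non-blank row of a small input file
// to its own file, so that each file can be LOADed separately in Pig.
final class RowSplitter {

    // Returns the paths of the files written, in input order.
    static List<Path> splitRows(Path input, Path outputDir) throws IOException {
        Files.createDirectories(outputDir);
        List<Path> written = new ArrayList<>();
        int index = 0;
        for (String row : Files.readAllLines(input, StandardCharsets.UTF_8)) {
            if (row.trim().isEmpty()) {
                continue; // skip blank lines, they carry no row
            }
            // Assumed naming scheme: row-00000.txt, row-00001.txt, ...
            Path part = outputDir.resolve(String.format("row-%05d.txt", index++));
            Files.write(part, (row + System.lineSeparator()).getBytes(StandardCharsets.UTF_8));
            written.add(part);
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java RowSplitter <input-file> <output-dir>");
            System.exit(1);
        }
        List<Path> parts = splitRows(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println("wrote " + parts.size() + " files to " + args[1]);
    }
}
```

Each resulting file would then get its own LOAD statement in the script. One caveat, as far as I understand it: Pig can combine small input files into a single split (the pig.splitCombination property, on by default), so I may need to turn that off to really get one map task per file.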
