Hey Jacob, thanks for your reply. Well, congratulations on your blog. It seems very interesting. I'll take a look at your post later.
Regarding the second question, I have something like this:

    B = FOREACH A GENERATE MyUDF(args);

This UDF stores something in S3, and its return value is not important (in fact it's just a boolean value), so I don't use relation B after this point. It seems that, because of that, Pig's optimizer skips/discards this statement. To make it work I had to insert another statement that just does something silly like:

    STORE B INTO 's3n://mybucket/dump';

And now it works. I think it's reasonable because everything is on AWS servers and the file is really small (just a boolean value). Is there another way to do that?

Rodrigo.

2014-07-15 19:49 GMT+02:00 Jacob Perkins <[email protected]>:

> On Jul 15, 2014, at 10:42 AM, Rodrigo Ferreira <[email protected]> wrote:
>
> > 1) I'm creating my "input" data using Pig itself. It means that the actual
> > input is a small file with a few rows (few means not Big Data). And for
> > each of these rows I create lots of data (my real input).
> >
> > Well, in order to do that, considering that the creation of the real input
> > is CPU bounded, I decided to create a separated file for each row and LOAD
> > them separately, this way allowing Pig to fire a different Map process for
> > each of them and hopefully obtaining some parallelization. Is it OK?
>
> Seems totally reasonable to me, albeit laborious. Be sure to set
> pig.splitCombination to false. Alternatively, you could try the approach
> here and write your own simple inputFormat:
> http://thedatachef.blogspot.com/2013/08/using-hadoop-to-explore-chaos.html
> Similar ideas in that the "input" is actually just a very small file and
> numerous simulations are run in parallel using pig.
>
> > 2) I have a UDF that I call in a projection relation. This UDF communicates
> > with my S3 bucket and the relation that is produced in this projection is
> > never used. Well, it seems that Pig optimizer simply discards this UDF.
> > What I did was to make this UDF return a boolean value and I store it on S3
> > (a lightweight file). This way it gets executed. Any thoughts on this?
>
> Can you explain further? It's not clear what you're trying to do/what
> isn't working.
>
> --jacob
> @thedatachef
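
In case it helps, here is a minimal sketch of the script with the workaround from the top of this message in place. The jar name, the input path, and the schema are placeholders for illustration only; the FOREACH and STORE statements are the ones described above.

```pig
-- Hypothetical jar name; MyUDF is assumed to live in it.
REGISTER myudfs.jar;

-- Hypothetical input path and schema, for illustration only.
A = LOAD 's3n://mybucket/input' AS (args:chararray);

-- MyUDF writes to S3 as a side effect and returns a boolean.
B = FOREACH A GENERATE MyUDF(args);

-- As described above, the FOREACH was skipped until a STORE depended on B.
STORE B INTO 's3n://mybucket/dump';
```

The stored output is only the boolean results, so the extra file stays small.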
