Rodrigo,

Write your own StoreFunc and use it instead of the UDF. PigStorage can
already write to S3, so it should be straightforward to simply subclass that.
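Something like this should work as a starting point. Untested sketch; the
class name and the commented-out helper are placeholders for whatever your
UDF actually does per row:

    import java.io.IOException;

    import org.apache.pig.builtin.PigStorage;
    import org.apache.pig.data.Tuple;

    // Illustrative subclass: PigStorage already knows how to write to
    // s3n:// locations, so the only hook needed is putNext, where the
    // per-row work the UDF currently does can happen as part of the
    // store itself.
    public class MyS3Storage extends PigStorage {

        public MyS3Storage() {
            super();
        }

        @Override
        public void putNext(Tuple t) throws IOException {
            // doWhateverMyUdfDid(t);  // hypothetical: your per-row logic
            super.putNext(t);          // PigStorage handles the actual write
        }
    }

Then in your script the dummy STORE becomes the real one:

    REGISTER myudfs.jar;
    STORE A INTO 's3n://mybucket/output' USING MyS3Storage();

Since STORE is a sink, the optimizer can't discard it, so you get the side
effect you want without the throwaway boolean relation.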
--jacob
@thedatachef

On Jul 16, 2014, at 2:20 AM, Rodrigo Ferreira <[email protected]> wrote:

> Hey Jacob, thanks for your reply.
>
> Well, congratulations on your blog. It seems very interesting. I'll take a
> look at your post later.
>
> Regarding the second question, I have something like this:
>
> B = FOREACH A GENERATE MyUDF(args);
>
> This UDF stores something in S3 and the return value is not important (in
> fact it's just a boolean value), so I don't use relation B after this
> point.
>
> Well, it seems that because of that, Pig's optimizer skips/discards this
> statement. So, in order to make it work, I had to insert another statement
> just doing something silly like:
>
> STORE B INTO 's3n://mybucket/dump';
>
> And now it works. I think it's reasonable because everything is on AWS
> servers and the file is really small (just a boolean value). Is there
> another way to do that?
>
> Rodrigo.
>
>
> 2014-07-15 19:49 GMT+02:00 Jacob Perkins <[email protected]>:
>
>> On Jul 15, 2014, at 10:42 AM, Rodrigo Ferreira <[email protected]> wrote:
>>
>>> 1) I'm creating my "input" data using Pig itself. It means that the
>>> actual input is a small file with a few rows (few means not Big Data).
>>> And for each of these rows I create lots of data (my real input).
>>>
>>> Well, in order to do that, considering that the creation of the real
>>> input is CPU bound, I decided to create a separate file for each row and
>>> LOAD them separately, this way allowing Pig to fire a different Map
>>> process for each of them and hopefully obtaining some parallelization.
>>> Is it OK?
>>
>> Seems totally reasonable to me, albeit laborious. Be sure to set
>> pig.splitCombination to false. Alternatively, you could try the approach
>> here and write your own simple InputFormat:
>> http://thedatachef.blogspot.com/2013/08/using-hadoop-to-explore-chaos.html
>> Similar ideas in that the "input" is actually just a very small file and
>> numerous simulations are run in parallel using Pig.
>>
>>> 2) I have a UDF that I call in a projection relation. This UDF
>>> communicates with my S3 bucket and the relation that is produced in
>>> this projection is never used. Well, it seems that Pig's optimizer
>>> simply discards this UDF. What I did was to make this UDF return a
>>> boolean value and I store it on S3 (a lightweight file). This way it
>>> gets executed. Any thoughts on this?
>>
>> Can you explain further? It's not clear what you're trying to do/what
>> isn't working.
>>
>> --jacob
>> @thedatachef
