Regarding your UDF, you're creating a lot of overhead for storing something outside of the Hadoop ecosystem, imho.
Why not create a dump of your booleans and then have a separate script push them all to S3 at one time after your Pig script is complete? That way you wouldn't be waiting on puts to S3 to complete your script.

On Wed, Jul 16, 2014 at 5:20 AM, Rodrigo Ferreira <[email protected]> wrote:

> Hey Jacob, thanks for your reply.
>
> Well, congratulations on your blog. It seems very interesting. I'll take a
> look at your post later.
>
> Regarding the second question. I have something like this:
>
> B = FOREACH A GENERATE MyUDF(args);
>
> This UDF stores something in S3 and the return value is not important (in
> fact it's just a boolean value). So I don't use relation B after this
> point.
>
> Well, it seems that because of that, Pig's optimizer skips/discards this
> statement. So, in order to make it work I had to insert another statement
> just doing something silly like:
>
> STORE B INTO 's3n://mybucket/dump';
>
> And now it works. I think it's reasonable because everything is in AWS
> servers and the file is really small (just a boolean value). Is there
> another way to do that?
>
> Rodrigo.
>
>
> 2014-07-15 19:49 GMT+02:00 Jacob Perkins <[email protected]>:
>
> >
> > On Jul 15, 2014, at 10:42 AM, Rodrigo Ferreira <[email protected]> wrote:
> >
> > > 1) I'm creating my "input" data using Pig itself. It means that the actual
> > > input is a small file with a few rows (few means not Big Data). And for
> > > each of these rows I create lots of data (my real input).
> > >
> > > Well, in order to do that, considering that the creation of the real input
> > > is CPU bounded, I decided to create a separated file for each row and LOAD
> > > them separately, this way allowing Pig to fire a different Map process for
> > > each of them and hopefully obtaining some parallelization. Is it OK?
> > Seems totally reasonable to me, albeit laborious. Be sure to set
> > pig.splitCombination to false. Alternatively, you could try the approach
> > here and write your own simple inputFormat:
> >
> > http://thedatachef.blogspot.com/2013/08/using-hadoop-to-explore-chaos.html
> > Similar ideas in that the "input" is actually just a very small file and
> > numerous simulations are run in parallel using pig.
> >
> > >
> > > 2) I have a UDF that I call in a projection relation. This UDF communicates
> > > with my S3 bucket and the relation that is produced in this projection is
> > > never used. Well, it seems that Pig optimizer simply discards this UDF.
> > > What I did was to make this UDF return a boolean value and I store it on S3
> > > (a lightweight file). This way it gets executed. Any thoughts on this?
> > Can you explain further? It's not clear what you're trying to do/what
> > isn't working.
> >
> > --jacob
> > @thedatachef
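
For what it's worth, the "dump first, push once afterwards" idea could look roughly like the sketch below. It is only an illustration under some assumptions: the script is written in Python, the Pig output directory has already been copied to local disk (for example with `hadoop fs -get`), and boto3 is installed with AWS credentials configured. The function names (`collect_part_files`, `push_dump`, `s3_upload`) are made up for this example; the bucket and prefix are borrowed from the STORE statement quoted above.

```python
import os
from typing import Callable, List


def collect_part_files(dump_dir: str) -> List[str]:
    """Return the sorted paths of the Pig output part files in dump_dir.

    Pig writes its results as part-* files; marker files such as
    _SUCCESS are skipped.
    """
    return sorted(
        os.path.join(dump_dir, name)
        for name in os.listdir(dump_dir)
        if name.startswith("part-")
    )


def push_dump(
    dump_dir: str,
    bucket: str,
    prefix: str,
    upload: Callable[[str, str, str], None],
) -> List[str]:
    """Upload every part file in one pass, after the Pig job has finished.

    `upload` takes (local_path, bucket, key). Passing it in keeps the
    S3 dependency out of the file-collection logic.
    """
    keys = []
    for path in collect_part_files(dump_dir):
        key = f"{prefix}/{os.path.basename(path)}"
        upload(path, bucket, key)
        keys.append(key)
    return keys


def s3_upload(path: str, bucket: str, key: str) -> None:
    """Upload one file with boto3 (assumed to be installed and configured)."""
    import boto3  # imported lazily so the helpers above work without it

    boto3.client("s3").upload_file(path, bucket, key)
```

Once the Pig script has completed, something like `push_dump("/tmp/bool_dump", "mybucket", "dump", s3_upload)` would send everything in one go (the local path here is hypothetical). The point of the split is that the Pig job only ever writes inside Hadoop, and the S3 puts happen once at the end instead of once per UDF call.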
