Re: Thoughts

Rodrigo Ferreira Wed, 16 Jul 2014 02:21:28 -0700

Hey Jacob, thanks for your reply.

Well, congratulations on your blog. It seems very interesting. I'll take a
look at your post later.


Regarding the second question, I have something like this:

B = FOREACH A GENERATE MyUDF(args);

This UDF stores something in S3 and the return value is not important (in
fact it's just a boolean value). So I don't use relation B after this point.

Well, it seems that because of that, Pig's optimizer skips/discards this
statement. So, in order to make it work I had to insert another statement
just doing something silly like:

STORE B INTO 's3n://mybucket/dump';

And now it works. I think it's reasonable because everything is on AWS
servers and the file is really small (just a boolean value). Is there
another way to do that?
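
For reference, the whole pattern can be sketched as below. The jar name,
UDF class, input path and schema are hypothetical placeholders; only the
STORE path comes from the statement above. As far as I understand, Pig
builds its execution plan from STORE and DUMP statements, so a relation
that feeds neither of them is never computed.

```pig
-- Hypothetical jar and class names, for illustration only.
REGISTER 'my-udfs.jar';
DEFINE MyUDF com.example.MyUDF();

-- Hypothetical input path and schema.
A = LOAD 's3n://mybucket/input' AS (arg:chararray);

-- The UDF writes to S3 as a side effect and returns a boolean.
B = FOREACH A GENERATE MyUDF(arg) AS ok;

-- Storing B makes it part of the plan, so the UDF actually runs.
STORE B INTO 's3n://mybucket/dump';
```

The stored output is one boolean per input row, so it stays small as long
as A has only a few rows.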

Rodrigo.


2014-07-15 19:49 GMT+02:00 Jacob Perkins <[email protected]>:

>
> On Jul 15, 2014, at 10:42 AM, Rodrigo Ferreira <[email protected]> wrote:
>
> > 1) I'm creating my "input" data using Pig itself. It means that the
> > actual input is a small file with a few rows (few means not Big Data).
> > And for each of these rows I create lots of data (my real input).
> >
> > Well, in order to do that, considering that the creation of the real
> > input is CPU-bound, I decided to create a separate file for each row and
> > LOAD them separately, this way allowing Pig to fire a different Map
> > process for each of them and hopefully obtaining some parallelization.
> > Is it OK?
> Seems totally reasonable to me, albeit laborious. Be sure to set
> pig.splitCombination to false. Alternatively, you could try the approach
> here and write your own simple InputFormat:
> http://thedatachef.blogspot.com/2013/08/using-hadoop-to-explore-chaos.html
> The idea is similar in that the "input" is actually just a very small file
> and numerous simulations are run in parallel using Pig.
>
> >
> > 2) I have a UDF that I call in a projection relation. This UDF
> > communicates with my S3 bucket and the relation that is produced in this
> > projection is never used. Well, it seems that Pig's optimizer simply
> > discards this UDF. What I did was to make this UDF return a boolean value
> > and I store it on S3 (a lightweight file). This way it gets executed. Any
> > thoughts on this?
> Can you explain further? It's not clear what you're trying to do/what
> isn't working.
>
> --jacob
> @thedatachef
>
>
