Rodrigo,

Write your own StoreFunc and use it instead of the UDF. PigStorage can
already write to S3, so it should be straightforward to simply subclass that.
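Something like this should work as a starting point. Untested sketch; the
class name and the commented-out helper are placeholders for whatever your
UDF actually does per row:

    import java.io.IOException;

    import org.apache.pig.builtin.PigStorage;
    import org.apache.pig.data.Tuple;

    // Illustrative subclass: PigStorage already knows how to write to
    // s3n:// locations, so the only hook needed is putNext, where the
    // per-row work the UDF currently does can happen as part of the
    // store itself.
    public class MyS3Storage extends PigStorage {

        public MyS3Storage() {
            super();
        }

        @Override
        public void putNext(Tuple t) throws IOException {
            // doWhateverMyUdfDid(t);  // hypothetical: your per-row logic
            super.putNext(t);          // PigStorage handles the actual write
        }
    }

Then in your script the dummy STORE becomes the real one:

    REGISTER myudfs.jar;
    STORE A INTO 's3n://mybucket/output' USING MyS3Storage();

Since STORE is a sink, the optimizer can't discard it, so you get the side
effect you want without the throwaway boolean relation.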
--jacob
@thedatachef

On Jul 16, 2014, at 2:20 AM, Rodrigo Ferreira <[email protected]> wrote:

> Hey Jacob, thanks for your reply.
>
> Well, congratulations on your blog. It seems very interesting. I'll take a
> look at your post later.
>
> Regarding the second question, I have something like this:
>
> B = FOREACH A GENERATE MyUDF(args);
>
> This UDF stores something in S3 and the return value is not important (in
> fact it's just a boolean value), so I don't use relation B after this
> point.
>
> Well, it seems that because of that, Pig's optimizer skips/discards this
> statement. So, in order to make it work, I had to insert another statement
> just doing something silly like:
>
> STORE B INTO 's3n://mybucket/dump';
>
> And now it works. I think it's reasonable because everything is on AWS
> servers and the file is really small (just a boolean value). Is there
> another way to do that?
>
> Rodrigo.
>
>
> 2014-07-15 19:49 GMT+02:00 Jacob Perkins <[email protected]>:
>
>> On Jul 15, 2014, at 10:42 AM, Rodrigo Ferreira <[email protected]> wrote:
>>
>>> 1) I'm creating my "input" data using Pig itself. It means that the
>>> actual input is a small file with a few rows (few means not Big Data).
>>> And for each of these rows I create lots of data (my real input).
>>>
>>> Well, in order to do that, considering that the creation of the real
>>> input is CPU bound, I decided to create a separate file for each row and
>>> LOAD them separately, this way allowing Pig to fire a different Map
>>> process for each of them and hopefully obtaining some parallelization.
>>> Is it OK?
>>
>> Seems totally reasonable to me, albeit laborious. Be sure to set
>> pig.splitCombination to false. Alternatively, you could try the approach
>> here and write your own simple InputFormat:
>> http://thedatachef.blogspot.com/2013/08/using-hadoop-to-explore-chaos.html
>> Similar ideas in that the "input" is actually just a very small file and
>> numerous simulations are run in parallel using Pig.
>>
>>> 2) I have a UDF that I call in a projection relation. This UDF
>>> communicates with my S3 bucket and the relation that is produced in
>>> this projection is never used. Well, it seems that Pig's optimizer
>>> simply discards this UDF. What I did was to make this UDF return a
>>> boolean value and I store it on S3 (a lightweight file). This way it
>>> gets executed. Any thoughts on this?
>>
>> Can you explain further? It's not clear what you're trying to do/what
>> isn't working.
>>
>> --jacob
>> @thedatachef
