A simple solution would be to tag each tuple with a random number (such that each number has multiple URLs associated with it, but not too many), and simply group on this field. In the reducer you then get a bag of URLs for each random number, at which point you can use multiple threads to fetch the content and associate each response with the appropriate input tuple.


You only need to ensure that:
a) Too many tuples don't get associated with a single random number (to the extent that it causes spills to disk).

b) Too few tuples don't get associated with each random number you use, else it degenerates to the current case.

c) You seed the random number generator sensibly, so that you don't run into problems with your tasks being non-repeatable (a re-executed task attempt should produce the same assignment).
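Roughly, on the Pig side you would tag each URL tuple with a random bucket id, GROUP on that field, and then run the resulting bag of URLs through a UDF that fans the fetches out over a thread pool. Just to illustrate the idea, here is a rough sketch of what that reducer-side UDF could look like (the class name ParallelFetch, the thread count and the fetch() stub are placeholders for your existing single-URL logic, not anything that ships with Pig):

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Sketch only: fetch every URL in the grouped bag concurrently and return a
// bag of (url, content) tuples. Names and sizes here are placeholders.
public class ParallelFetch extends EvalFunc<DataBag> {
    private static final int THREADS = 10;  // tune to what one task slot can bear

    @Override
    public DataBag exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) {
            return null;
        }
        DataBag urls = (DataBag) input.get(0);  // the bag produced by the GROUP
        ExecutorService pool = Executors.newFixedThreadPool(THREADS);
        List<Future<Tuple>> futures = new ArrayList<Future<Tuple>>();
        Iterator<Tuple> it = urls.iterator();
        while (it.hasNext()) {
            final Tuple in = it.next();
            futures.add(pool.submit(new Callable<Tuple>() {
                public Tuple call() throws Exception {
                    String url = (String) in.get(0);
                    String body = fetch(url);  // your existing single-URL fetch
                    Tuple out = TupleFactory.getInstance().newTuple(2);
                    out.set(0, url);
                    out.set(1, body);
                    return out;
                }
            }));
        }
        DataBag result = BagFactory.getInstance().newDefaultBag();
        try {
            for (Future<Tuple> f : futures) {
                result.add(f.get());
            }
        } catch (Exception e) {
            throw new IOException("fetch failed", e);
        } finally {
            pool.shutdown();
        }
        return result;
    }

    // Placeholder for the 1-5 second HTTP call you already have in your UDF.
    private String fetch(String url) {
        return "";
    }
}

Keeping the thread pool small per task also keeps the per-node load bounded, regardless of how many reducer slots the cluster is configured with.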

Regards,
Mridul

On Wednesday 09 November 2011 07:04 PM, Daan Gerits wrote:
Hello,

First of all, great job creating Pig, really a magnificent piece of software.

I do have a few questions about UDFs. I have a dataset with a list of URLs I 
want to fetch. Since an EvalFunc can only process one tuple at a time and the 
asynchronous abilities of the UDF are deprecated, I can only fetch one URL at a 
time. The problem is that fetching this one URL takes a reasonable amount of 
time (1 to 5 seconds, there is a delay built in), so that really slows down the 
processing. I already converted the UDF into an Accumulator, but that only seems 
to get fired after a GROUP BY. It would be nice to have some kind of queue UDF 
which would queue the tuples until a certain count is reached and then flush 
the queue. That way I can add tuples to an internal list and, on flush, start 
multiple threads to go through the list of tuples.

This is a workaround though, since the best solution would be to reintroduce 
the asynchronous UDFs (in which case I can schedule the threads as the tuples 
come in).

Any ideas on this? I already saw someone trying almost the same thing, but 
that thread didn't get a definite answer.

Another option is to increase the number of reducer slots on the cluster, but 
I'm afraid that would mean we overload the nodes in case of a heavy reduce 
phase.

Best Regards,

Daan
