Re: About Issue PIG-841

2011-03-13 Thread Dmitriy Ryaboy
Renato, here's how file reading works in Hadoop (this is fairly detailed, but I have simplified a few spots in the description -- one can get pretty creative by providing one's own implementations of certain interfaces). First, a little about HDFS. A file is physically stored in HDFS as a collection of b

Re: About Issue PIG-841

2011-03-13 Thread Renato Marroquín Mogrovejo
Thanks for answering, Daniel! But there are a couple of things I don't quite get. For example, you say that each mapper will read just the configured number of rows, but wouldn't the mappers end up reading the whole file? My other question: if this is implemented, do you know in which classe

Re: About Issue PIG-841

2011-03-07 Thread Daniel Dai
The sampling job in Pig is used in "order by" and "skewed join". It is translated to a single map-reduce job. In the map, we sample the data at a configurable interval; in the reduce, we do a "group all" followed by a nested foreach. Within the foreach, we do a nested sort and then feed the resu
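
As a rough illustration of the flow described in this message (sample at an interval in the map, then gather all samples into one group and sort them in the reduce), here is a minimal Python sketch. It is not Pig's implementation: the function names `map_sample` and `reduce_sort` and the interval value are hypothetical, and whatever happens to the sorted samples afterwards is left out because the excerpt is cut off at that point.

```python
from itertools import islice


def map_sample(records, interval):
    """Map side (hypothetical helper): keep every `interval`-th record."""
    if interval < 1:
        raise ValueError("interval must be >= 1")
    # islice with a step keeps records 0, interval, 2*interval, ...
    return list(islice(records, 0, None, interval))


def reduce_sort(samples_per_mapper):
    """Reduce side (hypothetical helper): merge all samples, then sort.

    Merging every mapper's output into one list stands in for "group all";
    the sort stands in for the nested sort inside the foreach.
    """
    merged = [s for samples in samples_per_mapper for s in samples]
    return sorted(merged)


if __name__ == "__main__":
    # Two pretend mappers, each sampling its own slice of the input.
    mapper_outputs = [
        map_sample([42, 7, 19, 3, 88, 21], interval=2),
        map_sample([5, 64, 12, 30, 1, 77], interval=2),
    ]
    print(reduce_sort(mapper_outputs))
```

Note that in this sketch each mapper still iterates over its whole input and only emits a fraction of it; the interval controls how much is passed on to the single reducer, not how much is read.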

About Issue PIG-841

2011-03-05 Thread Renato Marroquín Mogrovejo
Hey, does anybody know if PIG-841 was developed? And if it was, how is it being used by Pig? Thanks in advance. Renato M.