Note that in my original proposal, I was suggesting we could track whether block size = 0 using a compressed bitmap. That way we can still avoid requests for zero-sized blocks.
On Thu, Jul 3, 2014 at 3:12 PM, Reynold Xin <r...@databricks.com> wrote: > Yes, that number is likely == 0 in any real workload ... > > > On Thu, Jul 3, 2014 at 8:01 AM, Mridul Muralidharan <mri...@gmail.com> > wrote: > >> On Thu, Jul 3, 2014 at 11:32 AM, Reynold Xin <r...@databricks.com> wrote: >> > On Wed, Jul 2, 2014 at 3:44 AM, Mridul Muralidharan <mri...@gmail.com> >> > wrote: >> > >> >> >> >> > >> >> > The other thing we do need is the location of blocks. This is >> actually >> >> just >> >> > O(n) because we just need to know where the map was run. >> >> >> >> For well partitioned data, wont this not involve a lot of unwanted >> >> requests to nodes which are not hosting data for a reducer (and lack >> >> of ability to throttle). >> >> >> > >> > Was that a question? (I'm guessing it is). What do you mean exactly? >> >> >> I was not sure if I understood the proposal correctly - hence the >> query : if I understood it right - the number of wasted requests goes >> up by num_reducers * avg_nodes_not_hosting data. >> >> Ofcourse, if avg_nodes_not_hosting data == 0, then we are fine ! >> >> Regards, >> Mridul >> > >