It also seems logical that launching 4000 map tasks on a 20-node cluster is going to carry a lot of overhead. Still, 20 does not seem like the ideal number either, but I don't really know the internals of Cassandra that well. You might want to post this question on the Cassandra list to see if they can help you identify a way to increase the number of map tasks.
--Bobby Evans

On 11/5/11 9:33 AM, "Brendan W." <bw8...@gmail.com> wrote:

Yeah, that's my guess now: somebody must have hacked the Cassandra libs on me. I just wanted to see if there were other possibilities for where that parameter was being set.

Thanks a lot for the help.

On Fri, Nov 4, 2011 at 2:11 PM, Harsh J <ha...@cloudera.com> wrote:

> Could it just be that Cassandra has changed the way its splits are
> generated? Were the Cassandra client libs changed at any point? Have
> you looked at its input format's sources?
>
> On 04-Nov-2011, at 10:05 PM, Brendan W. wrote:
>
> > Plain Java MR, using the Cassandra InputFormat to read out of
> > Cassandra.
> >
> > Perhaps somebody hacked the InputFormat code on me...
> >
> > But what's weird is that the parameter mapred.map.tasks didn't appear
> > in the job confs before at all. Now it does, with a value of 20 (which
> > happens to be the number of machines in the cluster), and that's
> > without the jobs or the mapred-site.xml files themselves changing.
> >
> > The input split size is set explicitly in the jobs and has not been
> > changed (except that I subsequently fiddled with it a little to see
> > whether it affected the fact that I was getting 20 splits; it didn't
> > affect that, just the split size, not the number).
> >
> > After I submit the job, I get a message "TOTAL NUMBER OF SPLIT = 20"
> > before a list of the input splits... it sort of looks like a hack, but
> > I can't find where it is.
> >
> > On Fri, Nov 4, 2011 at 11:58 AM, Harsh J <ha...@cloudera.com> wrote:
> >
> >> Brendan,
> >>
> >> Are these jobs (whose split behavior has changed) run via Hive/etc.
> >> or plain Java MR?
> >>
> >> In case it's the former, do you have users using newer versions of
> >> them?
> >>
> >> On 04-Nov-2011, at 8:03 PM, Brendan W. wrote:
> >>
> >>> Hi,
> >>>
> >>> On my cluster of 20 machines, I used to run jobs (via "hadoop jar
> >>> ...") that would spawn around 4000 map tasks. Now when I run the
> >>> same jobs, that number is 20; and I notice that in the job
> >>> configuration, the parameter mapred.map.tasks is set to 20, whereas
> >>> it never used to be present in the configuration file at all.
> >>>
> >>> Changing the input split size in the job doesn't affect this: I get
> >>> the split size I ask for, but the *number* of input splits is still
> >>> capped at 20, i.e., the job isn't reading all of my data.
> >>>
> >>> The mystery to me is where this parameter could be getting set. It
> >>> is not present in the mapred-site.xml file in <hadoop home>/conf on
> >>> any machine in the cluster, and it is not being set in the job (I'm
> >>> running out of the same jar I always did; no updates).
> >>>
> >>> Is there *anywhere* else this parameter could possibly be getting
> >>> set? I've stopped and restarted MapReduce on the cluster with no
> >>> effect... it's getting re-read from somewhere, but I can't figure
> >>> out where.
> >>>
> >>> Thanks a lot,
> >>>
> >>> Brendan
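
Below is a minimal driver sketch of the kind of job under discussion, assuming the Cassandra 0.8/1.0-era org.apache.cassandra.hadoop.ConfigHelper API. "MyKeyspace", "MyColumnFamily", and the commented-out MyMapper are placeholders rather than anything from the thread, and the Cassandra connection settings (initial address, partitioner, RPC port) are left out because the setter names vary across versions. The point it illustrates: ConfigHelper.setInputSplitSize controls the rows per split, while the split *count* is computed inside ColumnFamilyInputFormat.getSplits(), which matches the behavior above where changing the size left the count pinned at 20.

import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SplitDebugDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "split-debug");
        job.setJarByClass(SplitDebugDriver.class);
        job.setInputFormatClass(ColumnFamilyInputFormat.class);
        // job.setMapperClass(MyMapper.class);  // placeholder mapper

        // Cassandra connection settings (initial address, partitioner,
        // RPC port) omitted here; setter names differ between versions.
        ConfigHelper.setInputColumnFamily(job.getConfiguration(),
                "MyKeyspace", "MyColumnFamily");

        // Rows per split, *not* the number of splits: the split count is
        // decided inside ColumnFamilyInputFormat.getSplits(), which is
        // why changing this value alone left the count at 20.
        ConfigHelper.setInputSplitSize(job.getConfiguration(), 65536);

        // Print the effective value just before submission. If it shows
        // 20 here even though neither the job nor mapred-site.xml sets
        // it, something on the classpath is injecting it.
        System.out.println("mapred.map.tasks = "
                + job.getConfiguration().get("mapred.map.tasks"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Printing the effective configuration just before submission, or inspecting the job.xml that the JobTracker web UI exposes for a running job, narrows down whether the value is arriving from a site file, from the job itself, or from something injected at split-calculation time.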