It also seems logical that launching 4000 map tasks on a 20-node cluster is 
going to carry a lot of overhead.  20 does not seem like the ideal number, 
but I don't really know the internals of Cassandra that well.  You might 
want to post this question on the Cassandra list to see if they can help 
you identify a way to increase the number of map tasks.
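
If the jobs use Cassandra's ColumnFamilyInputFormat (an assumption on my
part), the split count is driven by cassandra.input.split.size, which you
can set through ConfigHelper.  A minimal sketch of asking for more, smaller
splits:

    import org.apache.cassandra.hadoop.ConfigHelper;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class MoreSplits {
        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "cassandra-scan");
            // cassandra.input.split.size is the number of rows per split:
            // fewer rows per split means more splits, and so more map
            // tasks.  16384 is illustrative; the default is 64k rows if
            // I remember right.
            ConfigHelper.setInputSplitSize(job.getConfiguration(), 16 * 1024);
        }
    }

Smaller splits buy parallelism at the cost of per-task launch overhead, so
the sweet spot is probably somewhere between 20 and 4000.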

--Bobby Evans

On 11/5/11 9:33 AM, "Brendan W." <bw8...@gmail.com> wrote:

Yeah, that's my guess now: somebody must have hacked the Cassandra libs on
me...I just wanted to see if there were other possibilities for where that
parameter was being set.

Thanks a lot for the help.

On Fri, Nov 4, 2011 at 2:11 PM, Harsh J <ha...@cloudera.com> wrote:

> Could it just be that Cassandra has changed the way its splits are generated?
> Were the Cassandra client libs changed at any point? Have you looked at its
> input format's sources?
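>
> For what it's worth, the map task count comes straight from whatever the
> input format's getSplits() returns, so a patched input format would explain
> all of this.  Purely as an illustration (this is not Cassandra's actual
> code), a hack capping splits at the cluster size could be as small as:
>
>     import java.io.IOException;
>     import java.util.List;
>     import org.apache.hadoop.mapreduce.InputSplit;
>     import org.apache.hadoop.mapreduce.JobContext;
>     import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
>
>     // Illustrative only: any input format that trims the list returned
>     // by getSplits() caps the job's map task count, no matter what
>     // split size the client asks for.
>     public class CappedInputFormat extends TextInputFormat {
>         private static final int CAP = 20; // e.g. the cluster size
>
>         @Override
>         public List<InputSplit> getSplits(JobContext context) throws IOException {
>             List<InputSplit> splits = super.getSplits(context);
>             return splits.size() > CAP ? splits.subList(0, CAP) : splits;
>         }
>     }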
>
> On 04-Nov-2011, at 10:05 PM, Brendan W. wrote:
>
> > Plain Java MR, using the Cassandra inputFormat to read out of Cassandra.
> >
> > Perhaps somebody hacked the inputFormat code on me...
> >
> > But what's weird is that the parameter mapred.map.tasks didn't appear in
> > the job confs before at all.  Now it does, with a value of 20 (which
> > happens to be the number of machines in the cluster), and that's without
> > the jobs or the mapred-site.xml files themselves changing.
> >
> > The inputSplitSize is set explicitly in the jobs and has not been changed
> > (except that I subsequently fiddled with it a little to see whether it
> > affected the 20-split cap, and it didn't; it changed only the split size,
> > not the number of splits).
> >
> > After I submit the job, I get a message "TOTAL NUMBER OF SPLIT = 20"
> > before a list of the input splits...it sort of looks like a hack, but I
> > can't find where it is.
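> >
> > One thing I still mean to try, assuming our Hadoop build has
> > Configuration.getPropertySources() (newer releases do), is asking the
> > conf directly which resource the value came from:
> >
> >     import org.apache.hadoop.conf.Configuration;
> >     import org.apache.hadoop.mapred.JobConf;
> >
> >     public class WhoSetIt {
> >         public static void main(String[] args) {
> >             // JobConf loads mapred-default.xml and mapred-site.xml
> >             // from the classpath on top of the core resources.
> >             Configuration conf = new JobConf();
> >             String[] sources = conf.getPropertySources("mapred.map.tasks");
> >             if (sources == null) {
> >                 System.out.println("mapred.map.tasks not set by any loaded resource");
> >             } else {
> >                 for (String s : sources) {
> >                     System.out.println("mapred.map.tasks set by: " + s);
> >                 }
> >             }
> >         }
> >     }
> >
> > Failing that, grepping the conf directories and the job jar for
> > mapred.map.tasks should at least narrow down where it's coming from.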
> >
> > On Fri, Nov 4, 2011 at 11:58 AM, Harsh J <ha...@cloudera.com> wrote:
> >
> >> Brendan,
> >>
> >> Are these jobs (whose split behavior has changed) run via Hive etc., or
> >> plain Java MR?
> >>
> >> If it's the former, do you have users using newer versions of those tools?
> >>
> >> On 04-Nov-2011, at 8:03 PM, Brendan W. wrote:
> >>
> >>> Hi,
> >>>
> >>> On my cluster of 20 machines, I used to run jobs (via "hadoop jar ...")
> >>> that would spawn around 4000 map tasks.  Now when I run the same jobs,
> >>> that number is 20; and I notice that in the job configuration, the
> >>> parameter mapred.map.tasks is set to 20, whereas it never used to be
> >>> present in the configuration file at all.
> >>>
> >>> Changing the input split size in the job doesn't affect this--I get the
> >>> split size I ask for, but the *number* of input splits is still capped
> >>> at 20--i.e., the job isn't reading all of my data.
> >>>
> >>> The mystery to me is where this parameter could be getting set.  It is
> >>> not present in the mapred-site.xml file in <hadoop home>/conf on any
> >>> machine in the cluster, and it is not being set in the job (I'm running
> >>> from the same jar I always did; no updates).
> >>>
> >>> Is there *anywhere* else this parameter could possibly be getting set?
> >>> I've stopped and restarted map-reduce on the cluster with no effect...
> >>> it's getting re-read in from somewhere, but I can't figure out where.
> >>>
> >>> Thanks a lot,
> >>>
> >>> Brendan
> >>
> >>
>
>
