David,

You are using FileNameTextInputFormat. This is not in the Hadoop source, as far
as I can see. Can you please confirm where this class comes from? It seems like
the isSplitable() method of this input format may need checking.
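If it extends FileInputFormat directly, the inherited isSplitable() returns true
for every file, which would explain the extra splits on the .gz inputs. A rough
sketch of what the override might look like (this is only a guess, since I don't
have your source; it assumes the format extends TextInputFormat):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class FileNameTextInputFormat extends TextInputFormat {
        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            // Only allow splitting when no compression codec matches the file name,
            // so a .gz file always gets exactly one split (and one map task).
            return new CompressionCodecFactory(context.getConfiguration())
                    .getCodec(file) == null;
        }
    }
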
Another thing: given you are adding the same input format for all files, do you
need MultipleInputs?

Thanks
Hemanth

On Thu, Dec 6, 2012 at 1:06 PM, David Parks <davidpark...@yahoo.com> wrote:
> I believe I just tracked down the problem, maybe you can help confirm if
> you're familiar with this.
>
> I see that FileInputFormat is specifying that gzip files (.gz extension)
> from the s3n filesystem are being reported as *splittable*, and I see that
> it's creating multiple input splits for these files. I'm mapping the files
> directly off S3:
>
>     Path lsDir = new Path(
>         "s3n://fruggmapreduce/input/catalogs/linkshare_catalogs/*~*");
>     MultipleInputs.addInputPath(job, lsDir, FileNameTextInputFormat.class,
>         LinkShareCatalogImportMapper.class);
>
> I see in the map phase, based on my counters, that it's actually
> processing the entire file (I set up a counter per input file). So the 2
> files which were processed twice had 2 splits (I now see that in some
> debug logs I created), and the 1 file that was processed 3 times had 3
> splits (the rest were smaller and were only assigned one split by default
> anyway).
>
> Am I wrong in expecting all files on the s3n filesystem to come through as
> not-splittable? This seems to be a bug in Hadoop code if I'm right.
>
> David
>
> From: Raj Vishwanathan [mailto:rajv...@yahoo.com]
> Sent: Thursday, December 06, 2012 1:45 PM
> To: user@hadoop.apache.org
> Subject: Re: Map tasks processing some files multiple times
>
> Could it be due to spec-ex? Does it make a difference in the end?
>
> Raj
>
> ------------------------------
> From: David Parks <davidpark...@yahoo.com>
> To: user@hadoop.apache.org
> Sent: Wednesday, December 5, 2012 10:15 PM
> Subject: Map tasks processing some files multiple times
>
> I've got a job that reads in 167 files from S3, but 2 of the files are
> being mapped twice and 1 of the files is mapped 3 times.
>
> This is the code I use to set up the mapper:
>
>     Path lsDir = new Path(
>         "s3n://fruggmapreduce/input/catalogs/linkshare_catalogs/*~*");
>     for (FileStatus f : lsDir.getFileSystem(getConf()).globStatus(lsDir))
>         log.info("Identified linkshare catalog: " + f.getPath().toString());
>     if (lsDir.getFileSystem(getConf()).globStatus(lsDir).length > 0) {
>         MultipleInputs.addInputPath(job, lsDir,
>             FileNameTextInputFormat.class, LinkShareCatalogImportMapper.class);
>     }
>
> I can see from the logs that it sees only 1 copy of each of these files,
> and correctly identifies 167 files.
>
> I also have the following confirmation that it found the 167 files
> correctly:
>
> 2012-12-06 04:56:41,213 INFO
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat (main): Total input
> paths to process : 167
>
> When I look through the syslogs I can see that the file in question was
> opened by two different map attempts:
>
> ./task-attempts/job_201212060351_0001/attempt_201212060351_0001_m_000005_0/syslog:
> 2012-12-06 03:56:05,265 INFO
> org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening
> 's3n://fruggmapreduce/input/catalogs/linkshare_catalogs/linkshare~CD%20Universe~85.csv.gz'
> for reading
>
> ./task-attempts/job_201212060351_0001/attempt_201212060351_0001_m_000173_0/syslog:
> 2012-12-06 03:53:18,765 INFO
> org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening
> 's3n://fruggmapreduce/input/catalogs/linkshare_catalogs/linkshare~CD%20Universe~85.csv.gz'
> for reading
>
> This is only happening to these 3 files; all others seem to be fine. For
> the life of me I can't see a reason why these files might be processed
> multiple times.
>
> Notably, map attempt 173 is a higher task number than should be possible.
> There are 167 input files (from S3, gzipped), thus there should be 167 map
> tasks. But I see a total of 176 map tasks.
>
> Any thoughts/ideas/guesses?
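
One more thought on confirming this: you can ask the input format directly how
many splits it generates for those paths, outside of the job. This is only a
rough standalone sketch (the class name SplitCountCheck is made up, and your s3n
credentials would have to be present in the Configuration); swap TextInputFormat
for your FileNameTextInputFormat to compare the two:

    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class SplitCountCheck {
        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration());
            FileInputFormat.addInputPath(job,
                new Path("s3n://fruggmapreduce/input/catalogs/linkshare_catalogs/*~*"));
            // getSplits() lists the matching files and applies isSplitable() and the
            // block/split size settings, the same way the job client does when it
            // plans the map tasks.
            List<InputSplit> splits = new TextInputFormat().getSplits(job);
            System.out.println("Total splits: " + splits.size());
            for (InputSplit s : splits) {
                System.out.println(s);   // FileSplit prints path, start offset and length
            }
        }
    }

If that prints more than 167 splits for your 167 .gz files, the isSplitable()
behaviour above is almost certainly the cause.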