Glad it helps. Could you also explain the reason for using MultipleInputs?


On Thu, Dec 6, 2012 at 2:59 PM, David Parks <davidpark...@yahoo.com> wrote:

> Figured it out; the problem was, as usual, in my code. I had wrapped
> TextInputFormat to replace the LongWritable key with a key representing the
> file name. It was a bit tricky to do because of changing the generics from
> <LongWritable, Text> to <Text, Text>, and I goofed up and misdirected a
> call to isSplittable, which was causing the issue.
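>
> For reference, here is a minimal sketch of that kind of wrapper, assuming
> the new mapreduce API; the class name comes from this thread, but the body
> is purely illustrative, not the actual code (note that the framework method
> is spelled isSplitable, with one "t"):
>
>     import java.io.IOException;
>
>     import org.apache.hadoop.fs.Path;
>     import org.apache.hadoop.io.Text;
>     import org.apache.hadoop.io.compress.CompressionCodecFactory;
>     import org.apache.hadoop.mapreduce.InputSplit;
>     import org.apache.hadoop.mapreduce.JobContext;
>     import org.apache.hadoop.mapreduce.RecordReader;
>     import org.apache.hadoop.mapreduce.TaskAttemptContext;
>     import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
>     import org.apache.hadoop.mapreduce.lib.input.FileSplit;
>     import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
>
>     // Emits <file name, line> pairs instead of TextInputFormat's
>     // <byte offset, line> pairs.
>     public class FileNameTextInputFormat extends FileInputFormat<Text, Text> {
>
>         @Override
>         protected boolean isSplitable(JobContext context, Path file) {
>             // Delegate to the codec check instead of the default "true",
>             // so that a .gz file is never split.
>             return new CompressionCodecFactory(context.getConfiguration())
>                     .getCodec(file) == null;
>         }
>
>         @Override
>         public RecordReader<Text, Text> createRecordReader(
>                 InputSplit split, TaskAttemptContext context) {
>             return new RecordReader<Text, Text>() {
>                 private final LineRecordReader lines = new LineRecordReader();
>                 private Text key;
>
>                 @Override
>                 public void initialize(InputSplit s, TaskAttemptContext ctx)
>                         throws IOException, InterruptedException {
>                     lines.initialize(s, ctx);
>                     // The key is the file name, the same for every record.
>                     key = new Text(((FileSplit) s).getPath().getName());
>                 }
>
>                 @Override
>                 public boolean nextKeyValue()
>                         throws IOException, InterruptedException {
>                     return lines.nextKeyValue();
>                 }
>
>                 @Override
>                 public Text getCurrentKey() {
>                     return key;
>                 }
>
>                 @Override
>                 public Text getCurrentValue()
>                         throws IOException, InterruptedException {
>                     return lines.getCurrentValue();
>                 }
>
>                 @Override
>                 public float getProgress()
>                         throws IOException, InterruptedException {
>                     return lines.getProgress();
>                 }
>
>                 @Override
>                 public void close() throws IOException {
>                     lines.close();
>                 }
>             };
>         }
>     }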
>
> It now works fine. Thanks very much for the response; it gave me pause to
> think enough to work out what I had done.
>
> Dave
>
> From: Hemanth Yamijala [mailto:yhema...@thoughtworks.com]
> Sent: Thursday, December 06, 2012 3:25 PM
> To: user@hadoop.apache.org
> Subject: Re: Map tasks processing some files multiple times
>
> David,
>
> You are using FileNameTextInputFormat. This is not in the Hadoop source, as
> far as I can see. Can you please confirm where it comes from? It seems like
> the isSplittable method of this input format may need checking.
>
>
> Another thing: given that you are adding the same input format for all the
> files, do you need MultipleInputs?
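>
> For comparison, a minimal sketch of the single-format setup, assuming one
> input format and one mapper for every file (the class and variable names
> are taken from this thread; the snippet is illustrative only):
>
>        // org.apache.hadoop.mapreduce.lib.input.FileInputFormat expands the
>        // glob in lsDir when it lists the input paths.
>        FileInputFormat.addInputPath(job, lsDir);
>        job.setInputFormatClass(FileNameTextInputFormat.class);
>        job.setMapperClass(LinkShareCatalogImportMapper.class);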
>
> Thanks
>
> Hemanth
>
> On Thu, Dec 6, 2012 at 1:06 PM, David Parks <davidpark...@yahoo.com>
> wrote:
>
> I believe I just tracked down the problem; maybe you can help confirm it if
> you’re familiar with this.
>
>
> I see that FileInputFormat is reporting gzip files (.gz extension) from the
> s3n filesystem as splittable, and I see that it’s creating multiple input
> splits for these files. I’m mapping the files directly off S3:
>
>
>        Path lsDir = new Path(
>                "s3n://fruggmapreduce/input/catalogs/linkshare_catalogs/*~*");
>
>        MultipleInputs.addInputPath(job, lsDir,
>                FileNameTextInputFormat.class, LinkShareCatalogImportMapper.class);
>
>
> I see in the map phase, based on my counters, that each of these map
> attempts is actually processing the entire file (I set up a counter per
> input file). So the 2 files which were processed twice had 2 splits (I now
> see that in some debug logs I created), and the 1 file that was processed 3
> times had 3 splits (the rest were smaller and were only assigned one split
> by default anyway).
>
>
> Am I wrong in expecting all files on the s3n filesystem to come through as
> not splittable? This seems to be a bug in the Hadoop code if I’m right.
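>
> For context, this is roughly how the stock TextInputFormat decides
> splittability in recent Hadoop versions (paraphrased, not copied verbatim);
> the decision hinges on the compression codec of each file, not on the
> filesystem it lives on:
>
>     // The codec classes live in org.apache.hadoop.io.compress.
>     @Override
>     protected boolean isSplitable(JobContext context, Path file) {
>         CompressionCodec codec = new CompressionCodecFactory(
>                 context.getConfiguration()).getCodec(file);
>         if (codec == null) {
>             return true;   // uncompressed text can be split freely
>         }
>         // GzipCodec does not implement SplittableCompressionCodec, so a
>         // .gz file should always end up in exactly one split.
>         return codec instanceof SplittableCompressionCodec;
>     }
>
> So a custom input format that fails to override isSplitable (the default in
> FileInputFormat returns true), or overrides it incorrectly, will have its
> .gz files split even though gzip cannot be decompressed from the middle of
> a stream.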
>
> David
>
> From: Raj Vishwanathan [mailto:rajv...@yahoo.com]
> Sent: Thursday, December 06, 2012 1:45 PM
> To: user@hadoop.apache.org
> Subject: Re: Map tasks processing some files multiple times
>
> Could it be due to spec-ex (speculative execution)? Does it make a
> difference in the end?
>
> Raj
>
> ------------------------------
>
> From: David Parks <davidpark...@yahoo.com>
> To: user@hadoop.apache.org
> Sent: Wednesday, December 5, 2012 10:15 PM
> Subject: Map tasks processing some files multiple times
>
> I’ve got a job that reads in 167 files from S3, but 2 of the files are
> being mapped twice and 1 of the files is mapped 3 times.
>
> This is the code I use to set up the mapper:
>
>        Path lsDir = new Path(
>                "s3n://fruggmapreduce/input/catalogs/linkshare_catalogs/*~*");
>
>        for (FileStatus f : lsDir.getFileSystem(getConf()).globStatus(lsDir))
>            log.info("Identified linkshare catalog: " + f.getPath().toString());
>
>        if (lsDir.getFileSystem(getConf()).globStatus(lsDir).length > 0) {
>            MultipleInputs.addInputPath(job, lsDir,
>                    FileNameTextInputFormat.class, LinkShareCatalogImportMapper.class);
>        }
>
> I can see from the logs that it sees only 1 copy of each of these files,
> and correctly identifies 167 files.
>
> I also have the following confirmation that it found the 167 files
> correctly:
>
> 2012-12-06 04:56:41,213 INFO
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat (main): Total input
> paths to process : 167
>
> When I look through the syslogs I can see that the file in question was
> opened by two different map attempts:
>
> ./task-attempts/job_201212060351_0001/attempt_201212060351_0001_m_000005_0/syslog:
> 2012-12-06 03:56:05,265 INFO
> org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening
> 's3n://fruggmapreduce/input/catalogs/linkshare_catalogs/linkshare~CD%20Universe~85.csv.gz'
> for reading
>
> ./task-attempts/job_201212060351_0001/attempt_201212060351_0001_m_000173_0/syslog:
> 2012-12-06 03:53:18,765 INFO
> org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening
> 's3n://fruggmapreduce/input/catalogs/linkshare_catalogs/linkshare~CD%20Universe~85.csv.gz'
> for reading
>
> This is only happening to these 3 files; all others seem to be fine. For
> the life of me I can’t see a reason why these files might be processed
> multiple times.
>
> Notably, map task 173 is a higher task number than should be possible.
> There are 167 input files (from S3, gzipped), so there should be 167 map
> tasks, but I see a total of 176.
>
> Any thoughts/ideas/guesses?
>
