Hi Amareshwari,
Thanks for your replies; they are really good suggestions. But I have one remaining question. HDFS splits the input file into 64 MB blocks sequentially, by byte offset in the file, right? That conflicts with the idea of splitting the image into sub-images by its four corners. Is there a way to configure HDFS to make it compatible with the image split? Many thanks!
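To make the mismatch concrete, here is a minimal, self-contained sketch (all names hypothetical; it assumes an uncompressed, row-major image with no header, which a real .tif is not) of the byte ranges one quadrant occupies. Each quadrant's rows are scattered through the file, so a quadrant can never be a single contiguous byte range like an HDFS block:

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Illustrative only: compute the byte ranges of one quadrant of a
 * row-major, headerless, uncompressed image. A real TIFF has headers,
 * strips/tiles, and possibly compression.
 */
public class QuadrantRanges {

    /** A contiguous [offset, offset + length) byte range in the image file. */
    public static final class Range {
        public final long offset;
        public final long length;
        Range(long offset, long length) { this.offset = offset; this.length = length; }
        @Override public String toString() { return "[" + offset + ", " + (offset + length) + ")"; }
    }

    /**
     * Byte ranges for the quadrant at (qRow, qCol), each in {0, 1},
     * of a width x height image with bytesPerPixel bytes per pixel.
     */
    public static List<Range> quadrantRanges(int width, int height,
                                             int bytesPerPixel,
                                             int qRow, int qCol) {
        int halfW = width / 2, halfH = height / 2;
        long rowBytes = (long) width * bytesPerPixel;
        List<Range> ranges = new ArrayList<>();
        for (int r = 0; r < halfH; r++) {
            // One range per image row: the quadrant's slice of that row.
            long row = (long) qRow * halfH + r;
            long offset = row * rowBytes + (long) qCol * halfW * bytesPerPixel;
            ranges.add(new Range(offset, (long) halfW * bytesPerPixel));
        }
        return ranges;
    }

    public static void main(String[] args) {
        // 4x4 image, 1 byte per pixel: the top-right quadrant (sub-image 2)
        // is bytes [2, 4) of row 0 and [6, 8) of row 1 -- not contiguous.
        for (Range range : quadrantRanges(4, 4, 1, 0, 1)) {
            System.out.println(range);
        }
    }
}
```

This is why a custom split has to carry the quadrant's coordinates rather than a single (offset, length) pair: the RecordReader then seeks to each row slice itself, independently of where HDFS placed the block boundaries.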
Cao

On Mon, Dec 21, 2009 at 11:43 PM, Amareshwari Sri Ramadasu <amar...@yahoo-inc.com> wrote:

> Hi Cao,
>
> My answers are inline.
>
> On 12/21/09 8:42 PM, "Cao Kang" <weliam...@gmail.com> wrote:
>
> Hi Amareshwari,
> Thanks for your reply.
> But another question is: where and how should I define the split
> boundaries? Should I define them in the FileSplit constructor?
>
> I don't think you can extend FileSplit directly. I think you should write
> your own split, say ImageSplit, in which you can represent your image
> fully. For example, FileSplit represents the split using an offset and a
> length. You may need all four coordinates of your image.
>
> Furthermore, as far as I have seen, all the examples there use LongWritable
> to represent the offset of the split in the input file. What if the split
> is not sequential?
>
> Yes. FileSplit is used for representing text data.
>
> For example, in the image split, the sub-image byte arrays are not
> sequential in the input image. The byte split looks like this:
>
> |---------------|---------------|
> |               |               |
> |       1       |       2       |
> |               |               |
> |---------------|---------------|
> |               |               |
> |       3       |       4       |
> |               |               |
> |---------------|---------------|
>
> Each sub-image split will consist of a byte array. Where and how should
> this be defined in the InputFormat? Many thanks.
>
> In your InputFormat, you should define the getSplits() method, which
> returns your ImageSplits.
>
> Thanks
> Amareshwari
>
>
> On Mon, Dec 21, 2009 at 6:37 AM, Amareshwari Sri Ramadasu <
> amar...@yahoo-inc.com> wrote:
>
> > You should implement your split to represent the split information. Then
> > you should implement getSplits() in your InputFormat to get the splits
> > from your input, which divides the whole input into chunks. Here, each
> > split will be given to a map task.
> > You should also define a RecordReader which reads records from the split.
> > A map task processes one record at a time.
> >
> > See
> > http://hadoop.apache.org/common/docs/r0.20.0/mapred_tutorial.html#Job+Input
> >
> > Thanks
> > Amareshwari
> >
> > On 12/21/09 2:22 AM, "Cao Kang" <cak...@clarku.edu> wrote:
> >
> > Hi,
> > I have spent several days on the customized file input format in Hadoop.
> > Basically, we need to split one giant square-shaped image (.tif) into
> > four square-shaped smaller images. Where does the real split happen?
> > Should I implement it in the getSplits() function or in the next()
> > function? It's quite confusing.
> > Does anyone know, or can anyone provide some examples of it? Thanks.
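The ImageSplit Amareshwari suggests could be sketched as below against the old (org.apache.hadoop.mapred) API that the r0.20 tutorial uses. This is a non-runnable sketch, not a tested implementation: it needs the Hadoop jars on the classpath, and all field names are hypothetical. The key point is that the split serializes the quadrant's corner coordinates instead of FileSplit's (offset, length) pair:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.InputSplit;

/**
 * Sketch of a custom split carrying one quadrant of the image.
 * Field names are hypothetical.
 */
public class ImageSplit implements InputSplit {
    private Path file;         // the whole image file in HDFS
    private int x, y;          // top-left corner of the quadrant, in pixels
    private int width, height; // quadrant dimensions, in pixels

    public ImageSplit() { }    // no-arg constructor required for deserialization

    public ImageSplit(Path file, int x, int y, int width, int height) {
        this.file = file;
        this.x = x; this.y = y;
        this.width = width; this.height = height;
    }

    @Override
    public long getLength() {
        return (long) width * height; // rough size hint, assuming ~1 byte/pixel
    }

    @Override
    public String[] getLocations() {
        return new String[0]; // no locality hint; quadrant spans many blocks
    }

    @Override
    public void write(DataOutput out) throws IOException {
        Text.writeString(out, file.toString());
        out.writeInt(x); out.writeInt(y);
        out.writeInt(width); out.writeInt(height);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        file = new Path(Text.readString(in));
        x = in.readInt(); y = in.readInt();
        width = in.readInt(); height = in.readInt();
    }
}

// Your InputFormat's getSplits() would then return the four quadrants,
// with w and h read from the image header (not shown), e.g.:
//   new ImageSplit(file, 0,   0,   w/2, h/2)  // 1: top-left
//   new ImageSplit(file, w/2, 0,   w/2, h/2)  // 2: top-right
//   new ImageSplit(file, 0,   h/2, w/2, h/2)  // 3: bottom-left
//   new ImageSplit(file, w/2, h/2, w/2, h/2)  // 4: bottom-right
```

The matching RecordReader would open the file, seek to each row slice of its quadrant, and hand the assembled sub-image to the map task as a single record.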