Hi All

I'm sharing my understanding here; please correct me if I'm wrong (Uma and Michael). The explanation by Michael describes the common working of MapReduce programs, I believe.

Take the case of a plain text file of size 96 MB. If my HDFS block size is 64 MB, then this file would be split across two blocks: block A (64 MB) and block B (32 MB). This splitting and storing in HDFS happens purely based on size, never based on any end-of-line characters. That means the last line of block A may not be complete: part of it sits in block A and the rest in block B. The file is stored in HDFS this way.

When we process the HDFS-stored file using MapReduce (say with the default TextInputFormat), two mappers are spawned by the JobTracker, mapper A and mapper B. Mapper A reads block A, and when it reaches the last line it won't find the line delimiter, so it reads on until the first line delimiter in block B. Mapper B starts processing block B only after the first line delimiter. The mappers know whether the block they are reading is the first block of a file or an intermediate one from the split offset: if the offset is 0, it is the first block of the file. Please add on if there are more parameters considered here beyond the offset, such as some meta information. So we don't need a custom input format/record reader for the default behaviour of reading to the end of a line/record.

Such processing hardly makes sense for complex XML, though, since XML is based entirely on parent-child relationships (it would work well for simple XML with just one level of hierarchy). For example, consider the mock XML below:
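The boundary handling above (the mapper for a non-first split skips to the first line delimiter; the previous mapper reads past its split end to finish the straddling line) can be sketched outside Hadoop. This is a minimal Python simulation of the idea behind LineRecordReader, not the actual Hadoop code; the function name and the toy "blocks" are my own:

```python
def read_split(data: bytes, start: int, length: int) -> list[bytes]:
    """Simulate how one mapper reads line records from one split.

    - If start > 0, skip to just past the first newline: that partial
      first line belongs to the previous split's mapper.
    - Keep emitting whole lines while they *start* inside the split;
      the last line may run past the split end into the next block.
    """
    pos = start
    if start > 0:
        # Non-zero offset: not the first block, so skip the partial line.
        nl = data.find(b"\n", start)
        pos = nl + 1 if nl != -1 else len(data)
    end = start + length
    records = []
    while pos < end and pos < len(data):
        nl = data.find(b"\n", pos)
        if nl == -1:
            records.append(data[pos:])
            pos = len(data)
        else:
            records.append(data[pos:nl])  # nl may lie past `end`
            pos = nl + 1
    return records

# A "file" split into two 16-byte blocks; "charlie\n" straddles the boundary.
data = b"alpha\nbravo\ncharlie\ndelta\n"
block = 16
rec_a = read_split(data, 0, block)       # mapper A finishes "charlie"
rec_b = read_split(data, block, block)   # mapper B skips the partial line
```

Running this, mapper A gets alpha, bravo, charlie and mapper B gets only delta: every line is processed exactly once, with no duplication across the split.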
<Vehicle>
  <Car>
    <BMW>
      <Sedan>
        <3-Series>
          <min-torque></min-torque>
          ------------------------- (split boundary) -------------------------
          <max-torque></max-torque>
        </3-Series>
      </Sedan>
      <SUV></SUV>
    </BMW>
  </Car>
  <Truck></Truck>
  <Bus></Bus>
</Vehicle>

Even if we split it in between (even if the split happens at a line boundary), it would be hard to process, since the opening tags fall in one block under one mapper's boundary and the closing tags fall in another block under another mapper's boundary. So if we are mining some data from them it hardly makes sense. We would need to incorporate logic here, in terms of regex or similar, to identify the closing tags from the second block.

Maybe one query remains: why use MapReduce for XML if we can't exploit parallel processing?
- We can process multiple small XML files in parallel, one per mapper, without splitting, to mine and extract information. But we lose a good extent of data locality here. There is a sample user-defined input format given in Hadoop: The Definitive Guide, called WholeFileInputFormat, which serves this purpose.
- For larger XML files we have to process the splits themselves in parallel. There is a default class provided in Hadoop for this, StreamXmlRecordReader, which can be used outside of streaming as well. For details, see:
http://hadoop.apache.org/common/docs/stable/api/org/apache/hadoop/streaming/StreamXmlRecordReader.html

Hope it helps!

Regards
Bejoy K S

On Tue, Nov 22, 2011 at 9:31 AM, Inder Pall <inder.p...@gmail.com> wrote:
> what about the records at skipped boundaries?
> Instead is there a way to define a custom splitter in hadoop which can
> understand record boundaries.
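A StreamXmlRecordReader-style reader essentially scans forward from the split start for a configured begin tag and reads until the matching end tag, crossing the split boundary if the record straddles it (the real class is Java, configured via streaming properties for the begin/end patterns, and assumes flat, non-nested records). Here is a minimal Python simulation of that idea; the function name and the toy document are my own:

```python
def xml_records(data: str, start: int, length: int,
                begin: str = "<property>", end: str = "</property>") -> list[str]:
    """Collect whole begin...end records whose begin tag starts inside
    [start, start+length). A record may finish past the split end; a
    record that began in the previous split is skipped automatically,
    because its begin tag lies before `start`. Assumes flat records
    (no nested begin tags), as StreamXmlRecordReader does.
    """
    records = []
    pos = start
    split_end = start + length
    while True:
        b = data.find(begin, pos)
        if b == -1 or b >= split_end:
            break                      # begin tag belongs to a later split
        e = data.find(end, b)
        if e == -1:
            break                      # malformed tail: no closing tag
        records.append(data[b:e + len(end)])
        pos = e + len(end)
    return records

doc = ("<conf><property><name>a</name></property>"
       "<property><name>b</name></property></conf>")
split = 20                             # falls in the middle of record "a"
r1 = xml_records(doc, 0, split)        # finishes record "a" past the boundary
r2 = xml_records(doc, split, len(doc) - split)  # skips into record "b"
```

Each record comes out exactly once even though the split point lands mid-record, which is the same no-duplication guarantee Michael describes below for line records.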
>
> - Inder
>
> On Tue, Nov 22, 2011 at 9:28 AM, Michael Segel <michael_se...@hotmail.com>wrote:
> >
> > Just wanted to address this:
> >
> > > >Basically in My mapreduce program i am expecting a complete XML as my
> > > >input.i have a CustomReader(for XML) in my mapreduce job configuration.My
> > > >main confusion is if namenode distribute data to DataNodes ,there is a
> > > >chance that a part of xml can go to one data node and other half can go in
> > > >another datanode.If that is the case will my custom XMLReader in the
> > > >mapreduce be able to combine it(as mapreduce reads data locally only).
> > > >Please help me on this?
> >
> > > if you can not do anything parallel here, make your input split size to
> > > cover complete file size.
> > > also configure the block size to cover complete file size. In this
> > > case, only one mapper and reducer will be spawned for file. But here you
> > > wont get any parallel processing advantage.
> >
> > You can do this in parallel.
> > You need to write a custom input format class. (Which is what you're
> > already doing...)
> >
> > Lets see if I can explain this correctly.
> > You have an XML record split across block A and block B.
> >
> > Your map reduce job will instantiate a task per block.
> > So in mapper processing block A, you read and process the XML records...
> > when you get to the last record, which is only in part of A, mapper A will
> > continue on to block B and continue reading the last record. Then stops.
> > In mapper for block B, the reader will skip and not process data until it
> > sees the start of a record. So you end up getting all of your XML records
> > processed (no duplication) and done in parallel.
> >
> > Does that make sense?
> >
> > -Mike
> >
> > > Date: Tue, 22 Nov 2011 03:08:20 +0000
> > > From: mahesw...@huawei.com
> > > Subject: RE: Regarding loading a big XML file to HDFS
> > > To: common-user@hadoop.apache.org; core-u...@hadoop.apache.org
> > >
> > > Also i am surprising, how you are writing mapreduce application here.
> > > Map and reduce will work with key value pairs.
> > > ________________________________________
> > > From: Uma Maheswara Rao G
> > > Sent: Tuesday, November 22, 2011 8:33 AM
> > > To: common-user@hadoop.apache.org; core-u...@hadoop.apache.org
> > > Subject: RE: Regarding loading a big XML file to HDFS
> > >
> > > >______________________________________
> > > >From: hari708 [hari...@gmail.com]
> > > >Sent: Tuesday, November 22, 2011 6:50 AM
> > > >To: core-u...@hadoop.apache.org
> > > >Subject: Regarding loading a big XML file to HDFS
> > > >
> > > >Hi,
> > > >I have a big file consisting of XML data.the XML is not represented as a
> > > >single line in the file. if we stream this file using ./hadoop dfs -put
> > > >command to a hadoop directory .How the distribution happens.?
> > >
> > > HDFS will didvide the blocks based on your block size configured for the file.
> > >
> > > >Basically in My mapreduce program i am expecting a complete XML as my
> > > >input.i have a CustomReader(for XML) in my mapreduce job configuration.My
> > > >main confusion is if namenode distribute data to DataNodes ,there is a
> > > >chance that a part of xml can go to one data node and other half can go in
> > > >another datanode.If that is the case will my custom XMLReader in the
> > > >mapreduce be able to combine it(as mapreduce reads data locally only).
> > > >Please help me on this?
> > >
> > > if you can not do anything parallel here, make your input split size to
> > > cover complete file size.
> > > also configure the block size to cover complete file size. In this case,
> > > only one mapper and reducer will be spawned for file.
> > > But here you wont get
> > > any parallel processing advantage.
> > >
> > > >--
> > > >View this message in context:
> > > >http://old.nabble.com/Regarding-loading-a-big-XML-file-to-HDFS-tp32871900p32871900.html
> > > >Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
> --
> -- Inder