Hi All

I'm sharing my understanding here; please correct me if I'm wrong (Uma and Michael). The explanation by Michael describes the common working of MapReduce programs, I believe.

Take the case of a plain text file of size 96 MB. If my HDFS block size is 64 MB, then this file would be split across two blocks: block A (64 MB) and block B (32 MB). This splitting and storing in HDFS happens purely based on size, never based on any end-of-line characters. That means the last line of block A may not be complete: part of it sits in block A and the rest in block B. The file is stored in HDFS this way.

When we process the HDFS-stored file using MapReduce (say with the default TextInputFormat), two mappers are spawned by the JobTracker, mapper A and mapper B. Mapper A reads block A, and when it reaches the last line it won't find the line delimiter, so it reads on until the first line delimiter in block B. Mapper B starts processing block B only after the first line delimiter. The mappers know whether the block they are reading is the first block of a file or an intermediate one from the split offset: if the offset is 0, it is the first block of the file. Please add on if there are more parameters considered here beyond the offset, such as some meta information. So we don't need a custom input format/record reader for the default behaviour of reading to the end of a line/record.

Such processing hardly makes sense for complex XML, though, since XML is based entirely on parent-child relationships (it would work well for simple XML with just one level of hierarchy). For example, consider the mock XML below:
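The boundary handling above (the mapper for a non-first split skips to the first line delimiter; the previous mapper reads past its split end to finish the straddling line) can be sketched outside Hadoop. This is a minimal Python simulation of the idea behind LineRecordReader, not the actual Hadoop code; the function name and the toy "blocks" are my own:

```python
def read_split(data: bytes, start: int, length: int) -> list[bytes]:
    """Simulate how one mapper reads line records from one split.

    - If start > 0, skip to just past the first newline: that partial
      first line belongs to the previous split's mapper.
    - Keep emitting whole lines while they *start* inside the split;
      the last line may run past the split end into the next block.
    """
    pos = start
    if start > 0:
        # Non-zero offset: not the first block, so skip the partial line.
        nl = data.find(b"\n", start)
        pos = nl + 1 if nl != -1 else len(data)
    end = start + length
    records = []
    while pos < end and pos < len(data):
        nl = data.find(b"\n", pos)
        if nl == -1:
            records.append(data[pos:])
            pos = len(data)
        else:
            records.append(data[pos:nl])  # nl may lie past `end`
            pos = nl + 1
    return records

# A "file" split into two 16-byte blocks; "charlie\n" straddles the boundary.
data = b"alpha\nbravo\ncharlie\ndelta\n"
block = 16
rec_a = read_split(data, 0, block)       # mapper A finishes "charlie"
rec_b = read_split(data, block, block)   # mapper B skips the partial line
```

Running this, mapper A gets alpha, bravo, charlie and mapper B gets only delta: every line is processed exactly once, with no duplication across the split.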
<Vehicle>
  <Car>
    <BMW>
      <Sedan>
        <3-Series>
          <min-torque></min-torque>
          ------------------------- (split boundary) -------------------------
          <max-torque></max-torque>
        </3-Series>
      </Sedan>
      <SUV></SUV>
    </BMW>
  </Car>
  <Truck></Truck>
  <Bus></Bus>
</Vehicle>

Even if we split it in between (even if the split happens at a line boundary), it would be hard to process, since the opening tags fall in one block under one mapper's boundary and the closing tags fall in another block under another mapper's boundary. So if we are mining some data from them it hardly makes sense. We would need to incorporate logic here, in terms of regex or similar, to identify the closing tags from the second block.

Maybe one query remains: why use MapReduce for XML if we can't exploit parallel processing?
- We can process multiple small XML files in parallel, one per mapper, without splitting, to mine and extract information. But we lose a good extent of data locality here. There is a sample user-defined input format given in Hadoop: The Definitive Guide, called WholeFileInputFormat, which serves this purpose.
- For larger XML files we have to process the splits themselves in parallel. There is a default class provided in Hadoop for this, StreamXmlRecordReader, which can be used outside of streaming as well. For details, see:
http://hadoop.apache.org/common/docs/stable/api/org/apache/hadoop/streaming/StreamXmlRecordReader.html

Hope it helps!

Regards
Bejoy K S

On Tue, Nov 22, 2011 at 9:31 AM, Inder Pall <inder.p...@gmail.com> wrote:
> what about the records at skipped boundaries?
> Instead is there a way to define a custom splitter in hadoop which can
> understand record boundaries.
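A StreamXmlRecordReader-style reader essentially scans forward from the split start for a configured begin tag and reads until the matching end tag, crossing the split boundary if the record straddles it (the real class is Java, configured via streaming properties for the begin/end patterns, and assumes flat, non-nested records). Here is a minimal Python simulation of that idea; the function name and the toy document are my own:

```python
def xml_records(data: str, start: int, length: int,
                begin: str = "<property>", end: str = "</property>") -> list[str]:
    """Collect whole begin...end records whose begin tag starts inside
    [start, start+length). A record may finish past the split end; a
    record that began in the previous split is skipped automatically,
    because its begin tag lies before `start`. Assumes flat records
    (no nested begin tags), as StreamXmlRecordReader does.
    """
    records = []
    pos = start
    split_end = start + length
    while True:
        b = data.find(begin, pos)
        if b == -1 or b >= split_end:
            break                      # begin tag belongs to a later split
        e = data.find(end, b)
        if e == -1:
            break                      # malformed tail: no closing tag
        records.append(data[b:e + len(end)])
        pos = e + len(end)
    return records

doc = ("<conf><property><name>a</name></property>"
       "<property><name>b</name></property></conf>")
split = 20                             # falls in the middle of record "a"
r1 = xml_records(doc, 0, split)        # finishes record "a" past the boundary
r2 = xml_records(doc, split, len(doc) - split)  # skips into record "b"
```

Each record comes out exactly once even though the split point lands mid-record, which is the same no-duplication guarantee Michael describes below for line records.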
>
> - Inder
>
> On Tue, Nov 22, 2011 at 9:28 AM, Michael Segel <michael_se...@hotmail.com>wrote:
> >
> > Just wanted to address this:
> >
> > > >Basically in My mapreduce program i am expecting a complete XML as my
> > > >input.i have a CustomReader(for XML) in my mapreduce job configuration.My
> > > >main confusion is if namenode distribute data to DataNodes ,there is a
> > > >chance that a part of xml can go to one data node and other half can go in
> > > >another datanode.If that is the case will my custom XMLReader in the
> > > >mapreduce be able to combine it(as mapreduce reads data locally only).
> > > >Please help me on this?
> >
> > > if you can not do anything parallel here, make your input split size to
> > > cover complete file size.
> > > also configure the block size to cover complete file size. In this
> > > case, only one mapper and reducer will be spawned for file. But here you
> > > wont get any parallel processing advantage.
> >
> > You can do this in parallel.
> > You need to write a custom input format class. (Which is what you're
> > already doing...)
> >
> > Lets see if I can explain this correctly.
> > You have an XML record split across block A and block B.
> >
> > Your map reduce job will instantiate a task per block.
> > So in mapper processing block A, you read and process the XML records...
> > when you get to the last record, which is only in part of A, mapper A will
> > continue on to block B and continue reading the last record. Then stops.
> > In mapper for block B, the reader will skip and not process data until it
> > sees the start of a record. So you end up getting all of your XML records
> > processed (no duplication) and done in parallel.
> >
> > Does that make sense?
> >
> > -Mike
> >
> > > Date: Tue, 22 Nov 2011 03:08:20 +0000
> > > From: mahesw...@huawei.com
> > > Subject: RE: Regarding loading a big XML file to HDFS
> > > To: common-user@hadoop.apache.org; core-u...@hadoop.apache.org
> > >
> > > Also i am surprising, how you are writing mapreduce application here.
> > > Map and reduce will work with key value pairs.
> > > ________________________________________
> > > From: Uma Maheswara Rao G
> > > Sent: Tuesday, November 22, 2011 8:33 AM
> > > To: common-user@hadoop.apache.org; core-u...@hadoop.apache.org
> > > Subject: RE: Regarding loading a big XML file to HDFS
> > >
> > > >______________________________________
> > > >From: hari708 [hari...@gmail.com]
> > > >Sent: Tuesday, November 22, 2011 6:50 AM
> > > >To: core-u...@hadoop.apache.org
> > > >Subject: Regarding loading a big XML file to HDFS
> > > >
> > > >Hi,
> > > >I have a big file consisting of XML data.the XML is not represented as a
> > > >single line in the file. if we stream this file using ./hadoop dfs -put
> > > >command to a hadoop directory .How the distribution happens.?
> > >
> > > HDFS will didvide the blocks based on your block size configured for the file.
> > >
> > > >Basically in My mapreduce program i am expecting a complete XML as my
> > > >input.i have a CustomReader(for XML) in my mapreduce job configuration.My
> > > >main confusion is if namenode distribute data to DataNodes ,there is a
> > > >chance that a part of xml can go to one data node and other half can go in
> > > >another datanode.If that is the case will my custom XMLReader in the
> > > >mapreduce be able to combine it(as mapreduce reads data locally only).
> > > >Please help me on this?
> > >
> > > if you can not do anything parallel here, make your input split size to
> > > cover complete file size.
> > > also configure the block size to cover complete file size. In this case,
> > > only one mapper and reducer will be spawned for file.
> > > But here you wont get
> > > any parallel processing advantage.
> > >
> > > >--
> > > >View this message in context:
> > > >http://old.nabble.com/Regarding-loading-a-big-XML-file-to-HDFS-tp32871900p32871900.html
> > > >Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
> --
> -- Inder