Also, I am surprised at how you are writing the MapReduce application here. Map and
reduce work with key/value pairs.
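
For example, a mapper always consumes and emits key/value pairs, even when the value is a whole XML record produced by a custom reader. A minimal sketch with the new API; the class name, output key, and the assumption that Text holds one complete XML record are only illustrative:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative only: the mapper still works on key/value pairs even when
// the value is a complete XML document handed over by a custom record reader.
public class XmlMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // 'value' is assumed to hold one whole XML record; emit it under a fixed key.
    context.write(new Text("xml-record"), value);
  }
}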
________________________________________
From: Uma Maheswara Rao G
Sent: Tuesday, November 22, 2011 8:33 AM
To: common-user@hadoop.apache.org; core-u...@hadoop.apache.org
Subject: RE: Regarding loading a big XML file to HDFS

>______________________________________
>From: hari708 [hari...@gmail.com]
>Sent: Tuesday, November 22, 2011 6:50 AM
>To: core-u...@hadoop.apache.org
>Subject: Regarding loading a big XML file to HDFS

>Hi,
>I have a big file consisting of XML data. The XML is not represented as a
>single line in the file. If we stream this file to a Hadoop directory using
>the ./hadoop dfs -put command, how does the distribution happen?

HDFS will divide the file into blocks based on the block size configured for the file.
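
For example (a sketch only: dfs.block.size is the 0.20/1.x property name, and the 256 MB value and paths are just illustrative), you can set the block size per file when you load it:

./hadoop dfs -D dfs.block.size=268435456 -put big.xml /user/hadoop/big.xml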

>Basically, in my MapReduce program I am expecting a complete XML document as my
>input. I have a CustomReader (for XML) in my MapReduce job configuration. My
>main confusion is: if the NameNode distributes data to DataNodes, there is a
>chance that one part of the XML can go to one DataNode and the other half to
>another DataNode. If that is the case, will my custom XMLReader in the
>MapReduce job be able to combine them (as MapReduce reads data locally only)?
>Please help me with this.

If you cannot do anything in parallel here, make your input split size cover the
complete file size, and also configure the block size to cover the complete file
size. In this case, only one mapper and one reducer will be spawned for the file,
but you won't get any parallel-processing advantage.
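
An alternative way to get the same one-mapper-per-file effect is to make the input format non-splittable. A minimal sketch, assuming your custom XML reader is wrapped in a FileInputFormat subclass (XmlRecordReader is a hypothetical name standing in for your existing reader):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class WholeXmlInputFormat extends FileInputFormat<LongWritable, Text> {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    // Never split the file, regardless of how many HDFS blocks it spans.
    return false;
  }

  @Override
  public RecordReader<LongWritable, Text> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    // Plug in your existing custom XML record reader here (hypothetical name).
    return new XmlRecordReader();
  }
}

With isSplitable() returning false, the whole file goes to a single mapper even if it spans several HDFS blocks, which matches the single-mapper behaviour described above.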

>--
>View this message in context: 
>http://old.nabble.com/Regarding-loading-a-big-XML-file-to-HDFS-tp32871900p32871900.html
>Sent from the Hadoop core-user mailing list archive at Nabble.com.
