On Fri, Feb 17, 2012 at 11:37 PM, Srinivas Surasani vas...@gmail.com wrote:
Hi Mohit,
You can use Pig for processing XML files. PiggyBank has a built-in load
function for loading XML files.
Also you can specify pig.maxCombinedSplitSize and
pig.splitCombination for efficient processing.
On Tue, Feb 14, 2012 at 10:56 AM, W.P. McNeill bill...@gmail.com wrote:
I'm not sure what you mean by flat format here.
In my scenario, I have a file input.xml that looks like this.
<myfile>
  <section>
    <value>1</value>
  </section>
  <section>
    <value>2</value>
  </section>
</myfile>
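As a rough illustration of the repeated-section layout above, here is a minimal Python sketch that cuts such a file into one record per section by scanning for a start tag and an end tag. It is plain Python, not Hadoop or Pig code. The function name iter_sections, its tag arguments, and the inline SAMPLE string are hypothetical names chosen for this example.

```python
import xml.etree.ElementTree as ET

# Inline copy of the sample layout (assumption: tags are <myfile>, <section>, <value>).
SAMPLE = """<myfile>
  <section>
    <value>1</value>
  </section>
  <section>
    <value>2</value>
  </section>
</myfile>"""


def iter_sections(text, start_tag="<section>", end_tag="</section>"):
    """Yield each start_tag ... end_tag span of text as one record string."""
    pos = 0
    while True:
        start = text.find(start_tag, pos)
        if start == -1:
            return  # no more sections
        end = text.find(end_tag, start)
        if end == -1:
            return  # unterminated section: stop rather than emit a partial record
        end += len(end_tag)
        yield text[start:end]
        pos = end


# Each record is a small, well-formed XML fragment that can be parsed on its own.
records = list(iter_sections(SAMPLE))
values = [ET.fromstring(record).findtext("value") for record in records]
print(values)  # ['1', '2']
```

Because each record is self-contained, it could be written out as one value per section in a later processing step. The simple string scan assumes sections are not nested and that the tags carry no attributes.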
On Sat, Feb 18, 2012 at 1:18 AM, Mohit Anchlia mohitanch...@gmail.com wrote:
On Tue,
I'm not sure what you mean by flat format here.
In my scenario, I have a file input.xml that looks like this.
<myfile>
  <section>
    <value>1</value>
  </section>
  <section>
    <value>2</value>
  </section>
</myfile>
input.xml is a plain text file. Not a sequence file. If I read it with the
Hi Mohit,
How many are too many for the namenode? We have around 100M files and add 100M
files every year.
The name-node stores file and block metadata in RAM.
This is an estimate of memory utilization per file and block:
Estimates show that the name-node uses fewer than 200 bytes to store
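For a sense of scale, the figures in this thread can be turned into a back-of-the-envelope upper bound. The sketch below makes two assumptions that are not stated in the thread: that the 200-byte figure applies per metadata object (one object per file plus one per block), and that each file occupies a single block. The function name and parameters are hypothetical.

```python
def namenode_heap_upper_bound(num_files, blocks_per_file=1, bytes_per_object=200):
    """Rough upper bound on name-node heap, in bytes.

    Assumes one metadata object per file plus one per block, each costing
    at most bytes_per_object bytes (assumption, see the lead-in).
    """
    objects = num_files * (1 + blocks_per_file)
    return objects * bytes_per_object


# 100M files today, as mentioned in the question.
current = namenode_heap_upper_bound(100_000_000)
print(current / 1e9)  # 40.0 (GB, decimal)

# After three more years of adding 100M files per year: 400M files.
later = namenode_heap_upper_bound(400_000_000)
print(later / 1e9)  # 160.0 (GB, decimal)
```

Files that span several blocks raise the object count, so the real figure depends heavily on the file-size distribution.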
I've used the Mahout XMLInputFormat. It is the right tool if you have an
XML file with one type of section repeated over and over again and want to
turn that into a sequence file where each repeated section is a value. I've
found it helpful as a preprocessing step for converting raw XML input into
On Sun, Feb 12, 2012 at 9:24 AM, W.P. McNeill bill...@gmail.com wrote:
I've used the Mahout XMLInputFormat. It is the right tool if you have an
XML file with one type of section repeated over and over again and want to
turn that into a sequence file where each repeated section is a value. I've