Re: Processing small xml files

2012-02-18 Thread Mohit Anchlia
On Fri, Feb 17, 2012 at 11:37 PM, Srinivas Surasani vas...@gmail.com wrote: Hi Mohit, You can use Pig for processing XML files. PiggyBank has a built-in load function to load the XML files. Also you can specify pig.maxCombinedSplitSize and pig.splitCombination for efficient processing. I

Re: Processing small xml files

2012-02-17 Thread Mohit Anchlia
On Tue, Feb 14, 2012 at 10:56 AM, W.P. McNeill bill...@gmail.com wrote: I'm not sure what you mean by flat format here. In my scenario, I have a file input.xml that looks like this. <myfile> <section> <value>1</value> </section> <section> <value>2</value> </section> </myfile>

Re: Processing small xml files

2012-02-17 Thread Srinivas Surasani
Hi Mohit, You can use Pig for processing XML files. PiggyBank has a built-in load function to load the XML files. Also you can specify pig.maxCombinedSplitSize and pig.splitCombination for efficient processing. On Sat, Feb 18, 2012 at 1:18 AM, Mohit Anchlia mohitanch...@gmail.com wrote: On Tue,

Re: Processing small xml files

2012-02-14 Thread W.P. McNeill
I'm not sure what you mean by flat format here. In my scenario, I have a file input.xml that looks like this. <myfile> <section> <value>1</value> </section> <section> <value>2</value> </section> </myfile> input.xml is a plain text file. Not a sequence file. If I read it with the

Re: Processing small xml files

2012-02-14 Thread Rohit
Hi Mohit, How many are too many for namenode? We have around 100M files and 100M files every year. The name-node stores file and block metadata in RAM. This is an estimate of memory utilization per file and block: Estimates show that the name-node uses fewer than 200 bytes to store
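
The figures in this message can be turned into a rough back-of-the-envelope calculation. The sketch below is only an illustration: it assumes the 200-byte figure applies to each metadata object (one per file and one per block), which the truncated preview does not confirm, and it assumes every small file occupies exactly one block. The function name and constants are hypothetical.

```python
# Rough name-node memory estimate for many small files.
# Assumptions (not confirmed by the message preview):
#   - 200 bytes is an upper bound per metadata object (file or block)
#   - each small file fits in a single block

BYTES_PER_OBJECT = 200              # upper bound quoted in the message
FILES_NOW = 100_000_000             # "around 100M files"
FILES_ADDED_PER_YEAR = 100_000_000  # "100M files every year"
BLOCKS_PER_FILE = 1                 # assumption: one block per small file


def namenode_heap_bytes(num_files: int,
                        blocks_per_file: int = BLOCKS_PER_FILE,
                        bytes_per_object: int = BYTES_PER_OBJECT) -> int:
    """Return an upper-bound estimate of metadata memory in bytes."""
    # One object for the file itself plus one object per block.
    objects = num_files * (1 + blocks_per_file)
    return objects * bytes_per_object


if __name__ == "__main__":
    for year in range(4):
        files = FILES_NOW + year * FILES_ADDED_PER_YEAR
        gigabytes = namenode_heap_bytes(files) / 1e9
        print(f"year {year}: {files:,} files -> about {gigabytes:.0f} GB of metadata")
```

Under these assumptions the estimate grows linearly with the file count, which is why packing many small files into fewer large ones reduces name-node pressure.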

Re: Processing small xml files

2012-02-12 Thread W.P. McNeill
I've used the Mahout XMLInputFormat. It is the right tool if you have an XML file with one type of section repeated over and over again and want to turn that into a Sequence file where each repeated section is a value. I've found it helpful as a preprocessing step for converting raw XML input into
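
The idea of treating each repeated section as one record can be sketched in a few lines. This is not Mahout's XMLInputFormat or its API; it is a plain-Python stand-in that scans for a start tag and its matching end tag and yields everything in between as one value. The function name `iter_sections` and the sample document are hypothetical, and the sketch assumes the tag carries no attributes and is not nested inside itself.

```python
from typing import Iterator


def iter_sections(text: str, tag: str) -> Iterator[str]:
    """Yield each <tag>...</tag> span found in text, in document order."""
    start_marker = f"<{tag}>"
    end_marker = f"</{tag}>"
    pos = 0
    while True:
        start = text.find(start_marker, pos)
        if start == -1:
            return  # no more sections
        end = text.find(end_marker, start + len(start_marker))
        if end == -1:
            return  # unterminated section: stop rather than emit a partial record
        end += len(end_marker)
        yield text[start:end]
        pos = end


# Hypothetical sample with one type of section repeated.
SAMPLE = (
    "<myfile>"
    "<section><value>1</value></section>"
    "<section><value>2</value></section>"
    "</myfile>"
)

records = list(iter_sections(SAMPLE, "section"))

if __name__ == "__main__":
    for index, record in enumerate(records):
        print(index, record)
```

Each yielded string corresponds to what would become one value in the output file; a real job would also have to handle sections that straddle input split boundaries, which this sketch ignores.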

Re: Processing small xml files

2012-02-12 Thread Mohit Anchlia
On Sun, Feb 12, 2012 at 9:24 AM, W.P. McNeill bill...@gmail.com wrote: I've used the Mahout XMLInputFormat. It is the right tool if you have an XML file with one type of section repeated over and over again and want to turn that into a Sequence file where each repeated section is a value. I've