On Fri, Feb 17, 2012 at 11:37 PM, Srinivas Surasani <vas...@gmail.com> wrote:
> Hi Mohit,
>
> You can use Pig for processing XML files. PiggyBank has a built-in load
> function for loading XML files.
> You can also set pig.maxCombinedSplitSize and
> pig.splitCombination for efficient processing.
>

I can't seem to find examples of how to do XML processing in Pig. Can you
please send me some pointers? Basically, I need to convert my XML to a more
structured format using Hadoop so that I can write it to a database.

>
> On Sat, Feb 18, 2012 at 1:18 AM, Mohit Anchlia <mohitanch...@gmail.com>
> wrote:
> > On Tue, Feb 14, 2012 at 10:56 AM, W.P. McNeill <bill...@gmail.com>
> > wrote:
> >
> >> I'm not sure what you mean by "flat format" here.
> >>
> >> In my scenario, I have a file input.xml that looks like this:
> >>
> >> <myfile>
> >>   <section>
> >>     <value>1</value>
> >>   </section>
> >>   <section>
> >>     <value>2</value>
> >>   </section>
> >> </myfile>
> >>
> >> input.xml is a plain text file, not a sequence file. If I read it with
> >> the XMLInputFormat, my mapper gets called with (key, value) pairs that
> >> look like this:
> >>
> >> (nnnn, <section><value>1</value></section>)
> >> (nnnn, <section><value>2</value></section>)
> >>
> >> where the keys are numerical offsets into the file. I then use this
> >> information to write a sequence file with these (key, value) pairs. So
> >> my Hadoop job that uses XMLInputFormat takes a text file as input and
> >> produces a sequence file as output.
> >>
> >> I don't know a rule of thumb for how many small files is too many. Maybe
> >> someone else on the list can chime in. I just know that when your
> >> throughput gets slow, that's one possible cause to investigate.
> >>
> >
> > I need to install Hadoop. Does this XMLInputFormat come as part of the
> > install? Can you please give me some pointers that would help me install
> > Hadoop and XMLInputFormat if necessary?
> >
>
> --
> -- Srinivas
> srini...@cloudwick.com
>
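
For anyone following the thread, here is a rough plain-Python sketch of the record splitting described above: each <section> block becomes one (offset, record) pair, and each record is then parsed into a simple row that could be written to a database. This is not the XMLInputFormat code and does not use Hadoop or Pig. The names split_records and to_row are made up for illustration, and the sketch assumes the tag has no attributes and is not nested inside itself.

```python
import xml.etree.ElementTree as ET

# Same structure as input.xml from the thread (indentation is an assumption).
SAMPLE = b"""<myfile>
  <section>
    <value>1</value>
  </section>
  <section>
    <value>2</value>
  </section>
</myfile>
"""


def split_records(data, tag):
    """Yield (byte_offset, record_bytes) for each <tag>...</tag> block.

    Hypothetical helper: a simplified stand-in for what an XML input
    format hands to a mapper. It does not handle attributes or nesting.
    """
    start_tag = b"<" + tag + b">"
    end_tag = b"</" + tag + b">"
    pos = 0
    while True:
        start = data.find(start_tag, pos)
        if start == -1:
            return
        end = data.find(end_tag, start)
        if end == -1:
            return  # unterminated record: stop rather than guess
        end += len(end_tag)
        yield start, data[start:end]
        pos = end


def to_row(record):
    """Parse one record and return the text of its <value> child."""
    return ET.fromstring(record).findtext("value")


# One (offset, record) pair per <section>, as in the mapper input above.
records = list(split_records(SAMPLE, b"section"))

# A flatter, row-like form that is easier to load into a database.
rows = [(offset, to_row(record)) for offset, record in records]

for offset, value in rows:
    print(offset, value)
```

The offsets play the same role as the nnnn keys in the example pairs: they only identify where each record started in the file, and the value is what you would actually store.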