I've used the Mahout XMLInputFormat. It is the right tool if you have an
XML file with one type of section repeated over and over again and want to
turn that into a sequence file where each repeated section is a value. I've
found it helpful as a preprocessing step for converting raw XML input into
something that can be handled by Hadoop jobs.
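
To make the "each repeated section is a value" idea concrete, here is a
rough Python sketch of that kind of record splitting. This is not Mahout's
code, only an illustration of the idea: scan for a start tag, find the
matching end tag, and emit everything in between as one record. The
function name and the <page> tag are made up for the example.

```python
def iter_sections(data: bytes, start_tag: bytes, end_tag: bytes):
    """Yield each start_tag...end_tag span found in data, in order."""
    pos = 0
    while True:
        start = data.find(start_tag, pos)
        if start == -1:
            return  # no more sections
        end = data.find(end_tag, start + len(start_tag))
        if end == -1:
            return  # unterminated section: drop it
        end += len(end_tag)
        yield data[start:end]
        pos = end


# Hypothetical input: a wrapper element with one repeated <page> section.
sample = (
    b"<?xml version='1.0'?><pages><page><id>1</id></page>\n"
    b"<page><id>2</id></page></pages>"
)
records = list(iter_sections(sample, b"<page>", b"</page>"))
```

Each element of records would become one value in the output sequence
file. Everything outside the repeated section (the XML declaration, the
wrapper element) is simply skipped.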

If you're worried about having lots of small files, and specifically about
overwhelming your namenode with them, the XMLInputFormat won't help with
that on its own. However, it may be possible to concatenate the small files
into larger files, then have a Hadoop job that uses XMLInputFormat to
transform the large files into sequence files.
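
The concatenation step could look something like the Python sketch below,
run on local files before they are loaded into HDFS. The file names and the
concatenate helper are hypothetical. Note that the merged file has several
root elements, so it is no longer well-formed XML. My understanding is that
XMLInputFormat only scans for the configured start and end tags, so that
should not matter, but treat that as an assumption and test it on a small
sample first.

```python
import shutil
import tempfile
from pathlib import Path


def concatenate(paths, out_path):
    """Append the raw bytes of each input file to out_path, newline-separated."""
    with open(out_path, "wb") as out:
        for p in paths:
            with open(p, "rb") as src:
                shutil.copyfileobj(src, out)
            out.write(b"\n")  # keep the files from running together


# Tiny demo with two throwaway files standing in for the small XML inputs.
tmp = Path(tempfile.mkdtemp())
(tmp / "a.xml").write_bytes(b"<doc><page>1</page></doc>")
(tmp / "b.xml").write_bytes(b"<doc><page>2</page></doc>")
merged = tmp / "merged.xml"
concatenate([tmp / "a.xml", tmp / "b.xml"], merged)
combined = merged.read_bytes()
```

Both <page> sections end up in the one merged file, which can then be fed
to the job that writes the sequence file.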
