Hello everyone,

I'm new to Hadoop and I'm trying to figure out how to design an M/R program
to parse a file and generate a PMML file as output.

What I would like to do is split a file by a keyword instead of by a given
number of lines, because the location of the split can change from time to
time.

I've been looking around and thought KeyValueTextInputFormat might be the
way to go, but I'm not finding any clear examples of how to use it, so I'm
not sure whether it is the right choice.

Here is a basic example of the input I'm working with:

[Input file info]
more info
more info
etc.
etc.
*Keyword*
different info
different info
*Keyword*
some more info

For the example above, each section can be processed separately from the
others. Within a section, however, the lines depend on each other, so they
have to stay together to generate a valid PMML file.
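
To make the grouping concrete, here is a rough plain-Java sketch (no Hadoop
involved) of the split I have in mind. The class and method names
(KeywordSplitter, splitByKeyword) are made up for illustration, and I'm
assuming the keyword always sits on a line by itself. What I can't work out
is how to get an input format to hand each of these sections to a mapper as
a single record.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain Java only: this illustrates the grouping I want, not a Hadoop API.
// KeywordSplitter and splitByKeyword are hypothetical names.
class KeywordSplitter {

    // Groups lines into sections. A line equal to the keyword (after
    // trimming whitespace) closes the current section and is itself dropped.
    static List<List<String>> splitByKeyword(List<String> lines, String keyword) {
        List<List<String>> sections = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().equals(keyword)) {
                sections.add(current);
                current = new ArrayList<>();
            } else {
                current.add(line);
            }
        }
        // The last section has no trailing keyword, so add it explicitly.
        sections.add(current);
        return sections;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
            "[Input file info]", "more info", "more info", "etc.", "etc.",
            "*Keyword*",
            "different info", "different info",
            "*Keyword*",
            "some more info");
        List<List<String>> sections = splitByKeyword(input, "*Keyword*");
        for (int i = 0; i < sections.size(); i++) {
            System.out.println("Section " + i + ": " + sections.get(i));
        }
    }
}
```

Each inner list is one unit of work that would need to reach a single map
call intact.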

Can anyone suggest what type of input format I should use?

Thanks for your time
Erik
