Hello everyone, I'm new to Hadoop and I'm trying to figure out how to design an M/R program to parse a file and generate a PMML file as output.
What I would like to do is split a file by a keyword instead of by a fixed number of lines, because the location of the split can change from time to time. I've been looking around and thought KeyValueTextInputFormat might be the way to go, but I can't find any clear examples of how to use it, so I'm not sure whether it is the right choice.

Here is a basic example of the input I'm working with:

[Input file info]
more info
more info
etc.
etc.
*Keyword*
different info
different info
*Keyword*
some more info

In the example above, each section can be generated independently of the others. However, within each section, the lines depend on each other to generate a valid PMML file.

Can anyone suggest what type of input format I should use?

Thanks for your time,
Erik
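
P.S. In case it helps make the question concrete, here is a rough sketch of the splitting behavior I'm after. It is plain Python, not Hadoop code, and it assumes the keyword always sits on a line by itself. The names split_sections and sample are just made up for illustration; what I'm looking for is an input format that hands each of these sections to a mapper as a single record.

```python
def split_sections(lines, keyword="*Keyword*"):
    """Group lines into sections, starting a new section at each keyword line."""
    sections = [[]]
    for line in lines:
        text = line.strip()
        if text == keyword:
            # The keyword line is only a separator; it is not kept as data.
            sections.append([])
        elif text:
            sections[-1].append(text)
    # Drop sections that ended up empty (e.g. a keyword on the first line).
    return [s for s in sections if s]


# Hypothetical sample mirroring the layout of my input file.
sample = """[Input file info]
more info
more info
*Keyword*
different info
different info
*Keyword*
some more info
"""

for index, section in enumerate(split_sections(sample.splitlines())):
    print(index, section)
```

Each inner list is one section whose lines have to stay together, since they depend on each other when building the PMML output.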