Hi all,
Apologies in advance if this question has been asked before; I had
trouble viewing the archives.
I am a GSoC student working on the Mahout project, and I was wondering
about generating SequenceFiles from CSV files. The CSV files are matrix
representations; each line corresponds to a row, and each
comma-separated value corresponds to a column. I know that
TextInputFormat splits its input on newlines, but the key it provides
is the byte offset of each line, not the line number. Ideally, I'd
like to build a Vector from each CSV row's elements and use the line
number as its key.
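To make the mismatch concrete, here is a small plain-Java sketch (no Hadoop or Mahout dependencies) that mimics the (byte offset, line) pairs as I understand TextInputFormat to produce them. The class and method names are made up for illustration, a double[] stands in for a Mahout Vector, and I'm assuming UTF-8 input with "\n" line endings.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

public class CsvOffsetSketch {
    // Hypothetical helper: maps each line's starting byte offset to its parsed values.
    public static Map<Long, double[]> rowsByOffset(String csv) {
        Map<Long, double[]> rows = new LinkedHashMap<>();
        long offset = 0;
        for (String line : csv.split("\n")) {
            if (!line.isEmpty()) {
                String[] fields = line.split(",");
                double[] values = new double[fields.length];
                for (int i = 0; i < fields.length; i++) {
                    values[i] = Double.parseDouble(fields[i].trim());
                }
                rows.put(offset, values);
            }
            // +1 accounts for the newline character that split() removed.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return rows;
    }

    public static void main(String[] args) {
        String csv = "1,2,3\n40,50,60\n7,8,9\n";
        for (Map.Entry<Long, double[]> e : rowsByOffset(csv).entrySet()) {
            System.out.println(e.getKey() + " -> " + Arrays.toString(e.getValue()));
        }
    }
}
```

For this three-row input the keys come out as 0, 6 and 15 rather than the row numbers 0, 1 and 2 that I would like to have.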
However, the byte offset could still be useful if, at the end of the
M/R job (or perhaps in the Reduce step?), I could sort all the Vectors
by their keys and use that ordering as the row order of the matrix. Is
this possible? If not, would I need to define an entirely new
InputFormat in order to create meaningful keys and corresponding
values? Or is there another strategy (counters?) that I could use to
map the line numbers of the CSV files to rows in the resulting matrix?
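In case it helps, this is roughly the sorting idea I have in mind, again as a plain-Java sketch rather than actual M/R code (the names are hypothetical). It assumes a single input file, so that byte offsets increase with line number; with several files I suspect offsets would collide and the key would also need to carry the file name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OffsetRankSketch {
    // Hypothetical helper: the rank of an offset among all sorted offsets is its row index.
    public static Map<Long, Integer> rowIndexByOffset(Collection<Long> offsets) {
        List<Long> sorted = new ArrayList<>(offsets);
        Collections.sort(sorted);
        Map<Long, Integer> index = new HashMap<>();
        for (int row = 0; row < sorted.size(); row++) {
            index.put(sorted.get(row), row);
        }
        return index;
    }

    public static void main(String[] args) {
        // Offsets arriving in arbitrary order, as they might from parallel mappers.
        List<Long> unordered = Arrays.asList(15L, 0L, 6L);
        Map<Long, Integer> index = rowIndexByOffset(unordered);
        for (Long offset : unordered) {
            System.out.println("offset " + offset + " -> row " + index.get(offset));
        }
    }
}
```

The catch is that this needs every key in one place before any row index can be assigned, which is why I'm unsure whether it fits naturally into a Reduce step.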
Thanks in advance!
Regards,
Shannon