Hi all,

Apologies in advance if this question has been asked before; I had trouble viewing the archives.

I am a GSoC student working on the Mahout project, and I am wondering how best to generate SequenceFiles from CSV files. The CSV files are matrix representations: each line corresponds to a row, and each comma-separated value corresponds to a column. I know that TextInputFormat splits the input on newlines, but the key it provides is the line's byte offset rather than its line number. Ideally, I'd like to generate a Vector from each CSV row's elements and use the line number as its key.
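To make the setup concrete, here is a minimal sketch in plain Java of the (byte offset, row) pairs described above. It does not use the Hadoop or Mahout APIs: the class `CsvOffsetRows`, its nested `OffsetRow` type, and the methods `parseRow` and `readRows` are hypothetical names for illustration, a `double[]` stands in for a Mahout Vector, and the sketch assumes UTF-8 input with "\n" line endings.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for what a mapper would see: each CSV line keyed by
// the byte offset at which it starts. Plain Java only; no Hadoop classes.
final class CsvOffsetRows {

    /** One parsed CSV row plus the byte offset where its line begins. */
    static final class OffsetRow {
        final long offset;
        final double[] values; // stand-in for a Mahout Vector

        OffsetRow(long offset, double[] values) {
            this.offset = offset;
            this.values = values;
        }
    }

    /** Splits one CSV line on commas and parses each field as a double. */
    static double[] parseRow(String line) {
        String[] fields = line.split(",");
        double[] values = new double[fields.length];
        for (int i = 0; i < fields.length; i++) {
            values[i] = Double.parseDouble(fields[i].trim());
        }
        return values;
    }

    /** Walks the text line by line, tracking the byte offset of each line. */
    static List<OffsetRow> readRows(String csv) {
        List<OffsetRow> rows = new ArrayList<>();
        long offset = 0;
        for (String line : csv.split("\n")) {
            if (!line.isEmpty()) {
                rows.add(new OffsetRow(offset, parseRow(line)));
            }
            // Advance past the line's bytes and its "\n" terminator (assumed).
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return rows;
    }

    public static void main(String[] args) {
        for (OffsetRow row : readRows("1,2,3\n4.5,6,7\n")) {
            System.out.println(row.offset + " -> " + row.values.length + " columns");
        }
    }
}
```

The keys are offsets rather than row numbers, which is exactly the mismatch I am asking about.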

However, the byte offset could still be useful if, at the end of the M/R task (or perhaps in the Reduce step?), I could sort all the Vectors by their keys and use that ordering as the row order of the matrix. Is this possible? If not, would I need to define an entirely new InputFormat in order to create meaningful keys and corresponding values? Or is there another strategy (counters?) that I could use to map the line numbers of the CSV files to rows in the resulting matrix?
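The sorting idea can be sketched in plain Java as follows. This is only an illustration of the ordering step, not a Hadoop job: `OffsetToRowIndex`, `rowIndices`, and `orderRows` are hypothetical names, `double[]` again stands in for a Vector, and the sketch assumes all rows come from a single file, since byte offsets are relative to the file and would collide across several input files.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: recovers matrix row order from byte-offset keys.
// Assumes every offset comes from the same single input file.
final class OffsetToRowIndex {

    /** Maps each byte offset to its rank, i.e. the 0-based row index. */
    static Map<Long, Integer> rowIndices(Collection<Long> offsets) {
        List<Long> sorted = new ArrayList<>(offsets);
        Collections.sort(sorted);
        Map<Long, Integer> index = new HashMap<>();
        for (int i = 0; i < sorted.size(); i++) {
            index.put(sorted.get(i), i);
        }
        return index;
    }

    /** Returns the rows in ascending byte-offset order, i.e. file order. */
    static List<double[]> orderRows(Map<Long, double[]> rowsByOffset) {
        // A TreeMap iterates its keys in ascending order.
        return new ArrayList<>(new TreeMap<Long, double[]>(rowsByOffset).values());
    }

    public static void main(String[] args) {
        Map<Long, double[]> rows = new HashMap<>();
        rows.put(6L, new double[] {4.5, 6, 7});
        rows.put(0L, new double[] {1, 2, 3});
        Map<Long, Integer> index = rowIndices(rows.keySet());
        for (Map.Entry<Long, Integer> e : index.entrySet()) {
            System.out.println("offset " + e.getKey() + " -> row " + e.getValue());
        }
    }
}
```

In this sketch the whole key set is sorted in one place, which is the part I am unsure how to express in an M/R task.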

Thanks in advance!

Regards,
Shannon
