These files follow the libsvm format: each line is a record, the first column is the label, and each remaining field has the form offset:value, where offset is the position in the feature vector and value is the value of that input feature.
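As a rough illustration of that layout, here is a minimal Python sketch that parses one record into a label and a sparse offset-to-value mapping. The function name `parse_libsvm_line` and the sample record are made up for this example and are not taken from the Spark data files. Offsets are kept exactly as written in the file (libsvm files conventionally use one-based indices).

```python
def parse_libsvm_line(line):
    """Parse one 'label offset:value ...' record into (label, {offset: value})."""
    parts = line.split()
    label = float(parts[0])           # first column is the label
    features = {}
    for field in parts[1:]:           # remaining columns are offset:value pairs
        offset, value = field.split(":", 1)
        features[int(offset)] = float(value)
    return label, features


# Hypothetical record, for illustration only.
label, features = parse_libsvm_line("1 3:0.5 7:2.0 10:1.0")
print(label, features)
```

Only the non-zero entries appear in the mapping, which is where the space saving for sparse data comes from.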
This format is a fairly efficient representation for sparse data, but it can double (or more) the storage requirements for dense data, since every value is stored together with its offset.

- Evan

> On Jun 22, 2014, at 3:35 PM, Justin Yip <yipjus...@gmail.com> wrote:
>
> Hello,
>
> I am looking into a couple of MLlib data files in
> https://github.com/apache/spark/tree/master/data/mllib. But I cannot find any
> explanation for these files. Does anyone know if they are documented?
>
> Thanks.
>
> Justin