As Sean mentions, if you can convert the data to the standard format, that's probably a good idea. If you'd rather read the data as-is, you could write your own version of loadLibSVMFile: a loader function very similar to the existing one with a few characters removed, namely the conversion from one-based to zero-based indices:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala#L81

You will also likely need to change the logic that determines the number of features (currently line 95).

On Tue, Jul 8, 2014 at 12:22 AM, Sean Owen <so...@cloudera.com> wrote:

> On Tue, Jul 8, 2014 at 7:29 AM, Lizhengbing (bing, BIPA) <zhengbing...@huawei.com> wrote:
>
> > 1) I download the imdb data from
> > http://komarix.org/ac/ds/Blanc__Mel.txt.bz2 and use this data to test
> > LBFGS
> > 2) I find the imdb data are zero-based-index data
>
> Since the method is for parsing the LIBSVM format, and its labels are
> always 1-indexed IIUC, I don't think it would make sense to read 0-indexed
> labels. It sounds like that input is not properly formatted, unless anyone
> knows to the contrary?
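
To make the custom-loader suggestion a bit more concrete, here is a rough sketch of the parsing step for zero-based input. It is plain Scala with no Spark dependency, and the names ZeroBasedLibSVM, parseLine, parseLines and numFeatures are made up for illustration; they are not part of MLlib. It assumes each line looks like "label index:value index:value ..." with indices already zero-based, so the indices are kept unchanged and the feature count is taken as the largest index plus one.

```scala
// Hypothetical helper, not part of MLlib: parses LIBSVM-style lines whose
// feature indices are already zero-based, so no "- 1" shift is applied.
object ZeroBasedLibSVM {
  type Parsed = (Double, Array[Int], Array[Double])

  // Parse one non-empty, non-comment line into (label, indices, values).
  def parseLine(line: String): Parsed = {
    val items = line.trim.split("\\s+")
    val label = items.head.toDouble
    val (indices, values) = items.tail.map { item =>
      val indexAndValue = item.split(':')
      // Keep the index as-is: the input is assumed to be zero-based.
      (indexAndValue(0).toInt, indexAndValue(1).toDouble)
    }.unzip
    (label, indices, values)
  }

  // Skip blank lines and lines starting with '#', then parse the rest.
  def parseLines(lines: Seq[String]): Seq[Parsed] =
    lines.map(_.trim)
      .filter(line => line.nonEmpty && !line.startsWith("#"))
      .map(parseLine)

  // With zero-based indices, the feature count is the largest index plus one.
  def numFeatures(parsed: Seq[Parsed]): Int = {
    val allIndices = parsed.flatMap { case (_, indices, _) => indices }
    if (allIndices.isEmpty) 0 else allIndices.max + 1
  }
}
```

In a Spark job, the same parseLine could presumably be applied inside sc.textFile(path).map(...), with each result turned into a LabeledPoint via Vectors.sparse(numFeatures, indices, values). I have not run this against the imdb file, so treat it as a starting point.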