Hi All,

We have a fairly large amount of sparse data. I was following these
instructions in the manual:

Sparse data

It is very common in practice to have sparse training data. MLlib
supports reading training examples stored in LIBSVM format, which is the
default format used by LIBSVM and LIBLINEAR. It is a text format in which each
line represents a labeled sparse feature vector using the following
format:

label index1:value1 index2:value2 ...

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD

val examples: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

I believe that I have formatted my data as per the required LIBSVM format. Here
is a snippet of it:

1 122:1 1693:1 1771:1 1974:1 2334:1 2378:1 2562:1
1 118:1 1389:1 1413:1 1454:1 1780:1 2562:1 5051:1 5417:1 5548:1 5798:1 5862:1
0 150:1 214:1 468:1 1013:1 1078:1 1092:1 1117:1 1489:1 1546:1 1630:1 1635:1 1827:1 2024:1 2215:1 2478:1 2761:1 5985:1 6115:1 6218:1
0 251:1 5578:1

However, when I use MLUtils.loadLibSVMFile(sc, "path-to-data-file"), I get the
following error message in my spark-shell. Can someone please point me in the
right direction?

java.lang.NumberFormatException: For input string: "150:1 214:1
468:1 1013:1 1078:1 1092:1 1117:1 1489:1
1546:1 1630:1 1635:1 1827:1 2024:1 2215:1
2478:1 2761:1 5985:1 6115:1 6218:1"
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1241)
at java.lang.Double.parseDouble(Double.java:540)
at scala.collection.immutable.StringLike$class.toDouble(StringLike.scala:232)
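
In case it helps anyone reproduce this, here is a minimal sketch of a check that could be run over the data file outside of MLlib. It tests each line against the "label index1:value1 index2:value2 ..." layout quoted above. The object name LibSvmLineCheck and the method problem are hypothetical, and the single-space separator is my assumption about what the loader expects, not something I have confirmed in the MLlib source.

```scala
import scala.util.Try

// Hypothetical helper, not part of MLlib. Assumes tokens are separated by
// exactly one space, which is a guess about the loader's expectations.
object LibSvmLineCheck {

  /** Returns None if the line looks well-formed, otherwise a short reason. */
  def problem(line: String): Option[String] = {
    val tokens = line.trim.split(' ')
    if (tokens.isEmpty || tokens.head.isEmpty) {
      Some("empty line")
    } else if (Try(tokens.head.toDouble).isFailure) {
      // A tab or other separator after the label would end up in this token.
      Some(s"label is not a number: '${tokens.head}'")
    } else {
      // Report the first feature token that is not of the form index:value.
      tokens.tail.collectFirst {
        case t if !isIndexValue(t) => s"bad feature token: '${t}'"
      }
    }
  }

  // True when the token is exactly "<int>:<double>".
  private def isIndexValue(token: String): Boolean = token.split(':') match {
    case Array(i, v) => Try(i.toInt).isSuccess && Try(v.toDouble).isSuccess
    case _           => false
  }
}
```

Applied to each line of the file (for example, lines read with scala.io.Source.fromFile("path-to-data-file").getLines()), this would flag lines where the label is followed by a tab or where tokens are separated by more than one space. Whether either of those is what triggers the exception above is something I have not verified.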