Just found that you can specify the number of features when loading a libsvm source:

val df = spark.read.option("numFeatures", "100").format("libsvm")
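For completeness, the full read looks roughly like this (the path and the feature count are placeholders, and `spark` is the usual spark-shell session):

val df = spark.read
  .format("libsvm")
  .option("numFeatures", "144109")   // placeholder: declare the true dimensionality
  .load("/path/to/training_data")    // placeholder path
// Every features vector now has the declared size, even if the file's largest
// index is smaller than numFeatures - 1.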
Liang-Chi Hsieh wrote
> As the libsvm format can't specify the number of features, and NaiveBayes
> doesn't appear to have such a parameter, the number of features inferred
> from the data files can be inconsistent if your training/testing data is
> sparse.
>
> We may need to fix this.
>
> Before a fix goes into NaiveBayes, the current workaround is to align the
> number of features between the training and testing data before fitting
> the model.
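That alignment can be done with the numFeatures option above. A rough sketch (the paths and the 144109 value are placeholders, not from this thread; `spark` is the spark-shell session):

import org.apache.spark.ml.classification.NaiveBayes

// Declare the same dimensionality for both datasets so the model's feature
// matrix and the vectors it later scores always agree in size.
val numFeatures = "144109"   // placeholder: must cover the largest feature index in any file

val training = spark.read.format("libsvm")
  .option("numFeatures", numFeatures)
  .load("/path/to/training_data")      // placeholder path

val toScore = spark.read.format("libsvm")
  .option("numFeatures", numFeatures)
  .load("/path/to/data_to_predict")    // placeholder path

val model = new NaiveBayes().fit(training)
model.transform(toScore).show()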
> jinhong lu wrote
>> After training the model, I got a result that looks like this:
>>
>> scala> predictionResult.show()
>> +-----+--------------------+--------------------+--------------------+----------+
>> |label|            features|       rawPrediction|         probability|prediction|
>> +-----+--------------------+--------------------+--------------------+----------+
>> |  0.0|(144109,[100],[2.0])|[-12.246737725034...|[0.96061209556737...|       0.0|
>> |  0.0|(144109,[100],[2.0])|[-12.246737725034...|[0.96061209556737...|       0.0|
>> |  0.0|(144109,[100],[24...|[-146.81612388602...|[9.73704654529197...|       1.0|
>> +-----+--------------------+--------------------+--------------------+----------+
>>
>> And then I transform() the data with this code:
>>
>> import org.apache.spark.ml.linalg.Vectors
>> import org.apache.spark.ml.linalg.Vector
>> import scala.collection.mutable
>>
>> def lineToVector(line: String): Vector = {
>>   val seq = new mutable.Queue[(Int, Double)]
>>   val content = line.split(" ")
>>   for (s <- content) {
>>     val index = s.split(":")(0).toInt
>>     val value = s.split(":")(1).toDouble
>>     seq += ((index, value))
>>   }
>>   return Vectors.sparse(144109, seq)
>> }
>>
>> val df = sc.sequenceFile[org.apache.hadoop.io.LongWritable, org.apache.hadoop.io.Text]("/data/gamein/gameall_sdc/wh/gameall.db/edt_udid_label_format/ds=20170312/001006_0")
>>   .map(line => line._2)
>>   .map(line => (line.toString.split("\t")(0), lineToVector(line.toString.split("\t")(1))))
>>   .toDF("udid", "features")
>> val predictionResult = model.transform(df)
>> predictionResult.show()
>>
>> But I got an error like this:
>>
>> Caused by: java.lang.IllegalArgumentException: requirement failed: You may not write an element to index 804201 because the declared size of your vector is 144109
>>   at scala.Predef$.require(Predef.scala:224)
>>   at org.apache.spark.ml.linalg.Vectors$.sparse(Vectors.scala:219)
>>   at lineToVector(<console>:55)
>>   at $anonfun$4.apply(<console>:50)
>>   at $anonfun$4.apply(<console>:50)
>>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(generated.java:84)
>>   at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>>   at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>>   at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>>   at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>>
>> So I changed
>>
>>   return Vectors.sparse(144109, seq)
>>
>> to
>>
>>   return Vectors.sparse(804202, seq)
>>
>> Another error occurs:
>>
>> Caused by: java.lang.IllegalArgumentException: requirement failed: The columns of A don't match the number of elements of x. A: 144109, x: 804202
>>   at scala.Predef$.require(Predef.scala:224)
>>   at org.apache.spark.ml.linalg.BLAS$.gemv(BLAS.scala:521)
>>   at org.apache.spark.ml.linalg.Matrix$class.multiply(Matrices.scala:110)
>>   at org.apache.spark.ml.linalg.DenseMatrix.multiply(Matrices.scala:176)
>>
>> What should I do?
>>
>>> On 13 Mar 2017, at 16:31, jinhong lu <lujinhong2@> wrote:
>>>
>>> Hi all,
>>>
>>> I have this training data:
>>>
>>> 0 31607:17
>>> 0 111905:36
>>> 0 109:3 506:41 1509:1 2106:4 5309:1 7209:5 8406:1 27108:1 27709:1 30209:8 36109:20 41408:1 42309:1 46509:1 47709:5 57809:1 58009:1 58709:2 112109:4 123305:48 142509:1
>>> 0 407:14 2905:2 5209:2 6509:2 6909:2 14509:2 18507:10
>>> 0 604:3 3505:9 6401:3 6503:2 6505:3 7809:8 10509:3 12109:3 15207:19 31607:19
>>> 0 19109:7 29705:4 123305:32
>>> 0 15309:1 43005:1 108509:1
>>> 1 604:1 6401:1 6503:1 15207:4 31607:40
>>> 0 1807:19
>>> 0 301:14 501:1 1502:14 2507:12 123305:4
>>> 0 607:14 19109:460 123305:448
>>> 0 5406:14 7209:4 10509:3 19109:6 24706:10 26106:4 31409:1 123305:48 128209:1
>>> 1 1606:1 2306:3 3905:19 4408:3 4506:8 8707:3 19109:50 24809:1 26509:2 27709:2 56509:8 122705:62 123305:31 124005:2
>>>
>>> And then I train the model with Spark:
>>>
>>> import org.apache.spark.ml.classification.NaiveBayes
>>> import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
>>> import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
>>> import org.apache.spark.sql.SparkSession
>>>
>>> val spark = SparkSession.builder.appName("NaiveBayesExample").getOrCreate()
>>> val data = spark.read.format("libsvm").load("/tmp/ljhn1829/aplus/training_data3")
>>> val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3), seed = 1234L)
>>> // val model = new NaiveBayes().fit(trainingData)
>>> val model = new NaiveBayes().setThresholds(Array(10.0, 1.0)).fit(trainingData)
>>> val predictions = model.transform(testData)
>>> predictions.show()
>>>
>>> OK, I have got my model from the code above, but how can I use this model to predict the classification of other data like this:
>>>
>>> ID1 509:2 5102:4 25909:1 31709:4 121905:19
>>> ID2 800201:1
>>> ID3 116005:4
>>> ID4 800201:1
>>> ID5 19109:1 21708:1 23208:1 49809:1 88609:1
>>> ID6 800201:1
>>> ID7 43505:7 106405:7
>>>
>>> I know I can use the transform() method, but how do I construct the parameter for the transform() method?
>>>
>>> Thanks,
>>> lujinhong
>>
>> Thanks,
>> lujinhong

-----
Liang-Chi Hsieh | @viirya
Spark Technology Center
http://www.spark.tc/
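As for the original question about building the transform() input by hand: a sketch along the lines of the earlier lineToVector, but sized from the fitted model (the path, the tab-separated "ID<TAB>features" layout, the plain text file, and the spark-shell context with `sc` and `model` in scope are assumptions on my part, not confirmed by this thread):

import org.apache.spark.ml.linalg.{Vector, Vectors}

// Size the vectors from the fitted model so transform() never sees a mismatch.
val n = model.numFeatures            // e.g. 144109 for the model trained above

def lineToVector(line: String): Vector = {
  val elems = line.trim.split(" ").map { s =>
    val Array(index, value) = s.split(":")
    // libsvm indices are 1-based; Spark's libsvm reader shifts them to 0-based,
    // so do the same here to stay consistent with the training data.
    (index.toInt - 1, value.toDouble)
  }
  // Indices at or beyond n cannot be represented in an n-sized vector; drop
  // them here (or retrain with a numFeatures large enough to cover them).
  Vectors.sparse(n, elems.filter(_._1 < n).toSeq)
}

val toPredict = sc.textFile("/path/to/id_data")        // placeholder path
  .map(_.split("\t"))                                  // assumes "ID<TAB>features"
  .map(parts => (parts(0), lineToVector(parts(1))))
  .toDF("id", "features")                              // spark-shell implicits

model.transform(toPredict).show()

That is why the 804202-sized vectors failed above: the NaiveBayes model's theta matrix is fixed at the training dimensionality (144109 columns), so the scoring vectors must have exactly that size.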