Hi Jinhong,
Based on the error message, your second collection of vectors has a dimension of 804202, while the dimension of your training vectors was 144109. So please make sure your test dataset is of the same dimension as the training data. From the test dataset you posted, the vector dimension is much larger than 144109 (804202?).

Regards,
Yuhao

2017-03-13 4:59 GMT-07:00 jinhong lu <lujinho...@gmail.com>:
> Anyone help?
>
> On March 13, 2017, at 19:38, jinhong lu <lujinho...@gmail.com> wrote:
> >
> > After training the model, I got a result that looks like this:
> >
> > scala> predictionResult.show()
> > +-----+--------------------+--------------------+--------------------+----------+
> > |label|            features|       rawPrediction|         probability|prediction|
> > +-----+--------------------+--------------------+--------------------+----------+
> > |  0.0|(144109,[100],[2.0])|[-12.246737725034...|[0.96061209556737...|       0.0|
> > |  0.0|(144109,[100],[2.0])|[-12.246737725034...|[0.96061209556737...|       0.0|
> > |  0.0|(144109,[100],[24...|[-146.81612388602...|[9.73704654529197...|       1.0|
> >
> > And then I transform() the data with this code:
> >
> > import org.apache.spark.ml.linalg.Vectors
> > import org.apache.spark.ml.linalg.Vector
> > import scala.collection.mutable
> >
> > def lineToVector(line: String): Vector = {
> >   val seq = new mutable.Queue[(Int, Double)]
> >   val content = line.split(" ")
> >   for (s <- content) {
> >     val index = s.split(":")(0).toInt
> >     val value = s.split(":")(1).toDouble
> >     seq += ((index, value))
> >   }
> >   return Vectors.sparse(144109, seq)
> > }
> >
> > val df = sc.sequenceFile[org.apache.hadoop.io.LongWritable, org.apache.hadoop.io.Text]("/data/gamein/gameall_sdc/wh/gameall.db/edt_udid_label_format/ds=20170312/001006_0").map(line => line._2).map(line => (line.toString.split("\t")(0), lineToVector(line.toString.split("\t")(1)))).toDF("udid", "features")
> > val predictionResult = model.transform(df)
> > predictionResult.show()
> >
> > But I got an error like this:
> >
> > Caused by: java.lang.IllegalArgumentException: requirement failed: You may not write an element to index 804201 because the declared size of your vector is 144109
> >     at scala.Predef$.require(Predef.scala:224)
> >     at org.apache.spark.ml.linalg.Vectors$.sparse(Vectors.scala:219)
> >     at lineToVector(<console>:55)
> >     at $anonfun$4.apply(<console>:50)
> >     at $anonfun$4.apply(<console>:50)
> >     at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> >     at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> >     at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> >     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(generated.java:84)
> >     at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> >     at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> >     at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
> >     at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
> >     at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
> >     at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
> >
> > So I changed
> >
> > return Vectors.sparse(144109, seq)
> >
> > to
> >
> > return Vectors.sparse(804202, seq)
> >
> > Another error occurs:
> >
> > Caused by: java.lang.IllegalArgumentException: requirement failed: The columns of A don't match the number of elements of x. A: 144109, x: 804202
> >     at scala.Predef$.require(Predef.scala:224)
> >     at org.apache.spark.ml.linalg.BLAS$.gemv(BLAS.scala:521)
> >     at org.apache.spark.ml.linalg.Matrix$class.multiply(Matrices.scala:110)
> >     at org.apache.spark.ml.linalg.DenseMatrix.multiply(Matrices.scala:176)
> >
> > What should I do?
> >
> >> On March 13, 2017, at 16:31, jinhong lu <lujinho...@gmail.com> wrote:
> >>
> >> Hi, all:
> >>
> >> I have this training data:
> >>
> >> 0 31607:17
> >> 0 111905:36
> >> 0 109:3 506:41 1509:1 2106:4 5309:1 7209:5 8406:1 27108:1 27709:1 30209:8 36109:20 41408:1 42309:1 46509:1 47709:5 57809:1 58009:1 58709:2 112109:4 123305:48 142509:1
> >> 0 407:14 2905:2 5209:2 6509:2 6909:2 14509:2 18507:10
> >> 0 604:3 3505:9 6401:3 6503:2 6505:3 7809:8 10509:3 12109:3 15207:19 31607:19
> >> 0 19109:7 29705:4 123305:32
> >> 0 15309:1 43005:1 108509:1
> >> 1 604:1 6401:1 6503:1 15207:4 31607:40
> >> 0 1807:19
> >> 0 301:14 501:1 1502:14 2507:12 123305:4
> >> 0 607:14 19109:460 123305:448
> >> 0 5406:14 7209:4 10509:3 19109:6 24706:10 26106:4 31409:1 123305:48 128209:1
> >> 1 1606:1 2306:3 3905:19 4408:3 4506:8 8707:3 19109:50 24809:1 26509:2 27709:2 56509:8 122705:62 123305:31 124005:2
> >>
> >> And then I train the model with Spark:
> >>
> >> import org.apache.spark.ml.classification.NaiveBayes
> >> import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
> >> import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
> >> import org.apache.spark.sql.SparkSession
> >>
> >> val spark = SparkSession.builder.appName("NaiveBayesExample").getOrCreate()
> >> val data = spark.read.format("libsvm").load("/tmp/ljhn1829/aplus/training_data3")
> >> val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3), seed = 1234L)
> >> //val model = new NaiveBayes().fit(trainingData)
> >> val model = new NaiveBayes().setThresholds(Array(10.0, 1.0)).fit(trainingData)
> >> val predictions = model.transform(testData)
> >> predictions.show()
> >>
> >> OK, I have got my model from the code above, but how can I use this model to predict the classification of other data like this:
> >>
> >> ID1 509:2 5102:4 25909:1 31709:4 121905:19
> >> ID2 800201:1
> >> ID3 116005:4
> >> ID4 800201:1
> >> ID5 19109:1 21708:1 23208:1 49809:1 88609:1
> >> ID6 800201:1
> >> ID7 43505:7 106405:7
> >>
> >> I know I can use the transform() method, but how do I construct the parameter for the transform() method?
> >>
> >> Thanks,
> >> lujinhong
> >
> > Thanks,
> > lujinhong
>
> Thanks,
> lujinhong
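
For reference, a minimal sketch of one way to keep the prediction vectors at the training dimension, assuming the same spark-shell session as in the quoted code (so sc, spark.implicits._ and the fitted NaiveBayes model are in scope). lineToVector here is a reworked version of the quoted helper and the variable names are only illustrative; the dimension is read from model.numFeatures instead of being hard-coded, and any feature index the model has never seen is dropped before the vector is built:

import org.apache.spark.ml.linalg.{Vector, Vectors}

// Take the expected dimension from the fitted model (144109 here)
// instead of hard-coding it.
val numFeatures = model.numFeatures

def lineToVector(line: String, dim: Int): Vector = {
  val pairs = line.split(" ").map { s =>
    val Array(index, value) = s.split(":")
    (index.toInt, value.toDouble)
  }
  // Vectors.sparse rejects indices >= dim (the first error in the thread),
  // and declaring a larger size no longer matches the model's weight
  // matrix (the second error), so out-of-range indices are dropped here.
  Vectors.sparse(dim, pairs.filter { case (i, _) => i < dim })
}

val testDf = sc.sequenceFile[org.apache.hadoop.io.LongWritable, org.apache.hadoop.io.Text]("/data/gamein/gameall_sdc/wh/gameall.db/edt_udid_label_format/ds=20170312/001006_0")
  .map(_._2.toString)
  .map { line =>
    val fields = line.split("\t")
    (fields(0), lineToVector(fields(1), numFeatures))
  }
  .toDF("udid", "features")

model.transform(testDf).show()

Dropping the out-of-range indices only makes sense if those features genuinely carry nothing the model could use; if indices like 800201 are real features, the cleaner fix is to generate the training and prediction files from one shared feature dictionary so both sides agree on the dimension.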