Hi Jinhong,

Based on the error message, your second collection of vectors has a
dimension of 804202, while the dimension of your training vectors
was 144109. Please make sure your test dataset has the same dimension
as the training data.
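
You can confirm the dimension the fitted model expects directly from the
model itself, for example:

        scala> model.numFeatures   // should report 144109 for your model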

From the test dataset you posted, the vector dimension is much larger
than 144109 (804202?).
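
If those larger indices are features that never appeared in training, one
workaround is to keep the declared size at the training dimension and drop
any out-of-range index when you build the test vectors. Here is a rough,
untested sketch of your lineToVector with that guard (it assumes the
training dimension of 144109 and simply ignores features the model has
never seen):

        import org.apache.spark.ml.linalg.{Vector, Vectors}
        import scala.collection.mutable

        def lineToVector(line: String, numFeatures: Int = 144109): Vector = {
          val seq = new mutable.ArrayBuffer[(Int, Double)]
          for (s <- line.split(" ")) {
            val index = s.split(":")(0).toInt
            val value = s.split(":")(1).toDouble
            // skip feature indices the model was not trained on
            if (index < numFeatures) seq += ((index, value))
          }
          Vectors.sparse(numFeatures, seq)
        }

The cleaner long-term fix is to generate the training file and the
prediction file from the same feature-index mapping, so both sides agree
on the vector size up front.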

Regards,
Yuhao


2017-03-13 4:59 GMT-07:00 jinhong lu <lujinho...@gmail.com>:

> Anyone help?
>
> > On 13 Mar 2017, at 19:38, jinhong lu <lujinho...@gmail.com> wrote:
> >
> > After training the model, I got results that look like this:
> >
> >
> >       scala> predictionResult.show()
> >       +-----+--------------------+--------------------+--------------------+----------+
> >       |label|            features|       rawPrediction|         probability|prediction|
> >       +-----+--------------------+--------------------+--------------------+----------+
> >       |  0.0|(144109,[100],[2.0])|[-12.246737725034...|[0.96061209556737...|       0.0|
> >       |  0.0|(144109,[100],[2.0])|[-12.246737725034...|[0.96061209556737...|       0.0|
> >       |  0.0|(144109,[100],[24...|[-146.81612388602...|[9.73704654529197...|       1.0|
> >
> > And then I transform() the data with this code:
> >
> >       import org.apache.spark.ml.linalg.Vectors
> >       import org.apache.spark.ml.linalg.Vector
> >       import scala.collection.mutable
> >
> >          def lineToVector(line:String ):Vector={
> >           val seq = new mutable.Queue[(Int,Double)]
> >           val content = line.split(" ");
> >           for( s <- content){
> >             val index = s.split(":")(0).toInt
> >             val value = s.split(":")(1).toDouble
> >              seq += ((index,value))
> >           }
> >           return Vectors.sparse(144109, seq)
> >         }
> >
> >        val df = sc.sequenceFile[org.apache.hadoop.io.LongWritable, org.apache.hadoop.io.Text](
> >            "/data/gamein/gameall_sdc/wh/gameall.db/edt_udid_label_format/ds=20170312/001006_0")
> >          .map(line => line._2)
> >          .map(line => (line.toString.split("\t")(0), lineToVector(line.toString.split("\t")(1))))
> >          .toDF("udid", "features")
> >        val predictionResult = model.transform(df)
> >        predictionResult.show()
> >
> >
> > But I got an error that looks like this:
> >
> > Caused by: java.lang.IllegalArgumentException: requirement failed: You may not write an element to index 804201 because the declared size of your vector is 144109
> >  at scala.Predef$.require(Predef.scala:224)
> >  at org.apache.spark.ml.linalg.Vectors$.sparse(Vectors.scala:219)
> >  at lineToVector(<console>:55)
> >  at $anonfun$4.apply(<console>:50)
> >  at $anonfun$4.apply(<console>:50)
> >  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> >  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> >  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> >  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(generated.java:84)
> >  at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> >  at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> >  at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
> >  at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
> >  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
> >  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
> >
> > So I changed
> >
> >       return Vectors.sparse(144109, seq)
> >
> > to
> >
> >       return Vectors.sparse(804202, seq)
> >
> > Another error occurred:
> >
> >       Caused by: java.lang.IllegalArgumentException: requirement failed: The columns of A don't match the number of elements of x. A: 144109, x: 804202
> >         at scala.Predef$.require(Predef.scala:224)
> >         at org.apache.spark.ml.linalg.BLAS$.gemv(BLAS.scala:521)
> >         at org.apache.spark.ml.linalg.Matrix$class.multiply(Matrices.scala:110)
> >         at org.apache.spark.ml.linalg.DenseMatrix.multiply(Matrices.scala:176)
> >
> > What should I do?
> >> On 13 Mar 2017, at 16:31, jinhong lu <lujinho...@gmail.com> wrote:
> >>
> >> Hi, all:
> >>
> >> I have this training data:
> >>
> >>      0 31607:17
> >>      0 111905:36
> >>      0 109:3 506:41 1509:1 2106:4 5309:1 7209:5 8406:1 27108:1 27709:1 30209:8 36109:20 41408:1 42309:1 46509:1 47709:5 57809:1 58009:1 58709:2 112109:4 123305:48 142509:1
> >>      0 407:14 2905:2 5209:2 6509:2 6909:2 14509:2 18507:10
> >>      0 604:3 3505:9 6401:3 6503:2 6505:3 7809:8 10509:3 12109:3 15207:19 31607:19
> >>      0 19109:7 29705:4 123305:32
> >>      0 15309:1 43005:1 108509:1
> >>      1 604:1 6401:1 6503:1 15207:4 31607:40
> >>      0 1807:19
> >>      0 301:14 501:1 1502:14 2507:12 123305:4
> >>      0 607:14 19109:460 123305:448
> >>      0 5406:14 7209:4 10509:3 19109:6 24706:10 26106:4 31409:1 123305:48 128209:1
> >>      1 1606:1 2306:3 3905:19 4408:3 4506:8 8707:3 19109:50 24809:1 26509:2 27709:2 56509:8 122705:62 123305:31 124005:2
> >>
> >> And then I train the model with Spark:
> >>
> >>      import org.apache.spark.ml.classification.NaiveBayes
> >>      import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
> >>      import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
> >>      import org.apache.spark.sql.SparkSession
> >>
> >>      val spark = SparkSession.builder.appName("NaiveBayesExample").getOrCreate()
> >>      val data = spark.read.format("libsvm").load("/tmp/ljhn1829/aplus/training_data3")
> >>      val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3), seed = 1234L)
> >>      //val model = new NaiveBayes().fit(trainingData)
> >>      val model = new NaiveBayes().setThresholds(Array(10.0, 1.0)).fit(trainingData)
> >>      val predictions = model.transform(testData)
> >>      predictions.show()
> >>
> >>
> >> OK, I have got my model from the code above, but how can I use this
> >> model to predict the classification of other data like this:
> >>
> >>      ID1     509:2 5102:4 25909:1 31709:4 121905:19
> >>      ID2     800201:1
> >>      ID3     116005:4
> >>      ID4     800201:1
> >>      ID5     19109:1  21708:1 23208:1 49809:1 88609:1
> >>      ID6     800201:1
> >>      ID7     43505:7 106405:7
> >>
> >> I know I can use the transform() method, but how do I construct the
> >> parameter for the transform() method?
> >>
> >>
> >>
> >>
> >>
> >> Thanks,
> >> lujinhong
> >>
> >
> > Thanks,
> > lujinhong
> >
>
> Thanks,
> lujinhong
>
>
>
