Thank you all.

On 22 Apr 2015 04:29, "Xiangrui Meng" <men...@gmail.com> wrote:
> SchemaRDD subclasses RDD in 1.2, but DataFrame is no longer an RDD in 1.3.
> We should allow DataFrames in ALS.train. I will submit a patch. You can use
> `ALS.train(training.rdd, ...)` for now as a workaround.
>
> -Xiangrui
>
> On Tue, Apr 21, 2015 at 10:51 AM, Joseph Bradley <jos...@databricks.com> wrote:
> > Hi Ayan,
> >
> > If you want to use DataFrames, then you should use the Pipelines API
> > (org.apache.spark.ml.*), which takes DataFrames:
> > http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.recommendation.ALS
> >
> > In the examples/ directory for ml/, you can find a MovieLensALS example.
> >
> > Good luck!
> > Joseph
> >
> > On Tue, Apr 21, 2015 at 4:58 AM, ayan guha <guha.a...@gmail.com> wrote:
> >> Hi,
> >>
> >> I am getting an error in mllib's ALS.train function when passing a
> >> DataFrame (do I need to convert the DF to an RDD?).
> >>
> >> Code:
> >>
> >>     training = ssc.sql("select userId,movieId,rating from ratings where partitionKey < 6").cache()
> >>     print type(training)
> >>     model = ALS.train(training, rank, numIter, lmbda)
> >>
> >> Error:
> >>
> >>     <class 'pyspark.sql.dataframe.DataFrame'>
> >>
> >>     Traceback (most recent call last):
> >>       File "D:\Project\Spark\code\movie_sql.py", line 109, in <module>
> >>         bestConf = getBestModel(sc, ssc, training, validation, validationNoRating)
> >>       File "D:\Project\Spark\code\movie_sql.py", line 54, in getBestModel
> >>         model = ALS.train(trainingRDD, rank, numIter, lmbda)
> >>       File "D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\mllib\recommendation.py", line 139, in train
> >>         model = callMLlibFunc("trainALSModel", cls._prepare(ratings), rank, iterations,
> >>       File "D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\mllib\recommendation.py", line 127, in _prepare
> >>         assert isinstance(ratings, RDD), "ratings should be RDD"
> >>     AssertionError: ratings should be RDD
> >>
> >> It was working fine in 1.2.0 (till last night :)).
> >>
> >> Any solution? I am thinking of mapping the training DataFrame back to an
> >> RDD, but I will lose the schema information.
> >>
> >> Best,
> >> Ayan
> >>
> >> On Mon, Apr 20, 2015 at 10:23 PM, ayan guha <guha.a...@gmail.com> wrote:
> >>> Hi,
> >>>
> >>> Just upgraded to Spark 1.3.1.
> >>>
> >>> I am getting a warning:
> >>>
> >>>     Warning (from warnings module):
> >>>       File "D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\sql\context.py", line 191
> >>>         warnings.warn("inferSchema is deprecated, please use createDataFrame instead")
> >>>     UserWarning: inferSchema is deprecated, please use createDataFrame instead
> >>>
> >>> However, the documentation still says to use inferSchema. Here:
> >>> http://spark.apache.org/docs/latest/sql-programming-guide.html
> >>>
> >>> Also, I am getting an error in mllib's ALS.train function when passing a
> >>> DataFrame (do I need to convert the DF to an RDD?)
> >>>
> >>> Code:
> >>>
> >>>     training = ssc.sql("select userId,movieId,rating from ratings where partitionKey < 6").cache()
> >>>     print type(training)
> >>>     model = ALS.train(training, rank, numIter, lmbda)
> >>>
> >>> Error:
> >>>
> >>>     <class 'pyspark.sql.dataframe.DataFrame'>
> >>>     Rank:8 Lmbda:1.0 iteration:10
> >>>
> >>>     Traceback (most recent call last):
> >>>       File "D:\Project\Spark\code\movie_sql.py", line 109, in <module>
> >>>         bestConf = getBestModel(sc, ssc, training, validation, validationNoRating)
> >>>       File "D:\Project\Spark\code\movie_sql.py", line 54, in getBestModel
> >>>         model = ALS.train(trainingRDD, rank, numIter, lmbda)
> >>>       File "D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\mllib\recommendation.py", line 139, in train
> >>>         model = callMLlibFunc("trainALSModel", cls._prepare(ratings), rank, iterations,
> >>>       File "D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\mllib\recommendation.py", line 127, in _prepare
> >>>         assert isinstance(ratings, RDD), "ratings should be RDD"
> >>>     AssertionError: ratings should be RDD
> >>>
> >>> --
> >>> Best Regards,
> >>> Ayan Guha
> >>
> >> --
> >> Best Regards,
> >> Ayan Guha
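For anyone hitting this thread later, here is a minimal standalone sketch of the `training.rdd` workaround Xiangrui describes. The toy data, app name, and local master are mine, not from the thread; `rank`, `numIter`, and `lmbda` mirror the values in Ayan's log (Rank:8, iteration:10, Lmbda:1.0). The key point is that on 1.3.x `pyspark.mllib.recommendation.ALS.train` still asserts `isinstance(ratings, RDD)`, so you unwrap the DataFrame with `.rdd` before calling it.

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.mllib.recommendation import ALS, Rating

sc = SparkContext("local[2]", "als-df-workaround")
sqlContext = SQLContext(sc)

# Hypothetical stand-in for the "ratings" table in Ayan's script.
rows = [(0, 0, 4.0), (0, 1, 2.0), (1, 0, 3.0), (1, 1, 5.0)]
training = sqlContext.createDataFrame(rows, ["userId", "movieId", "rating"])

# DataFrame.rdd yields an RDD[Row]; map each Row to a Rating so mllib
# receives plain (user, product, rating) records instead of a DataFrame.
trainingRDD = training.rdd.map(lambda r: Rating(r.userId, r.movieId, r.rating))

rank, numIter, lmbda = 8, 10, 1.0
model = ALS.train(trainingRDD, rank, numIter, lmbda)
print(model.predict(0, 1))

sc.stop()
```

As Ayan notes, the schema information is lost on the RDD side; that is inherent to the workaround, since mllib's API predates DataFrames.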
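On the deprecation warning in the oldest message: in 1.3, `SQLContext.createDataFrame` replaces `inferSchema`, as the warning text itself says. A minimal sketch of the switch (the movie data and column names are made up for illustration):

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext("local[2]", "infer-schema-migration")
sqlContext = SQLContext(sc)

rdd = sc.parallelize([(1, "Toy Story"), (2, "Jumanji")])

# Pre-1.3 style -- on 1.3.x this path emits
# "UserWarning: inferSchema is deprecated, please use createDataFrame instead".
# movies = sqlContext.inferSchema(rdd_of_rows)

# 1.3 style: createDataFrame infers the schema from the data itself.
movies = sqlContext.createDataFrame(rdd, ["movieId", "title"])
movies.registerTempTable("movies")
print(sqlContext.sql("select title from movies where movieId = 1").collect())

sc.stop()
```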