Re: Using Dataframe API vs. RDD API?

Daniel O' Shaughnessy Fri, 05 Jan 2018 07:37:50 -0800

Hi Shane,

I've successfully used:

import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}

with PIO. Feature importance is available on the classification side too,
through the featureImportances member of the fitted
RandomForestClassificationModel.
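For reference, a minimal sketch of reading feature importances from a fitted forest. It assumes Spark 2.x and a local SparkSession; the toy data and the names FeatureImportanceSketch and trainToyForest are made up for illustration and are not part of PIO or of any template.

```scala
import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object FeatureImportanceSketch {

  // Hypothetical helper: fits a small forest on toy data so the model can be inspected.
  def trainToyForest(spark: SparkSession): RandomForestClassificationModel = {
    // Toy rows of (label, features); real data would come from your DataSource.
    val training = spark.createDataFrame(Seq(
      (0.0, Vectors.dense(0.0, 1.0)),
      (0.0, Vectors.dense(0.1, 0.0)),
      (1.0, Vectors.dense(1.0, 1.0)),
      (1.0, Vectors.dense(0.9, 0.0))
    )).toDF("label", "features")

    new RandomForestClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setNumTrees(10)
      .setSeed(42L)
      .fit(training)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("rf-importance-sketch")
      .getOrCreate()

    val model = trainToyForest(spark)

    // featureImportances is a Vector with one entry per input feature.
    println(model.featureImportances)

    spark.stop()
  }
}
```

The regression classes from the original question follow the same pattern: RandomForestRegressor.fit returns a RandomForestRegressionModel, which also has a featureImportances member.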

Very simple to convert RDDs to DFs as Pat mentioned, something like:

val RDD_2_DF = sqlContext.createDataFrame(yourRDD).toDF("col1", "col2")
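A slightly fuller sketch of the same conversion, in case it helps. It assumes Spark 2.x, where a SparkSession can stand in for sqlContext, and an RDD of case-class rows. The Observation type, its field names, and toTrainingFrame are hypothetical stand-ins for whatever your engine actually produces. The VectorAssembler step is there because the spark.ml estimators expect a single vector column of features.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical record type standing in for the rows your DataSource emits.
case class Observation(label: Double, f1: Double, f2: Double)

object RddToDfSketch {

  // Converts an RDD of case-class rows into a DataFrame with "label" and "features" columns.
  def toTrainingFrame(spark: SparkSession, rdd: RDD[Observation]): DataFrame = {
    // Column names are taken from the case-class field names.
    val raw = spark.createDataFrame(rdd)

    // Pack the numeric columns into the single vector column spark.ml expects.
    new VectorAssembler()
      .setInputCols(Array("f1", "f2"))
      .setOutputCol("features")
      .transform(raw)
      .select("label", "features")
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("rdd-to-df-sketch")
      .getOrCreate()

    val rdd = spark.sparkContext.parallelize(Seq(
      Observation(1.0, 0.5, 2.0),
      Observation(0.0, 1.5, 0.1)
    ))

    toTrainingFrame(spark, rdd).printSchema()

    spark.stop()
  }
}
```

If you only have a SparkContext in hand, SparkSession.builder().getOrCreate() should reuse it rather than start a second one.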



On Thu, 4 Jan 2018 at 23:10 Pat Ferrel <p...@occamsmachete.com> wrote:

> Actually there are libs that will read DFs from HBase
> https://svn.apache.org/repos/asf/hbase/hbase.apache.org/trunk/_chapters/spark.html
>
> This is out of band with PIO and should not be used IMO because the schema
> of the EventStore is not guaranteed to remain as-is. The safest way is to
> translate or get DFs integrated into PIO. I think there is an existing Jira
> that requests Spark ML support, which assumes DFs.
>
>
> On Jan 4, 2018, at 12:25 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
>
> Funny you should ask this. Yes, we are working on a DF based Universal
> Recommender but you have to convert the RDD into a DF since PIO does not
> read out data in the form of a DF (yet). This is a fairly simple step of
> maybe one line of code but would be better supported in PIO itself. The
> issue is that the EventStore uses libs that may not read out DFs, but RDDs.
> This is certainly the case with Elasticsearch, which provides an RDD lib. I
> haven’t seen one from them that reads out DFs, though it would make a lot of
> sense for ES especially.
>
> So, TL;DR: yes, just convert the RDD into a DF for now.
>
> Also please add a feature request as a PIO Jira ticket to look into this.
> I for one would +1
>
>
> On Jan 4, 2018, at 11:55 AM, Shane Johnson <shanewaldenjohn...@gmail.com>
> wrote:
>
> Hello group, Happy New Year! Does anyone have a working example or
> template using the DataFrame API rather than the RDD-based APIs? We want to
> migrate to the new DataFrame APIs to take advantage of the *Feature
> Importance* function for our Regression Random Forest Models.
>
> We want to move from
>
> import org.apache.spark.mllib.tree.RandomForest
> import org.apache.spark.mllib.tree.model.RandomForestModel
> import org.apache.spark.mllib.util.MLUtils
>
> to
>
> import org.apache.spark.ml.regression.{RandomForestRegressionModel, RandomForestRegressor}
>
>
> Is this something that should be fairly straightforward, by adjusting
> parameters and calling new classes within DASE, or is it a much more
> involved development effort?
>
> Thank You!
>
> *Shane Johnson | 801.360.3350*
> LinkedIn <https://www.linkedin.com/in/shanewjohnson> | Facebook
> <https://www.facebook.com/shane.johnson.71653>
>
>
>
