Hello Andrew,
A few years ago I had the same need, and I found this Stack Overflow answer
<https://stackoverflow.com/a/36306784/898154> to be the way to go.
Here is an extract of my (Scala) code, which originally did other things on
top. I have removed the irrelevant parts without testing the result, so it
might not work out of the box; nonetheless, it should help you get started:
import org.apache.spark.sql.DataFrame

private def getEncodedVectorLookupTable(df: DataFrame,
                                        featuresColName: String): Map[Long, String] = {
  // ML attribute metadata attached to the features column
  val meta = df.select(featuresColName)
    .schema.fields.head.metadata
    .getMetadata("ml_attr")
    .getMetadata("attrs")

  /* REFLECTION START */
  // The keys of the metadata are not publicly accessible, so read the "map" field reflectively
  val field = meta.getClass.getDeclaredField("map")
  field.setAccessible(true)
  val keys = field.get(meta).asInstanceOf[Map[String, Any]].keySet
  field.setAccessible(false)
  /* REFLECTION END */

  // Each key holds an array of attributes; collect idx -> name across all of them
  keys.flatMap(
    meta.getMetadataArray(_)
      .map(m => m.getLong("idx") -> m.getString("name"))
  ).toMap
}
It looks like there is some support now for achieving this, but I have
never tried it:
https://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/r/RWrapperUtils.html
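
Since the question uses PySpark, here is a minimal Python sketch of the same lookup. It assumes the column metadata has the layout the Scala extract reads ("ml_attr" -> "attrs" -> groups of entries with "idx" and "name"). The function name encoded_vector_lookup and the sample metadata are made up for illustration, and the sketch works on a plain dict so it can be tried without a Spark session:

```python
from typing import Any, Dict, Mapping


def encoded_vector_lookup(column_metadata: Mapping[str, Any]) -> Dict[int, str]:
    """Map each feature-vector index to its attribute name.

    Assumed layout (the same one the Scala extract reads):
    {"ml_attr": {"attrs": {<group>: [{"idx": ..., "name": ...}, ...]}}}
    """
    attrs = column_metadata.get("ml_attr", {}).get("attrs", {})
    lookup: Dict[int, str] = {}
    # Attributes are grouped by type (for example "numeric" or "binary"),
    # so flatten every group into a single index -> name table.
    for group in attrs.values():
        for attr in group:
            lookup[int(attr["idx"])] = attr["name"]
    return lookup


# Hypothetical metadata, shaped like what might be attached to "features".
example_metadata = {
    "ml_attr": {
        "attrs": {
            "numeric": [{"idx": 0, "name": "log_treatment"}],
            "binary": [
                {"idx": 1, "name": "hour_of_day_1"},
                {"idx": 2, "name": "month_of_year_2"},
            ],
        },
        "num_attrs": 3,
    }
}

names = encoded_vector_lookup(example_metadata)
# Order the names by index so they line up with the coefficient vector.
ordered_names = [names[i] for i in sorted(names)]
```

With a real DataFrame, the input would presumably be rform_regression_input.schema["features"].metadata, and ordered_names could then be added as a column next to the coefficients (plus one extra entry for the intercept). I have not run this against Spark, so treat it as a starting point.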
Best regards,
Alessandro
On Mon, 28 Oct 2019 at 21:01, Andrew Redd <[email protected]> wrote:
>
> Hi All!
>
> I'm performing an econometric analysis over several billion rows of data
> and would like to use the Pyspark SparkML implementation of linear
> regression. In the example below I'm trying to interact hour of day and
> month of year indicators. The StringIndexer documentation tells you what
> it's doing when it's one hot encoding string/factor columns (i.e. taking
> out the most/least common value or first/last when sorted alphabetically)
> but doesn't allow you to recover your coefficient names. This feels like
> such a general case that I must be missing something. How can I get my
> column names back post regression to map to coefficient values? Do I need
> to basically rebuild the RFormula logic if this isn't already
> implemented? Would be happy to use a different Spark language (Scala/Java
> etc. ) if implemented there.
>
> Thanks in advance
>
> Andrew
>
> rform = RFormula(
>     formula=("log_outcome ~ log_treatment + hour_of_day + "
>              "month_of_year + hour_of_day:month_of_year + additional_column"),
>     featuresCol="features",
>     labelCol="label")
>
> rform_regression_input = (
>     rform.fit(regression_input).transform(regression_input))
>
> lr = LinearRegression(featuresCol='features',
>                       labelCol='label',
>                       solver='normal')
>
> lr_model = lr.fit(rform_regression_input)
> coefs = [*lr_model.coefficients, lr_model.intercept]
>
> return pd.DataFrame(
>     {"pvalues": lr_model.summary.pValues,
>      "tvalues": lr_model.summary.tValues,
>      "std_errs": lr_model.summary.coefficientStandardErrors,
>      "coefs": coefs}
> )
>
>