Chhavi Bansal created SPARK-48463:
-------------------------------------

             Summary: MLlib feature transformer unable to handle nested data
                 Key: SPARK-48463
                 URL: https://issues.apache.org/jira/browse/SPARK-48463
             Project: Spark
          Issue Type: Bug
          Components: ML, MLlib
    Affects Versions: 3.5.1
            Reporter: Chhavi Bansal


I am trying to use a feature transformer on nested data after flattening it, but the transformer fails.

 
{code:java}
import org.apache.spark.sql.{Column, Row}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{IntegerType, StructType}

val structureData = Seq(
  Row(Row(10, 12), 1000),
  Row(Row(12, 14), 4300),
  Row(Row(37, 891), 1400),
  Row(Row(8902, 12), 4000),
  Row(Row(12, 89), 1000)
)

val structureSchema = new StructType()
  .add("location", new StructType()
    .add("longitude", IntegerType)
    .add("latitude", IntegerType))
  .add("salary", IntegerType)

val df = spark.createDataFrame(spark.sparkContext.parallelize(structureData), structureSchema)

// Flattens nested struct fields; each flattened column keeps the dotted path
// (e.g. "location.longitude") as its literal column name.
def flattenSchema(schema: StructType, prefix: String = null, prefixSelect: String = null): Array[Column] = {
  schema.fields.flatMap { f =>
    val colName = if (prefix == null) f.name else prefix + "." + f.name
    val colnameSelect = if (prefix == null) f.name else prefixSelect + "." + f.name

    f.dataType match {
      case st: StructType => flattenSchema(st, colName, colnameSelect)
      case _ => Array(col(colName).as(colnameSelect))
    }
  }
}

val flattenColumns = flattenSchema(df.schema)
val flattenedDf = df.select(flattenColumns: _*){code}
Now apply a StringIndexer to the dot-notation column name.

 
{code:java}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.StringIndexer

val si = new StringIndexer()
  .setInputCol("location.longitude")
  .setOutputCol("longitutdee")
val pipeline = new Pipeline().setStages(Array(si))
pipeline.fit(flattenedDf).transform(flattenedDf).show() {code}
The above code fails with:
{code:java}
Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot resolve column name "location.longitude" among (location.longitude, location.latitude, salary); did you mean to quote the `location.longitude` column?
    at org.apache.spark.sql.errors.QueryCompilationErrors$.cannotResolveColumnNameAmongFieldsError(QueryCompilationErrors.scala:2261)
    at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$resolveException(Dataset.scala:258)
    at org.apache.spark.sql.Dataset.$anonfun$resolve$1(Dataset.scala:250)
..... {code}
This is the same failure we hit when selecting dot-notation columns directly from a Spark DataFrame, which is solved by wrapping the column name in BACKTICKS: *`column.name`*.

[https://stackoverflow.com/a/51430335/11688337]
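For reference, selecting with backticks does work at the DataFrame level on the flattened data (a quick sanity check against the {{flattenedDf}} built above):
{code:java}
// Backtick-quoting lets select() resolve the literal column name that contains a dot.
flattenedDf.select("`location.longitude`").show()

// Without backticks the same select fails with the AnalysisException shown above:
// flattenedDf.select("location.longitude").show()
{code}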

 

*So next,* I use backticks when defining the StringIndexer:
{code:java}
val si = new StringIndexer()
  .setInputCol("`location.longitude`")
  .setOutputCol("longitutdee") {code}
In this case *it fails again* (for a different reason), this time inside the StringIndexer code itself:
{code:java}
Exception in thread "main" org.apache.spark.SparkException: Input column `location.longitude` does not exist.
    at org.apache.spark.ml.feature.StringIndexerBase.$anonfun$validateAndTransformSchema$2(StringIndexer.scala:128)
    at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:244)
    at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
    at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
{code}
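My reading of this second failure (an assumption on my part, not verified against the MLlib source) is that the transformer matches the input column against the schema's literal field names, so the backtick-quoted string never matches the flattened field name:
{code:java}
// Illustration of the suspected mismatch: the literal field name has no backticks.
val fieldNames = flattenedDf.schema.fieldNames
fieldNames.contains("location.longitude")    // true  -- the literal flattened field name
fieldNames.contains("`location.longitude`")  // false -- backticks are just characters here
{code}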
 

This blocks me from using feature transformation functions on nested columns.
Any help in solving this problem would be highly appreciated.
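For completeness, the only workaround I can think of is to avoid dots altogether when flattening, e.g. by joining the path with underscores, so the StringIndexer can resolve the column without quoting. A sketch (assuming the same df as above and that renaming the flattened columns is acceptable):
{code:java}
// Sketch: flatten with "_" instead of "." so no column name contains a dot.
val flattenedNoDots = df.select(
  col("location.longitude").as("location_longitude"),
  col("location.latitude").as("location_latitude"),
  col("salary")
)

val siNoDots = new StringIndexer()
  .setInputCol("location_longitude")
  .setOutputCol("longitutdee")
new Pipeline().setStages(Array(siNoDots)).fit(flattenedNoDots).transform(flattenedNoDots).show()
{code}
This sidesteps the resolution issue but loses the dotted naming convention, so it is not a fix for the underlying bug.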


