What's wrong with just using a UDF that does a for loop in Scala? You can change
the for loop logic to produce whatever combinations you want.
scala> spark.version
res4: String = 2.2.1
scala> aggDS.printSchema
root
|-- name: string (nullable = true)
|-- colors: array (nullable = true)
| |-- element: string (containsNull = true)
scala> aggDS.show(false)
+----+----------------+
|name|colors          |
+----+----------------+
|john|[red, blue, red]|
|bill|[blue, red]     |
|sam |[gree]          |
+----+----------------+
scala> import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.functions.udf
scala> val loopUDF = udf { x: Seq[String] => for (a <- x; b <-x) yield (a,b) }
loopUDF: org.apache.spark.sql.expressions.UserDefinedFunction =
UserDefinedFunction(<function1>,ArrayType(StructType(StructField(_1,StringType,true),
StructField(_2,StringType,true)),true),Some(List(ArrayType(StringType,true))))
scala> aggDS.withColumn("newCol", loopUDF($"colors")).show(false)
+----+----------------+---------------------------------------------------------------------------------------------------------+
|name|colors          |newCol                                                                                                   |
+----+----------------+---------------------------------------------------------------------------------------------------------+
|john|[red, blue, red]|[[red,red], [red,blue], [red,red], [blue,red], [blue,blue], [blue,red], [red,red], [red,blue], [red,red]]|
|bill|[blue, red]     |[[blue,blue], [blue,red], [red,blue], [red,red]]                                                         |
|sam |[gree]          |[[gree,gree]]                                                                                            |
+----+----------------+---------------------------------------------------------------------------------------------------------+
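To illustrate what "changing the for loop logic" could look like, here is a minimal sketch in plain Python (the language of the original question) of three pairing rules applied to a single row's array. The function names are hypothetical and not part of Spark; each could presumably be wrapped in a UDF returning an array of two-field structs, in the same way as loopUDF above.

```python
from itertools import combinations, product

def all_pairs(xs):
    """Full Cartesian product of the array with itself: n * n pairs."""
    return list(product(xs, repeat=2))

def ordered_pairs_no_self(xs):
    """Cartesian product minus pairs that reuse the same position: n * (n - 1) pairs."""
    return [(a, b) for i, a in enumerate(xs) for j, b in enumerate(xs) if i != j]

def unordered_pairs(xs):
    """Each pair of positions exactly once: n * (n - 1) / 2 pairs."""
    return list(combinations(xs, 2))

row = ["red", "blue", "red"]
print(all_pairs(row))
print(ordered_pairs_no_self(row))
print(unordered_pairs(row))
```

For [red, blue, red], the first rule yields the same nine pairs as the Scala loop, while unordered_pairs yields (red, blue), (red, red), (blue, red). Note that all three rules work on positions, so duplicate values in the array produce duplicate pairs.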
Yong
________________________________
From: Andy Davidson <[email protected]>
Sent: Friday, March 30, 2018 8:58 PM
To: Andy Davidson; user
Subject: Re: how to create all possible combinations from an array? how to join
and explode row array?
I was a little sloppy when I created the sample output. It's missing a few pairs.
Assume that for a given row I have [a, b, c]. I want to create something like the
Cartesian join.
From: Andrew Davidson
<[email protected]>
Date: Friday, March 30, 2018 at 5:54 PM
To: "user @spark" <[email protected]>
Subject: how to create all possible combinations from an array? how to join and
explode row array?
I have a dataframe and execute df.groupBy("xyzy").agg(collect_list("abc")).
This produces a column of type array. Now, for each row, I want to create
multiple pairs/tuples from the array so that I can build a contingency table.
Any idea how I can transform my data so that I can call crosstab()? The join
transformation operates on the entire dataframe; I need something at the
row-array level.
Below is some sample Python that describes what I would like my results to be.
Kind regards
Andy
import pandas as pd

# sqlContext is assumed to be the one provided by the pyspark shell
c1 = ["john", "bill", "sam"]
c2 = [['red', 'blue', 'red'], ['blue', 'red'], ['green']]
p = pd.DataFrame({"a": c1, "b": c2})
df = sqlContext.createDataFrame(p)
df.printSchema()
df.show()
root
|-- a: string (nullable = true)
|-- b: array (nullable = true)
| |-- element: string (containsNull = true)
+----+----------------+
|   a|               b|
+----+----------------+
|john|[red, blue, red]|
|bill|     [blue, red]|
| sam|         [green]|
+----+----------------+
The output I am trying to create is shown below. I could live with a crossJoin
(Cartesian join) and add my own filtering if that makes the problem easier.
+----+----+
|  x1|  x2|
+----+----+
| red|blue|
| red| red|
|blue| red|
+----+----+
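As a rough sketch of the contingency table these pairs are meant to feed, the following plain-Python snippet counts (x1, x2) pairs across all rows of the sample data. It assumes the full Cartesian pairing within each row and uses collections.Counter as a stand-in for crosstab(); the helper name pair_counts is hypothetical. In Spark, the analogous steps would presumably be to explode the per-row pair array into rows and then call crosstab() on the two resulting columns.

```python
from collections import Counter
from itertools import product

def pair_counts(rows):
    """Count every (x1, x2) pair from the Cartesian product of each row's array."""
    counts = Counter()
    for colors in rows:
        counts.update(product(colors, repeat=2))
    return counts

# Same arrays as column "b" in the sample data
c2 = [["red", "blue", "red"], ["blue", "red"], ["green"]]
counts = pair_counts(c2)

# Print a small contingency table: rows are x1, columns are x2
labels = sorted({c for colors in c2 for c in colors})
print("x1_x2".ljust(8) + "".join(label.rjust(6) for label in labels))
for x1 in labels:
    print(x1.ljust(8) + "".join(str(counts[(x1, x2)]).rjust(6) for x2 in labels))
```

A different pairing rule (for example, dropping self-pairs) only changes the expression passed to counts.update; the counting step stays the same.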