I have a dataframe and execute df.groupBy("xyzy").agg(collect_list("abc")).
This produces a column of type array. Now, for each row, I want to create
multiple pairs/tuples from the array so that I can build a contingency
table. Any idea how I can transform my data so that I can call crosstab()? The
join transformations operate on the entire dataframe; I need something that
works at the level of each row's array.
Below is some sample Python that describes what I would like my results to
be.
Kind regards
Andy
import pandas as pd

c1 = ["john", "bill", "sam"]
c2 = [['red', 'blue', 'red'], ['blue', 'red'], ['green']]
p = pd.DataFrame({"a":c1, "b":c2})
df = sqlContext.createDataFrame(p)
df.printSchema()
df.show()
root
 |-- a: string (nullable = true)
 |-- b: array (nullable = true)
 |    |-- element: string (containsNull = true)

+----+----------------+
|   a|               b|
+----+----------------+
|john|[red, blue, red]|
|bill|     [blue, red]|
| sam|         [green]|
+----+----------------+
The output I am trying to create is shown below. I could live with a crossJoin
(cartesian join) plus my own filtering if that makes the problem easier.
+----+----+
|  x1|  x2|
+----+----+
| red|blue|
| red| red|
|blue| red|
+----+----+
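
To make the per-row pairing concrete, here is a minimal plain-Python sketch (no Spark involved) of the transformation being asked for. It assumes each array should yield every pair of elements taken by position, which matches the three pairs listed for john's array. The helper name row_pairs is hypothetical, not a Spark or pandas function.

```python
from itertools import combinations


def row_pairs(values):
    """Return every pair (values[i], values[j]) with i < j, in positional order."""
    return list(combinations(values, 2))


# Same sample data as the dataframe above, keyed by column "a".
rows = {
    "john": ["red", "blue", "red"],
    "bill": ["blue", "red"],
    "sam": ["green"],
}

# One list of (x1, x2) tuples per row; a single-element array yields no pairs.
pairs = {name: row_pairs(colors) for name, colors in rows.items()}

for name, row in pairs.items():
    print(name, row)
```

Something along these lines could presumably be applied per row (for example inside a UDF, followed by flattening the resulting array into rows) to get the x1/x2 columns, but that part is an untested assumption.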