Instead of doing the initial grouping, you could possibly just do a full outer
self-join on xyzy. This is in Scala but should be easily convertible to Python.
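import org.apache.spark.sql.DataFrame
import spark.implicits._  // needed for toDF; assumes a SparkSession named spark

val data = Array(("john", "red"), ("john", "blue"), ("john", "red"),
  ("bill", "blue"), ("bill", "red"), ("sam", "green"))
val distData: DataFrame = spark.sparkContext.parallelize(data).toDF("a", "b")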
distData.show()
+----+-----+
|   a|    b|
+----+-----+
|john|  red|
|john| blue|
|john|  red|
|bill| blue|
|bill|  red|
| sam|green|
+----+-----+
distData.as("tbl1")
  .join(distData.as("tbl2"), Seq("a"), "fullouter")
  .select("tbl1.b", "tbl2.b")
  .distinct
  .show()
+-----+-----+
|    b|    b|
+-----+-----+
| blue|  red|
|  red| blue|
|  red|  red|
| blue| blue|
|green|green|
+-----+-----+
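
To see why the self-join gives every pair, here is a minimal plain-Python
sketch of the same logic without Spark: group the b values by a, take the
cartesian product of each group with itself, and keep the distinct pairs. The
function name distinct_pairs is made up for illustration, and this only mirrors
what the join computes on this sample data; it is not a Spark API.

```python
from collections import defaultdict
from itertools import product

# Same sample rows as the Scala example: (a, b)
rows = [("john", "red"), ("john", "blue"), ("john", "red"),
        ("bill", "blue"), ("bill", "red"), ("sam", "green")]


def distinct_pairs(rows):
    """Return the distinct (b, b) pairs that share the same key a."""
    by_key = defaultdict(list)
    for key, value in rows:
        by_key[key].append(value)

    pairs = set()
    for values in by_key.values():
        # Every value paired with every value under the same key,
        # which is what joining the table to itself on a produces.
        pairs.update(product(values, repeat=2))
    return pairs


for x1, x2 in sorted(distinct_pairs(rows)):
    print(x1, x2)
```

Note that this keeps self-pairs such as (red, red), just as the join does.
Filter them out afterwards if they are not wanted.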
From: Andy Davidson <[email protected]>
Date: Friday, March 30, 2018 at 8:58 PM
To: Andy Davidson <[email protected]>, user <[email protected]>
Subject: Re: how to create all possible combinations from an array? how to join
and explode row array?
I was a little sloppy when I created the sample output. It's missing a few pairs.
Assume for a given row I have [a, b, c]. I want to create something like the
cartesian join.
From: Andrew Davidson <[email protected]>
Date: Friday, March 30, 2018 at 5:54 PM
To: "user @spark" <[email protected]>
Subject: how to create all possible combinations from an array? how to join and
explode row array?
I have a dataframe and execute df.groupBy("xyzy").agg(collect_list("abc")).
This produces a column of type array. Now for each row I want to create
multiple pairs/tuples from the array so that I can create a contingency table.
Any idea how I can transform my data so that I can call crosstab()? The join
transformation operates on the entire dataframe. I need something at the
row-array level.
Below is some sample Python that describes what I would like my results to be.
Kind regards
Andy
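import pandas as pd

c1 = ["john", "bill", "sam"]
c2 = [['red', 'blue', 'red'], ['blue', 'red'], ['green']]
p = pd.DataFrame({"a": c1, "b": c2})
# sqlContext is assumed to already exist (e.g. in a notebook or pyspark shell)
df = sqlContext.createDataFrame(p)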
df.printSchema()
df.show()
root
 |-- a: string (nullable = true)
 |-- b: array (nullable = true)
 |    |-- element: string (containsNull = true)
+----+----------------+
|   a|               b|
+----+----------------+
|john|[red, blue, red]|
|bill|     [blue, red]|
| sam|         [green]|
+----+----------------+
The output I am trying to create is shown below. I could live with a crossJoin
(cartesian join) and add my own filtering if it makes the problem easier.
+----+----+
|  x1|  x2|
+----+----+
| red|blue|
| red| red|
|blue| red|
+----+----+
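
For reference, the row-level pairing described above can be sketched in plain
Python, outside Spark. This assumes the goal is the full cartesian product of
each row's array with itself, with self-pairs and duplicates kept and counted;
the exact filtering wanted is not spelled out in the thread. The name
pair_counts is hypothetical, and the resulting counts are the cells a
contingency table would hold.

```python
from collections import Counter
from itertools import product

# One array per row, as produced by collect_list in the question
arrays = {
    "john": ["red", "blue", "red"],
    "bill": ["blue", "red"],
    "sam": ["green"],
}


def pair_counts(arrays):
    """Count every (x1, x2) pair formed within each row's array."""
    counts = Counter()
    for values in arrays:
        # Cartesian product of the row's array with itself
        counts.update(product(values, repeat=2))
    return counts


counts = pair_counts(arrays.values())
for (x1, x2), n in sorted(counts.items()):
    print(x1, x2, n)
```

Any filtering (dropping self-pairs, treating (red, blue) and (blue, red) as the
same pair) can be applied to the pairs before counting.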