NICE! Thanks Brandon
Andy

From: Brandon Geise <brandonge...@gmail.com>
Date: Friday, March 30, 2018 at 6:15 PM
To: Andrew Davidson <a...@santacruzintegration.com>, "user @spark" <user@spark.apache.org>
Subject: Re: how to create all possible combinations from an array? how to join and explode row array?

> Possibly, instead of doing the initial grouping, just do a full outer join on
> zyzy. This is in Scala but should be easily convertible to Python.
>
> val data = Array(("john", "red"), ("john", "blue"), ("john", "red"),
>   ("bill", "blue"), ("bill", "red"), ("sam", "green"))
> val distData: DataFrame = spark.sparkContext.parallelize(data).toDF("a", "b")
> distData.show()
>
> +----+-----+
> |   a|    b|
> +----+-----+
> |john|  red|
> |john| blue|
> |john|  red|
> |bill| blue|
> |bill|  red|
> | sam|green|
> +----+-----+
>
> distData.as("tbl1").join(distData.as("tbl2"), Seq("a"), "fullouter")
>   .select("tbl1.b", "tbl2.b").distinct.show()
>
> +-----+-----+
> |    b|    b|
> +-----+-----+
> | blue|  red|
> |  red| blue|
> |  red|  red|
> | blue| blue|
> |green|green|
> +-----+-----+
>
> From: Andy Davidson <a...@santacruzintegration.com>
> Date: Friday, March 30, 2018 at 8:58 PM
> To: Andy Davidson <a...@santacruzintegration.com>, user <user@spark.apache.org>
> Subject: Re: how to create all possible combinations from an array? how to join and explode row array?
>
> I was a little sloppy when I created the sample output. It is missing a few
> pairs.
>
> Assume for a given row I have [a, b, c]; I want to create something like the
> Cartesian join.
>
> From: Andrew Davidson <a...@santacruzintegration.com>
> Date: Friday, March 30, 2018 at 5:54 PM
> To: "user @spark" <user@spark.apache.org>
> Subject: how to create all possible combinations from an array? how to join and explode row array?
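To see which distinct pairs Brandon's full-outer self-join produces before writing the PySpark version, here is a minimal plain-Python sketch of the same idea (no Spark required; the dict-based grouping just mimics the join condition on column "a"):

```python
from itertools import product

# Brandon's sample data: (key, color) rows
data = [("john", "red"), ("john", "blue"), ("john", "red"),
        ("bill", "blue"), ("bill", "red"), ("sam", "green")]

# Group colors by key; joining a table to itself on "a" pairs up
# exactly the rows that share a key.
by_key = {}
for key, color in data:
    by_key.setdefault(key, []).append(color)

# Self-join within each key, then de-duplicate (mimics .distinct())
pairs = {p for colors in by_key.values() for p in product(colors, colors)}

print(sorted(pairs))
```

The five distinct pairs match the DataFrame output above: (blue, blue), (blue, red), (green, green), (red, blue), (red, red).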
>
>> I have a dataframe and execute df.groupBy("xyzy").agg(collect_list("abc"))
>>
>> This produces a column of type array. Now for each row I want to create
>> multiple pairs/tuples from the array so that I can create a contingency
>> table. Any idea how I can transform my data so that I can call crosstab()?
>> The join transformation operates on the entire dataframe; I need something
>> at the row/array level.
>>
>> Below is some sample Python that describes what I would like my results
>> to be.
>>
>> Kind regards
>>
>> Andy
>>
>> c1 = ["john", "bill", "sam"]
>> c2 = [['red', 'blue', 'red'], ['blue', 'red'], ['green']]
>> p = pd.DataFrame({"a": c1, "b": c2})
>>
>> df = sqlContext.createDataFrame(p)
>> df.printSchema()
>> df.show()
>>
>> root
>>  |-- a: string (nullable = true)
>>  |-- b: array (nullable = true)
>>  |    |-- element: string (containsNull = true)
>>
>> +----+----------------+
>> |   a|               b|
>> +----+----------------+
>> |john|[red, blue, red]|
>> |bill|     [blue, red]|
>> | sam|         [green]|
>> +----+----------------+
>>
>> The output I am trying to create is below. I could live with a crossJoin
>> (Cartesian join) and add my own filtering if it makes the problem easier.
>>
>> +----+----+
>> |  x1|  x2|
>> +----+----+
>> | red|blue|
>> | red| red|
>> |blue| red|
>> +----+----+
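For the per-row pairing in the question, itertools.combinations over each collected array yields exactly the three pairs shown in the desired output for john's row. This is a plain-Python sketch only; wrapping it in a PySpark UDF (or flatMap) is an assumption and not shown here:

```python
from itertools import combinations

# Sample rows as they would look after groupBy/collect_list
rows = {"john": ["red", "blue", "red"],
        "bill": ["blue", "red"],
        "sam": ["green"]}

# For each row, emit every size-2 combination of the array elements
# (in position order); john's [red, blue, red] gives the desired
# pairs (red, blue), (red, red), (blue, red).
pairs = {name: list(combinations(colors, 2)) for name, colors in rows.items()}

for name, ps in pairs.items():
    for x1, x2 in ps:
        print(name, x1, x2)
```

Note that a single-element array like sam's produces no pairs at all, which may or may not be what a contingency table needs; the self-join approach earlier in the thread keeps (green, green) instead.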