Thanks Mich, I have tried this solution, but I want all the columns from the dataframe df_1. If I explode df_1, I get only the data column, whereas the result should contain all the columns from df_1, with distinct rows, like below.
Results in

*df:*
+-------+
|column1|
+-------+
|      a|
|      b|
|      d|
+-------+

*df_1:*
+---+---------+-----------+
| id|     data|      field|
+---+---------+-----------+
|  1|[a, b, c]|['company']|
|  3|[b, c, d]|  ['hello']|
|  4|[e, f, s]|  ['hello']|
+---+---------+-----------+

*Result:*
+---+---------+-----------+
| id|     data|      field|
+---+---------+-----------+
|  1|[a, b, c]|['company']|
|  3|[b, c, d]|  ['hello']|
+---+---------+-----------+

Explanation: ids 1 and 3 are returned because a or b is present in records 1
and 3, so the join returns those rows, distinct.

I would like to get a result like the above: even if a matching element
occurs more than once in the data column, I need the distinct rows with
respect to the id field. But when I try array_contains, it returns
duplicate rows, since the data column can contain multiple matching
elements.

If you need more clarification, please let me know.

Thanks,

On Tue, May 14, 2024 at 6:12 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> You can use a combination of explode and distinct before joining.
>
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import explode
>
> # Create a SparkSession
> spark = SparkSession.builder \
>     .appName("JoinExample") \
>     .getOrCreate()
>
> sc = spark.sparkContext
> # Set the log level to ERROR to reduce verbosity
> sc.setLogLevel("ERROR")
>
> # Create DataFrame df
> data = [
>     ["a"],
>     ["b"],
>     ["d"],
> ]
> column_names = ["column1"]
> df = spark.createDataFrame(data, column_names)
> print("df:")
> df.show()
>
> # Create DataFrame df_1
> df_1 = spark.createDataFrame([(["a", "b", "c"],), ([],)], ['data'])
> print("df_1:")
> df_1.show()
>
> # Explode the array column in df_1
> exploded_df_1 = df_1.select(explode("data").alias("data"))
>
> # Join with df using the exploded column
> final_df = exploded_df_1.join(df, exploded_df_1.data == df.column1)
>
> # Distinct to ensure only unique rows are returned from df_1
> final_df = final_df.select("data").distinct()
>
> print("Result:")
> final_df.show()
>
> Results in
>
> df:
> +-------+
> |column1|
> +-------+
> |      a|
> |      b|
> |      d|
> +-------+
>
> df_1:
> +---------+
> |     data|
> +---------+
> |[a, b, c]|
> |       []|
> +---------+
>
> Result:
> +----+
> |data|
> +----+
> |   a|
> |   b|
> +----+
>
> HTH
>
> Mich Talebzadeh,
>
> Technologist | Architect | Data Engineer | Generative AI | FinCrime
> London
> United Kingdom
>
>
> view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
> https://en.everybodywiki.com/Mich_Talebzadeh
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed. It is essential to note
> that, as with any advice: "one test result is worth one-thousand expert
> opinions" (Wernher von Braun
> <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
>
>
> On Tue, 14 May 2024 at 13:19, Karthick Nk <kcekarth...@gmail.com> wrote:
>
>> Hi All,
>>
>> Does anyone have any idea or suggestion of an alternate way to achieve
>> this scenario?
>>
>> Thanks.
>>
>> On Sat, May 11, 2024 at 6:55 AM Damien Hawes <marley.ha...@gmail.com>
>> wrote:
>>
>>> Right now, with the structure of your data, it isn't possible.
>>>
>>> The rows aren't duplicates of each other. "a" and "b" both exist in the
>>> array, so Spark is correctly performing the join. It looks like you
>>> need to find another way to model this data to get what you want to
>>> achieve.
>>>
>>> Are the values of "a" and "b" related to each other in any way?
>>>
>>> - Damien
>>>
>>> On Fri, 10 May 2024 at 18:08, Karthick Nk <kcekarth...@gmail.com> wrote:
>>>
>>>> Hi Mich,
>>>>
>>>> Thanks for the solution, but I am getting duplicate results when using
>>>> array_contains. I have explained the scenario below; could you help me
>>>> with how to achieve this? I have tried different ways but could not
>>>> achieve it.
>>>> For example:
>>>>
>>>> data = [
>>>>     ["a"],
>>>>     ["b"],
>>>>     ["d"],
>>>> ]
>>>> column_names = ["column1"]
>>>> df = spark.createDataFrame(data, column_names)
>>>> df.display()
>>>>
>>>> [image: image.png]
>>>>
>>>> df_1 = spark.createDataFrame([(["a", "b", "c"],), ([],)], ['data'])
>>>> df_1.display()
>>>>
>>>> [image: image.png]
>>>>
>>>> final_df = df_1.join(df, expr("array_contains(data, column1)"))
>>>> final_df.display()
>>>>
>>>> Result:
>>>> [image: image.png]
>>>>
>>>> But I need the result like below:
>>>>
>>>> [image: image.png]
>>>>
>>>> Why? Because in df_1 I have only two records, and only the first
>>>> record has a matching value. But since both records from df, i.e. *a*
>>>> and *b*, are present in that first record, the join returns two rows.
>>>> My expectation is that if any of the values from df is present in a
>>>> df_1 record, that record should be returned only once.
>>>>
>>>> Note:
>>>> 1. Here we are able to filter the duplicate records by applying
>>>> distinct on the ID field in the resulting df, but I don't think that
>>>> is an effective way; I would rather handle it in the array_contains
>>>> step itself.
>>>>
>>>> Thanks.
>>>>
>>>> On Fri, Mar 1, 2024 at 4:11 AM Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> This is what you want: how to join two DFs with a string column in
>>>>> one and an array of strings in the other, keeping only rows where
>>>>> the string is present in the array.
>>>>> from pyspark.sql import SparkSession
>>>>> from pyspark.sql import Row
>>>>> from pyspark.sql.functions import expr
>>>>>
>>>>> spark = SparkSession.builder.appName("joins").getOrCreate()
>>>>>
>>>>> # this one has a column combined_id as an array of integers
>>>>> data1 = [Row(combined_id=[1, 2, 3]), Row(combined_id=[4, 5, 6])]
>>>>> # this one has column mr_id with single integers
>>>>> data2 = [Row(mr_id=2), Row(mr_id=5)]
>>>>>
>>>>> df1 = spark.createDataFrame(data1)
>>>>> df2 = spark.createDataFrame(data2)
>>>>>
>>>>> df1.printSchema()
>>>>> df2.printSchema()
>>>>>
>>>>> # Perform the join with array_contains. It takes two arguments: an
>>>>> # array and a value. It returns True if the value exists as an
>>>>> # element within the array, otherwise False.
>>>>> joined_df = df1.join(df2, expr("array_contains(combined_id, mr_id)"))
>>>>>
>>>>> # Show the result
>>>>> joined_df.show()
>>>>>
>>>>> root
>>>>>  |-- combined_id: array (nullable = true)
>>>>>  |    |-- element: long (containsNull = true)
>>>>>
>>>>> root
>>>>>  |-- mr_id: long (nullable = true)
>>>>>
>>>>> +-----------+-----+
>>>>> |combined_id|mr_id|
>>>>> +-----------+-----+
>>>>> |  [1, 2, 3]|    2|
>>>>> |  [4, 5, 6]|    5|
>>>>> +-----------+-----+
>>>>>
>>>>> HTH
>>>>>
>>>>> Mich Talebzadeh,
>>>>> Dad | Technologist | Solutions Architect | Engineer
>>>>> London
>>>>> United Kingdom
>>>>>
>>>>> view my Linkedin profile
>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>
>>>>> https://en.everybodywiki.com/Mich_Talebzadeh
>>>>> On Thu, 29 Feb 2024 at 20:50, Karthick Nk <kcekarth...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> I have two dataframes with the structure below, and I have to join
>>>>>> them. The scenario is that the join column is a string in one
>>>>>> dataframe and an array of strings in the other, so we have to inner
>>>>>> join the two dataframes and return the data if the string value is
>>>>>> present in any of the array-of-string values in the other dataframe.
>>>>>>
>>>>>> df1 = spark.sql("""
>>>>>> SELECT
>>>>>>     mr.id as mr_id,
>>>>>>     pv.id as pv_id,
>>>>>>     array(mr.id, pv.id) as combined_id
>>>>>> FROM
>>>>>>     table1 mr
>>>>>>     INNER JOIN table2 pv ON pv.id = mr.recordid
>>>>>> WHERE
>>>>>>     pv.id = '35122806-4cd2-4916-a149-24ea55c2dc36'
>>>>>>     OR pv.id = 'a5f03625-6cc5-49df-95eb-df741fe9139b'
>>>>>> """)
>>>>>>
>>>>>> # df1.display()
>>>>>>
>>>>>> # Your second query
>>>>>> df2 = spark.sql("""
>>>>>> SELECT
>>>>>>     id
>>>>>> FROM
>>>>>>     table2
>>>>>> WHERE
>>>>>>     id = '35122806-4cd2-4916-a149-24ea55c2dc36'
>>>>>> """)
>>>>>>
>>>>>> Result data:
>>>>>> 35122806-4cd2-4916-a149-24ea55c2dc36 only, because this record alone
>>>>>> is common between the string and the array-of-string values.
>>>>>>
>>>>>> Can you share a sample snippet of how we can do the join for these
>>>>>> two different datatypes in the dataframes?
>>>>>>
>>>>>> If any clarification is needed, please feel free to ask.
>>>>>>
>>>>>> Thanks