Hi All,

Does anyone have any ideas or suggestions for an alternative way to achieve
this scenario?

Thanks.

On Sat, May 11, 2024 at 6:55 AM Damien Hawes <marley.ha...@gmail.com> wrote:

> Right now, with the structure of your data, it isn't possible.
>
> The rows aren't duplicates of each other. "a" and "b" both exist in the
> array. So Spark is correctly performing the join. It looks like you need to
> find another way to model this data to get what you want to achieve.
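>
> If the goal is just one output row per array, one alternative (a rough,
> untested sketch) is to collect the lookup values into a single array and
> check for overlap with arrays_overlap, so each row of df_1 matches at
> most once:
>
> from pyspark.sql import functions as F
>
> # Collect every df.column1 value into one array (a single-row DataFrame).
> keys = df.agg(F.collect_set("column1").alias("keys"))
>
> # Cross join the one-row keys table onto df_1, then keep rows whose
> # "data" array shares at least one element with the collected keys.
> result = (df_1.crossJoin(keys)
>           .where(F.arrays_overlap("data", "keys"))
>           .drop("keys"))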
>
> Are the values of "a" and "b" related to each other in any way?
>
> - Damien
>
> On Fri, May 10, 2024 at 18:08, Karthick Nk <kcekarth...@gmail.com> wrote:
>
>> Hi Mich,
>>
>> Thanks for the solution, but I am getting duplicate results when using
>> array_contains. I have explained the scenario below. Could you help me with
>> how to achieve this? I have tried different ways but have not been able to
>> achieve it.
>>
>> For example
>>
>> data = [
>>     ["a"],
>>     ["b"],
>>     ["d"],
>> ]
>> column_names = ["column1"]
>> df = spark.createDataFrame(data, column_names)
>> df.display()
>>
>> [image: df.display() output - column1 with rows a, b, d]
>>
>> df_1 = spark.createDataFrame([(["a", "b", "c"],), ([],)], ['data'])
>> df_1.display()
>> [image: df_1.display() output - data with rows ["a", "b", "c"] and []]
>>
>>
>> from pyspark.sql.functions import expr
>>
>> final_df = df_1.join(df, expr("array_contains(data, column1)"))
>> final_df.display()
>>
>> Result:
>> [image: join output - two rows, (["a", "b", "c"], a) and (["a", "b", "c"], b)]
>>
>> But I need the result like below:
>>
>> [image: expected output - only one row from df_1, (["a", "b", "c"])]
>>
>> Why? Because df_1 has only two records, and only the first record has any
>> matching values. Since both records from df (i.e. *a, b*) are present in
>> that first record, the join returns two records. My expectation is that it
>> returns only one record: if any of the records from df is present in a
>> record of df_1, that df_1 record should be returned only once.
>>
>> Note:
>> 1. We are able to filter out the duplicate records by taking distinct on an
>> ID field in the resulting df, but I don't think that is an effective way;
>> rather, I am thinking of handling it in the array_contains step itself.
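>>
>> For example, something like a left semi join (a rough sketch; as I
>> understand it, a semi join keeps only df_1's columns and returns each
>> df_1 row at most once, whenever at least one column1 value matches):
>>
>> from pyspark.sql.functions import expr
>>
>> # A left semi join returns each df_1 row at most once: the row is kept
>> # if at least one df.column1 value is contained in its "data" array.
>> final_df = df_1.join(df, expr("array_contains(data, column1)"), "left_semi")
>> final_df.display()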
>>
>> Thanks.
>>
>>
>> On Fri, Mar 1, 2024 at 4:11 AM Mich Talebzadeh <mich.talebza...@gmail.com>
>> wrote:
>>
>>>
>>> This is what you want: how to join two DFs with a value column in one and
>>> an array column in the other, keeping only rows where the value is present
>>> in the array.
>>>
>>> from pyspark.sql import SparkSession
>>> from pyspark.sql import Row
>>> from pyspark.sql.functions import expr
>>>
>>> spark = SparkSession.builder.appName("joins").getOrCreate()
>>>
>>> # df1 has a column combined_id that is an array of integers
>>> data1 = [Row(combined_id=[1, 2, 3]), Row(combined_id=[4, 5, 6])]
>>> # df2 has a column mr_id holding single integers
>>> data2 = [Row(mr_id=2), Row(mr_id=5)]
>>>
>>> df1 = spark.createDataFrame(data1)
>>> df2 = spark.createDataFrame(data2)
>>>
>>> df1.printSchema()
>>> df2.printSchema()
>>>
>>> # Perform the join with array_contains. It takes two arguments: an
>>> # array and a value. It returns True if the value exists as an element
>>> # within the array, otherwise False.
>>> joined_df = df1.join(df2, expr("array_contains(combined_id, mr_id)"))
>>>
>>> # Show the result
>>> joined_df.show()
>>>
>>> root
>>>  |-- combined_id: array (nullable = true)
>>>  |    |-- element: long (containsNull = true)
>>>
>>> root
>>>  |-- mr_id: long (nullable = true)
>>>
>>> +-----------+-----+
>>> |combined_id|mr_id|
>>> +-----------+-----+
>>> |  [1, 2, 3]|    2|
>>> |  [4, 5, 6]|    5|
>>> +-----------+-----+
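>>>
>>> As a variant (a sketch, not tested here): you can also explode the array
>>> and use a regular equi-join, which lets Spark pick standard equi-join
>>> strategies instead of a broadcast nested loop join:
>>>
>>> from pyspark.sql.functions import explode
>>>
>>> # Turn each array element into its own row, then equi-join on the element.
>>> exploded = df1.select("combined_id", explode("combined_id").alias("elem"))
>>> alt_df = exploded.join(df2, exploded.elem == df2.mr_id).drop("elem")
>>> alt_df.show()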
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Dad | Technologist | Solutions Architect | Engineer
>>> London
>>> United Kingdom
>>>
>>>
>>>    view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* The information provided is correct to the best of my
>>> knowledge but of course cannot be guaranteed. It is essential to note
>>> that, as with any advice: "one test result is worth one-thousand expert
>>> opinions" (Wernher von Braun
>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
>>>
>>>
>>> On Thu, 29 Feb 2024 at 20:50, Karthick Nk <kcekarth...@gmail.com> wrote:
>>>
>>>> Hi All,
>>>>
>>>> I have two dataframes with the structure below that I have to join. The
>>>> scenario: the join column is a string in one dataframe, while in the
>>>> other df the join column is an array of strings. We have to inner join
>>>> the two dfs and keep a row if the string value is present in the
>>>> array-of-strings value of the other dataframe.
>>>>
>>>>
>>>> df1 = spark.sql("""
>>>>     SELECT
>>>>         mr.id as mr_id,
>>>>         pv.id as pv_id,
>>>>         array(mr.id, pv.id) as combined_id
>>>>     FROM
>>>>         table1 mr
>>>>         INNER JOIN table2 pv ON pv.id = mr.recordid
>>>>     WHERE
>>>>         pv.id = '35122806-4cd2-4916-a149-24ea55c2dc36'
>>>>         OR pv.id = 'a5f03625-6cc5-49df-95eb-df741fe9139b'
>>>> """)
>>>>
>>>> # df1.display()
>>>>
>>>> # Your second query
>>>> df2 = spark.sql("""
>>>>     SELECT
>>>>         id
>>>>     FROM
>>>>         table2
>>>>     WHERE
>>>>         id = '35122806-4cd2-4916-a149-24ea55c2dc36'
>>>>
>>>> """)
>>>>
>>>>
>>>>
>>>> Result data:
>>>> 35122806-4cd2-4916-a149-24ea55c2dc36 only, because this record alone is
>>>> common between the string and the array-of-strings values.
>>>>
>>>> Can you share a sample snippet showing how we can do the join for these
>>>> two different datatypes in the dataframes?
>>>>
>>>> If any clarification is needed, please feel free to ask.
>>>>
>>>> Thanks
>>>>
>>>>
