Re: pyspark dataframe join with two different data type

Karthick Nk Fri, 10 May 2024 06:40:25 -0700

Hi Mich,

Thanks for the solution, But I am getting duplicate result by using
array_contains. I have explained the scenario below, could you help me on
that, how we can achieve i have tried different way bu i could able to
achieve.


For example

data = [
    ["a"],
    ["b"],
    ["d"],
]
column_names = ["column1"]
df = spark.createDataFrame(data, column_names)
df.display()

[image: image.png]

df_1 = spark.createDataFrame([(["a", "b", "c"],), ([],)], ['data'])
df_1.display()
[image: image.png]


final_df = df_1.join(df, expr("array_contains(data, column1)"))
final_df.display()

Resul:
[image: image.png]

But i need the result like below:

[image: image.png]

Why because

In the df_1 i have only two records, in that first records onlly i have
matching value.
But both records from the df i.e *a, b* are present in the first records
itself, it is returning two records as resultant, but my expectation is to
return only one records means if any of the records from the df is present
in the df_1 it should return only one records from the df_1.

Note:
1. Here we are able to filter the duplicate records by using distinct of ID
field in the resultant df, bu I am thinking that shouldn't be effective
way, rather i am thinking of updating in array_contains steps itself.

Thanks.


On Fri, Mar 1, 2024 at 4:11 AM Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

>
> This is what you want, how to join two DFs with a string column in one and
> an array of strings in the other, keeping only rows where the string is
> present in the array.
>
> from pyspark.sql import SparkSession
> from pyspark.sql import Row
> from pyspark.sql.functions import expr
>
> spark = SparkSession.builder.appName("joins").getOrCreate()
>
> data1 = [Row(combined_id=[1, 2, 3])  # this one has a column combined_id
> as an array of integers
> data2 = [Row(mr_id=2), Row(mr_id=5)] # this one has column mr_id with
> single integers
>
> df1 = spark.createDataFrame(data1)
> df2 = spark.createDataFrame(data2)
>
> df1.printSchema()
> df2.printSchema()
>
> # Perform the join with array_contains. It takes two arguments: an array
> and a value. It returns True if the value exists as an element within the
> array, otherwise False.
> joined_df = df1.join(df2, expr("array_contains(combined_id, mr_id)"))
>
> # Show the result
> joined_df.show()
>
> root
>  |-- combined_id: array (nullable = true)
>  |    |-- element: long (containsNull = true)
>
> root
>  |-- mr_id: long (nullable = true)
>
> +-----------+-----+
> |combined_id|mr_id|
> +-----------+-----+
> |  [1, 2, 3]|    2|
> |  [4, 5, 6]|    5|
> +-----------+-----+
>
> HTH
>
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
>
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
> Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>
>
> On Thu, 29 Feb 2024 at 20:50, Karthick Nk <kcekarth...@gmail.com> wrote:
>
>> Hi All,
>>
>> I have two dataframe with below structure, i have to join these two
>> dataframe - the scenario is one column is string in one dataframe and in
>> other df join column is array of string, so we have to inner join two df
>> and get the data if string value is present in any of the array of string
>> value in another dataframe,
>>
>>
>> df1 = spark.sql("""
>>     SELECT
>>         mr.id as mr_id,
>>         pv.id as pv_id,
>>         array(mr.id, pv.id) as combined_id
>>     FROM
>>         table1 mr
>>         INNER JOIN table2 pv ON pv.id = Mr.recordid
>>        where
>>         pv.id = '35122806-4cd2-4916-a149-24ea55c2dc36'
>>         or pv.id = 'a5f03625-6cc5-49df-95eb-df741fe9139b'
>> """)
>>
>> # df1.display()
>>
>> # Your second query
>> df2 = spark.sql("""
>>     SELECT
>>         id
>>     FROM
>>         table2
>>     WHERE
>>         id = '35122806-4cd2-4916-a149-24ea55c2dc36'
>>
>> """)
>>
>>
>>
>> Result data:
>> 35122806-4cd2-4916-a149-24ea55c2dc36 only, because this records alone is
>> common between string and array of string value.
>>
>> Can you share the sample snippet, how we can do the join for this two
>> different datatype in the dataframe.
>>
>> if any clarification needed, pls feel free to ask.
>>
>> Thanks
>>
>>

Re: pyspark dataframe join with two different data type

Reply via email to