Here is what you want: how to join two DataFrames where the join column holds
a single value in one and an array of values in the other, keeping only the
rows where the value is present in the array (illustrated with integers below;
the same works for strings).

from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql.functions import expr

spark = SparkSession.builder.appName("joins").getOrCreate()

data1 = [Row(combined_id=[1, 2, 3]),
         Row(combined_id=[4, 5, 6])]  # column combined_id is an array of integers
data2 = [Row(mr_id=2), Row(mr_id=5)]  # column mr_id holds single integers

df1 = spark.createDataFrame(data1)
df2 = spark.createDataFrame(data2)

df1.printSchema()
df2.printSchema()

# Join with array_contains(array, value): it returns true when the value
# exists as an element of the array. join() defaults to an inner join, so
# only the matching rows are kept.
joined_df = df1.join(df2, expr("array_contains(combined_id, mr_id)"))

# Show the result
joined_df.show()

root
 |-- combined_id: array (nullable = true)
 |    |-- element: long (containsNull = true)

root
 |-- mr_id: long (nullable = true)

+-----------+-----+
|combined_id|mr_id|
+-----------+-----+
|  [1, 2, 3]|    2|
|  [4, 5, 6]|    5|
+-----------+-----+
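
The same pattern applies directly to your case where the IDs are string UUIDs
rather than integers. A minimal sketch, using the two UUIDs from your query as
placeholder data (substitute your real df1 and df2):

from pyspark.sql import Row
from pyspark.sql.functions import expr

# placeholder rows built from the UUIDs in your query, not your real data
data1 = [Row(combined_id=["35122806-4cd2-4916-a149-24ea55c2dc36",
                          "a5f03625-6cc5-49df-95eb-df741fe9139b"])]
data2 = [Row(id="35122806-4cd2-4916-a149-24ea55c2dc36")]

df1 = spark.createDataFrame(data1)
df2 = spark.createDataFrame(data2)

# inner join keeps only rows where df2.id appears in df1.combined_id
joined_df = df1.join(df2, expr("array_contains(combined_id, id)"))
joined_df.show(truncate=False)

One caveat: array_contains is a non-equi join condition, so Spark cannot use a
hash join and will typically fall back to a broadcast nested loop join. If
both tables are large, it is usually cheaper to explode the array into one row
per element and do a plain equi-join, for example:

from pyspark.sql.functions import col, explode

# explode the array so the join becomes an ordinary equi-join on elem = id
exploded = df1.withColumn("elem", explode(col("combined_id")))
joined_df = exploded.join(df2, exploded["elem"] == df2["id"]).drop("elem")
joined_df.show(truncate=False)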

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   View my LinkedIn profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note that,
as with any advice, "one test result is worth one-thousand expert opinions"
(Wernher von Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>).


On Thu, 29 Feb 2024 at 20:50, Karthick Nk <kcekarth...@gmail.com> wrote:

> Hi All,
>
> I have two dataframes with the structure below, and I have to join them.
> The scenario: the join column is a string in one dataframe and an array of
> strings in the other, so we have to inner join the two dataframes and keep
> the data where the string value is present in the array-of-strings value of
> the other dataframe.
>
>
> df1 = spark.sql("""
>     SELECT
>         mr.id as mr_id,
>         pv.id as pv_id,
>         array(mr.id, pv.id) as combined_id
>     FROM
>         table1 mr
>         INNER JOIN table2 pv ON pv.id = mr.recordid
>        where
>         pv.id = '35122806-4cd2-4916-a149-24ea55c2dc36'
>         or pv.id = 'a5f03625-6cc5-49df-95eb-df741fe9139b'
> """)
>
> # df1.display()
>
> # Your second query
> df2 = spark.sql("""
>     SELECT
>         id
>     FROM
>         table2
>     WHERE
>         id = '35122806-4cd2-4916-a149-24ea55c2dc36'
>
> """)
>
>
>
> Result data:
> 35122806-4cd2-4916-a149-24ea55c2dc36 only, because this record alone is
> common between the string value and the array-of-strings values.
>
> Can you share a sample snippet showing how we can do the join across these
> two different datatypes in the dataframes?
>
> If any clarification is needed, please feel free to ask.
>
> Thanks
>
>
