Thanks Mich, I have tried this solution, but I want all the columns from the dataframe df_1. If I explode df_1, I get only the data column, whereas the result should contain all the columns from df_1, with distinct rows, like below.
Results in

*df:*
+-------+
|column1|
+-------+
|      a|
|      b|
|      d|
+-------+

*df_1:*
+---+---------+-----------+
| id|     data|      field|
+---+---------+-----------+
|  1|[a, b, c]|['company']|
|  3|[b, c, d]|  ['hello']|
|  4|[e, f, s]|  ['hello']|
+---+---------+-----------+

*Result:*
+---+---------+-----------+
| id|     data|      field|
+---+---------+-----------+
|  1|[a, b, c]|['company']|
|  3|[b, c, d]|  ['hello']|
+---+---------+-----------+

Explanation: ids 1 and 3 are returned because a or b is present in records 1
and 3, so the join returns those rows, distinct.

I would like to get a result like the above: even if a matching element
occurs more than once in the data column, I need the distinct rows with
respect to the id field. But when I try array_contains, it returns
duplicate rows, since the data column can contain multiple matching
elements.

If you need more clarification, please let me know.

Thanks,

On Tue, May 14, 2024 at 6:12 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> You can use a combination of explode and distinct before joining.
>
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import explode
>
> # Create a SparkSession
> spark = SparkSession.builder \
>     .appName("JoinExample") \
>     .getOrCreate()
>
> sc = spark.sparkContext
> # Set the log level to ERROR to reduce verbosity
> sc.setLogLevel("ERROR")
>
> # Create DataFrame df
> data = [
>     ["a"],
>     ["b"],
>     ["d"],
> ]
> column_names = ["column1"]
> df = spark.createDataFrame(data, column_names)
> print("df:")
> df.show()
>
> # Create DataFrame df_1
> df_1 = spark.createDataFrame([(["a", "b", "c"],), ([],)], ['data'])
> print("df_1:")
> df_1.show()
>
> # Explode the array column in df_1
> exploded_df_1 = df_1.select(explode("data").alias("data"))
>
> # Join with df using the exploded column
> final_df = exploded_df_1.join(df, exploded_df_1.data == df.column1)
>
> # Distinct to ensure only unique rows are returned from df_1
> final_df = final_df.select("data").distinct()
>
> print("Result:")
> final_df.show()
>
> Results in
>
> df:
> +-------+
> |column1|
> +-------+
> |      a|
> |      b|
> |      d|
> +-------+
>
> df_1:
> +---------+
> |     data|
> +---------+
> |[a, b, c]|
> |       []|
> +---------+
>
> Result:
> +----+
> |data|
> +----+
> |   a|
> |   b|
> +----+
>
> HTH
>
> Mich Talebzadeh,
>
> Technologist | Architect | Data Engineer | Generative AI | FinCrime
> London
> United Kingdom
>
>
> view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
> https://en.everybodywiki.com/Mich_Talebzadeh
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed. It is essential to note
> that, as with any advice: "one test result is worth one-thousand expert
> opinions" (Wernher von Braun
> <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
>
>
> On Tue, 14 May 2024 at 13:19, Karthick Nk <kcekarth...@gmail.com> wrote:
>
>> Hi All,
>>
>> Does anyone have any idea or suggestion of an alternate way to achieve
>> this scenario?
>>
>> Thanks.
>>
>> On Sat, May 11, 2024 at 6:55 AM Damien Hawes <marley.ha...@gmail.com>
>> wrote:
>>
>>> Right now, with the structure of your data, it isn't possible.
>>>
>>> The rows aren't duplicates of each other. "a" and "b" both exist in the
>>> array, so Spark is correctly performing the join. It looks like you
>>> need to find another way to model this data to get what you want to
>>> achieve.
>>>
>>> Are the values of "a" and "b" related to each other in any way?
>>>
>>> - Damien
>>>
>>> On Fri, 10 May 2024 at 18:08, Karthick Nk <kcekarth...@gmail.com> wrote:
>>>
>>>> Hi Mich,
>>>>
>>>> Thanks for the solution, but I am getting duplicate results when using
>>>> array_contains. I have explained the scenario below; could you help me
>>>> with how to achieve this? I have tried different ways but could not
>>>> achieve it.
>>>> For example:
>>>>
>>>> data = [
>>>>     ["a"],
>>>>     ["b"],
>>>>     ["d"],
>>>> ]
>>>> column_names = ["column1"]
>>>> df = spark.createDataFrame(data, column_names)
>>>> df.display()
>>>>
>>>> [image: image.png]
>>>>
>>>> df_1 = spark.createDataFrame([(["a", "b", "c"],), ([],)], ['data'])
>>>> df_1.display()
>>>>
>>>> [image: image.png]
>>>>
>>>> final_df = df_1.join(df, expr("array_contains(data, column1)"))
>>>> final_df.display()
>>>>
>>>> Result:
>>>> [image: image.png]
>>>>
>>>> But I need the result like below:
>>>>
>>>> [image: image.png]
>>>>
>>>> Why? Because in df_1 I have only two records, and only the first
>>>> record has a matching value. But since both records from df, i.e. *a*
>>>> and *b*, are present in that first record, the join returns two rows.
>>>> My expectation is that if any of the values from df is present in a
>>>> df_1 record, that record should be returned only once.
>>>>
>>>> Note:
>>>> 1. Here we are able to filter the duplicate records by applying
>>>> distinct on the ID field in the resulting df, but I don't think that
>>>> is an effective way; I would rather handle it in the array_contains
>>>> step itself.
>>>>
>>>> Thanks.
>>>>
>>>> On Fri, Mar 1, 2024 at 4:11 AM Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> This is what you want: how to join two DFs with a string column in
>>>>> one and an array of strings in the other, keeping only rows where
>>>>> the string is present in the array.
>>>>> from pyspark.sql import SparkSession
>>>>> from pyspark.sql import Row
>>>>> from pyspark.sql.functions import expr
>>>>>
>>>>> spark = SparkSession.builder.appName("joins").getOrCreate()
>>>>>
>>>>> # this one has a column combined_id as an array of integers
>>>>> data1 = [Row(combined_id=[1, 2, 3]), Row(combined_id=[4, 5, 6])]
>>>>> # this one has column mr_id with single integers
>>>>> data2 = [Row(mr_id=2), Row(mr_id=5)]
>>>>>
>>>>> df1 = spark.createDataFrame(data1)
>>>>> df2 = spark.createDataFrame(data2)
>>>>>
>>>>> df1.printSchema()
>>>>> df2.printSchema()
>>>>>
>>>>> # Perform the join with array_contains. It takes two arguments: an
>>>>> # array and a value. It returns True if the value exists as an
>>>>> # element within the array, otherwise False.
>>>>> joined_df = df1.join(df2, expr("array_contains(combined_id, mr_id)"))
>>>>>
>>>>> # Show the result
>>>>> joined_df.show()
>>>>>
>>>>> root
>>>>>  |-- combined_id: array (nullable = true)
>>>>>  |    |-- element: long (containsNull = true)
>>>>>
>>>>> root
>>>>>  |-- mr_id: long (nullable = true)
>>>>>
>>>>> +-----------+-----+
>>>>> |combined_id|mr_id|
>>>>> +-----------+-----+
>>>>> |  [1, 2, 3]|    2|
>>>>> |  [4, 5, 6]|    5|
>>>>> +-----------+-----+
>>>>>
>>>>> HTH
>>>>>
>>>>> Mich Talebzadeh,
>>>>> Dad | Technologist | Solutions Architect | Engineer
>>>>> London
>>>>> United Kingdom
>>>>>
>>>>> view my Linkedin profile
>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>
>>>>> https://en.everybodywiki.com/Mich_Talebzadeh
>>>>> On Thu, 29 Feb 2024 at 20:50, Karthick Nk <kcekarth...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> I have two dataframes with the structure below, and I have to join
>>>>>> them. The scenario is that the join column is a string in one
>>>>>> dataframe and an array of strings in the other, so we have to inner
>>>>>> join the two dataframes and return the data if the string value is
>>>>>> present in any of the array-of-string values in the other dataframe.
>>>>>>
>>>>>> df1 = spark.sql("""
>>>>>> SELECT
>>>>>>     mr.id as mr_id,
>>>>>>     pv.id as pv_id,
>>>>>>     array(mr.id, pv.id) as combined_id
>>>>>> FROM
>>>>>>     table1 mr
>>>>>>     INNER JOIN table2 pv ON pv.id = mr.recordid
>>>>>> WHERE
>>>>>>     pv.id = '35122806-4cd2-4916-a149-24ea55c2dc36'
>>>>>>     OR pv.id = 'a5f03625-6cc5-49df-95eb-df741fe9139b'
>>>>>> """)
>>>>>>
>>>>>> # df1.display()
>>>>>>
>>>>>> # Your second query
>>>>>> df2 = spark.sql("""
>>>>>> SELECT
>>>>>>     id
>>>>>> FROM
>>>>>>     table2
>>>>>> WHERE
>>>>>>     id = '35122806-4cd2-4916-a149-24ea55c2dc36'
>>>>>> """)
>>>>>>
>>>>>> Result data:
>>>>>> 35122806-4cd2-4916-a149-24ea55c2dc36 only, because this record alone
>>>>>> is common between the string and the array-of-string values.
>>>>>>
>>>>>> Can you share a sample snippet of how we can do the join for these
>>>>>> two different datatypes in the dataframes?
>>>>>>
>>>>>> If any clarification is needed, please feel free to ask.
>>>>>>
>>>>>> Thanks