Hi All,

I have tried to get the same result with PySpark and with a SQL query over temp views. With SQL I could achieve it, but I need to do it in the PySpark code itself. Could you help with this?
incoming_data = [["a"], ["b"], ["d"]]
column_names = ["column1"]
df = spark.createDataFrame(incoming_data, column_names)

view_data_df = spark.createDataFrame([(["a", "b", "c"], 1), (["f"], 2)], ["data_col", "id"])

df.createOrReplaceTempView("id_table")
view_data_df.createOrReplaceTempView("view_data")

%sql
select * from view_data
where exists (select 1 from id_table where array_contains(data_col, column1))

Result:
+---------+---+
| data_col| id|
+---------+---+
|[a, b, c]|  1|
+---------+---+

I need the PySpark equivalent of this SQL query to achieve the same result. One solution I have tried is the explode approach in the quoted thread below, but there I am doing an explode and then a distinct again. I need to perform the action without doing this, since it will impact performance for huge data.

Thanks,

On Thu, May 16, 2024 at 8:33 AM Karthick Nk <kcekarth...@gmail.com> wrote:

> Thanks Mich,
>
> I have tried this solution, but I want all the columns from the dataframe
> df_1. If I explode df_1, I get only the data column, whereas the result
> should contain all the columns of df_1, with distinct rows, like below.
>
> df:
> +-------+
> |column1|
> +-------+
> |      a|
> |      b|
> |      d|
> +-------+
>
> df_1:
> +---+---------+---------+
> | id|     data|    field|
> +---+---------+---------+
> |  1|[a, b, c]|[company]|
> |  3|[b, c, d]|  [hello]|
> |  4|[e, f, s]|  [hello]|
> +---+---------+---------+
>
> Result:
> +---+---------+---------+
> | id|     data|    field|
> +---+---------+---------+
> |  1|[a, b, c]|[company]|
> |  3|[b, c, d]|  [hello]|
> +---+---------+---------+
>
> Explanation: ids 1 and 3 are returned because a or b is present in both
> records 1 and 3, so the join should return one distinct row for each.
>
> Here I would like to get a result like the above: even if a matched
> element occurs more than once in the data column, I need distinct rows
> with respect to the id field. But when I use array_contains, it returns
> duplicate rows, since the data column has multiple matching occurrences.
>
> If you need more clarification, please let me know.
>
> Thanks,
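A possible PySpark equivalent of the EXISTS query above, with no explode and no distinct, is a left semi join: it keeps every column of the left DataFrame and returns each left row at most once when any element matches. A minimal sketch, assuming the df and view_data_df defined above:

from pyspark.sql.functions import expr

# Left semi join keeps each view_data_df row at most once when ANY
# value of df.column1 is contained in data_col -- the same semantics
# as the SQL EXISTS subquery, with no explode and no distinct needed.
result = view_data_df.join(df, expr("array_contains(data_col, column1)"), "left_semi")
result.show()

# Expected output:
# +---------+---+
# | data_col| id|
# +---------+---+
# |[a, b, c]|  1|
# +---------+---+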
> On Tue, May 14, 2024 at 6:12 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
>> You can use a combination of explode and distinct before joining.
>>
>> from pyspark.sql import SparkSession
>> from pyspark.sql.functions import explode
>>
>> # Create a SparkSession
>> spark = SparkSession.builder \
>>     .appName("JoinExample") \
>>     .getOrCreate()
>>
>> sc = spark.sparkContext
>> # Set the log level to ERROR to reduce verbosity
>> sc.setLogLevel("ERROR")
>>
>> # Create DataFrame df
>> data = [
>>     ["a"],
>>     ["b"],
>>     ["d"],
>> ]
>> column_names = ["column1"]
>> df = spark.createDataFrame(data, column_names)
>> print("df:")
>> df.show()
>>
>> # Create DataFrame df_1
>> df_1 = spark.createDataFrame([(["a", "b", "c"],), ([],)], ['data'])
>> print("df_1:")
>> df_1.show()
>>
>> # Explode the array column in df_1
>> exploded_df_1 = df_1.select(explode("data").alias("data"))
>>
>> # Join with df using the exploded column
>> final_df = exploded_df_1.join(df, exploded_df_1.data == df.column1)
>>
>> # Distinct to ensure only unique rows are returned from df_1
>> final_df = final_df.select("data").distinct()
>>
>> print("Result:")
>> final_df.show()
>>
>> Results in
>>
>> df:
>> +-------+
>> |column1|
>> +-------+
>> |      a|
>> |      b|
>> |      d|
>> +-------+
>>
>> df_1:
>> +---------+
>> |     data|
>> +---------+
>> |[a, b, c]|
>> |       []|
>> +---------+
>>
>> Result:
>> +----+
>> |data|
>> +----+
>> |   a|
>> |   b|
>> +----+
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Technologist | Architect | Data Engineer | Generative AI | FinCrime
>> London, United Kingdom
>>
>> https://en.everybodywiki.com/Mich_Talebzadeh
>>
>> Disclaimer: The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed. As with any advice, "one
>> test result is worth one-thousand expert opinions" (Wernher von Braun
>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
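If all columns of df_1 must survive (the requirement raised above) while still using this explode approach, one variant is to explode only a key projection, find the matching keys with a semi join, and join back. A sketch, assuming df_1 carries an id key plus data and field columns as in the example above:

from pyspark.sql import functions as F

# Explode just (id, data) to test each element for a match...
matched_ids = (
    df_1.select("id", F.explode("data").alias("elem"))
        .join(df, F.col("elem") == F.col("column1"), "left_semi")
        .select("id")
        .distinct()
)

# ...then join back to df_1 so every original column survives,
# with one row per matching id
result = df_1.join(matched_ids, "id")
result.show()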
>> On Tue, 14 May 2024 at 13:19, Karthick Nk <kcekarth...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> Does anyone have any idea or suggestion of an alternate way to achieve
>>> this scenario?
>>>
>>> Thanks.
>>>
>>> On Sat, May 11, 2024 at 6:55 AM Damien Hawes <marley.ha...@gmail.com> wrote:
>>>
>>>> Right now, with the structure of your data, it isn't possible.
>>>>
>>>> The rows aren't duplicates of each other. "a" and "b" both exist in the
>>>> array, so Spark is correctly performing the join. It looks like you need
>>>> to find another way to model this data to get what you want to achieve.
>>>>
>>>> Are the values of "a" and "b" related to each other in any way?
>>>>
>>>> - Damien
>>>>
>>>> On Fri, 10 May 2024 at 18:08, Karthick Nk <kcekarth...@gmail.com> wrote:
>>>>
>>>>> Hi Mich,
>>>>>
>>>>> Thanks for the solution, but I am getting duplicate results when using
>>>>> array_contains. I have explained the scenario below; could you help me
>>>>> with how we can achieve this? I have tried different ways but could not
>>>>> achieve it.
>>>>>
>>>>> For example:
>>>>>
>>>>> data = [
>>>>>     ["a"],
>>>>>     ["b"],
>>>>>     ["d"],
>>>>> ]
>>>>> column_names = ["column1"]
>>>>> df = spark.createDataFrame(data, column_names)
>>>>> df.display()
>>>>>
>>>>> +-------+
>>>>> |column1|
>>>>> +-------+
>>>>> |      a|
>>>>> |      b|
>>>>> |      d|
>>>>> +-------+
>>>>>
>>>>> df_1 = spark.createDataFrame([(["a", "b", "c"],), ([],)], ['data'])
>>>>> df_1.display()
>>>>>
>>>>> +---------+
>>>>> |     data|
>>>>> +---------+
>>>>> |[a, b, c]|
>>>>> |       []|
>>>>> +---------+
>>>>>
>>>>> final_df = df_1.join(df, expr("array_contains(data, column1)"))
>>>>> final_df.display()
>>>>>
>>>>> Result:
>>>>> +---------+-------+
>>>>> |     data|column1|
>>>>> +---------+-------+
>>>>> |[a, b, c]|      a|
>>>>> |[a, b, c]|      b|
>>>>> +---------+-------+
>>>>>
>>>>> But I need the result like below:
>>>>> +---------+
>>>>> |     data|
>>>>> +---------+
>>>>> |[a, b, c]|
>>>>> +---------+
>>>>>
>>>>> Why? Because in df_1 I have only two records, and only the first record
>>>>> has a matching value. But both a and b from df are present in that first
>>>>> record, so the join returns two rows. My expectation is that if any of
>>>>> the records from df is present in a df_1 record, it should return that
>>>>> df_1 record only once.
>>>>>
>>>>> Note: here we are able to filter the duplicate records by using distinct
>>>>> on the ID field in the resultant df, but I don't think that is an
>>>>> effective way; rather I am thinking of handling it in the array_contains
>>>>> step itself.
>>>>>
>>>>> Thanks.
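For this exact example, the left semi join sketched near the top of the thread applies directly and avoids the distinct step; a sketch, assuming the df and df_1 just defined:

from pyspark.sql.functions import expr

# Each df_1 row is returned at most once if any df.column1 value is an
# element of data; the empty-array row can never match
final_df = df_1.join(df, expr("array_contains(data, column1)"), "left_semi")
final_df.show()

# +---------+
# |     data|
# +---------+
# |[a, b, c]|
# +---------+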
>>>>> On Fri, Mar 1, 2024 at 4:11 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>
>>>>>> This is what you want: how to join two DFs with a string column in one
>>>>>> and an array of strings in the other, keeping only rows where the
>>>>>> string is present in the array.
>>>>>>
>>>>>> from pyspark.sql import SparkSession
>>>>>> from pyspark.sql import Row
>>>>>> from pyspark.sql.functions import expr
>>>>>>
>>>>>> spark = SparkSession.builder.appName("joins").getOrCreate()
>>>>>>
>>>>>> # this one has a column combined_id as an array of integers
>>>>>> data1 = [Row(combined_id=[1, 2, 3]), Row(combined_id=[4, 5, 6])]
>>>>>> # this one has a column mr_id with single integers
>>>>>> data2 = [Row(mr_id=2), Row(mr_id=5)]
>>>>>>
>>>>>> df1 = spark.createDataFrame(data1)
>>>>>> df2 = spark.createDataFrame(data2)
>>>>>>
>>>>>> df1.printSchema()
>>>>>> df2.printSchema()
>>>>>>
>>>>>> # Perform the join with array_contains. It takes two arguments: an
>>>>>> # array and a value. It returns True if the value exists as an element
>>>>>> # within the array, otherwise False.
>>>>>> joined_df = df1.join(df2, expr("array_contains(combined_id, mr_id)"))
>>>>>>
>>>>>> # Show the result
>>>>>> joined_df.show()
>>>>>>
>>>>>> root
>>>>>>  |-- combined_id: array (nullable = true)
>>>>>>  |    |-- element: long (containsNull = true)
>>>>>>
>>>>>> root
>>>>>>  |-- mr_id: long (nullable = true)
>>>>>>
>>>>>> +-----------+-----+
>>>>>> |combined_id|mr_id|
>>>>>> +-----------+-----+
>>>>>> |  [1, 2, 3]|    2|
>>>>>> |  [4, 5, 6]|    5|
>>>>>> +-----------+-----+
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> Mich Talebzadeh,
>>>>>> Dad | Technologist | Solutions Architect | Engineer
>>>>>> London, United Kingdom
>>>>>>
>>>>>> On Thu, 29 Feb 2024 at 20:50, Karthick Nk <kcekarth...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> I have two dataframes with the structure below, and I have to join
>>>>>>> them. The scenario is that the join column is a string in one
>>>>>>> dataframe and an array of strings in the other, so we have to inner
>>>>>>> join the two DFs and get the data if the string value is present in
>>>>>>> any of the array-of-string values in the other dataframe.
>>>>>>>
>>>>>>> df1 = spark.sql("""
>>>>>>>     SELECT
>>>>>>>         mr.id as mr_id,
>>>>>>>         pv.id as pv_id,
>>>>>>>         array(mr.id, pv.id) as combined_id
>>>>>>>     FROM
>>>>>>>         table1 mr
>>>>>>>         INNER JOIN table2 pv ON pv.id = mr.recordid
>>>>>>>     WHERE
>>>>>>>         pv.id = '35122806-4cd2-4916-a149-24ea55c2dc36'
>>>>>>>         OR pv.id = 'a5f03625-6cc5-49df-95eb-df741fe9139b'
>>>>>>> """)
>>>>>>>
>>>>>>> # df1.display()
>>>>>>>
>>>>>>> # Your second query
>>>>>>> df2 = spark.sql("""
>>>>>>>     SELECT
>>>>>>>         id
>>>>>>>     FROM
>>>>>>>         table2
>>>>>>>     WHERE
>>>>>>>         id = '35122806-4cd2-4916-a149-24ea55c2dc36'
>>>>>>> """)
>>>>>>>
>>>>>>> Result data:
>>>>>>> 35122806-4cd2-4916-a149-24ea55c2dc36 only, because this record alone
>>>>>>> is common between the string and the array-of-string values.
>>>>>>>
>>>>>>> Can you share a sample snippet of how we can do the join for these
>>>>>>> two different datatypes in the dataframes?
>>>>>>>
>>>>>>> If any clarification is needed, please feel free to ask.
>>>>>>>
>>>>>>> Thanks
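For this original question, if only the matching ids are needed (each at most once, however many combined_id arrays contain them), a semi join from the df2 side is one option. A minimal sketch, assuming df1 and df2 as produced by the spark.sql queries above:

from pyspark.sql.functions import expr

# Keep each df2.id at most once if it appears in ANY df1.combined_id array
matching = df2.join(df1, expr("array_contains(combined_id, id)"), "left_semi")
matching.show()

# Expected: the single row 35122806-4cd2-4916-a149-24ea55c2dc36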