date:20240515

Re: pyspark dataframe join with two different data type

2024-05-15 Thread Karthick Nk

Thanks Mich,

I have tried this solution, but I want all the columns from the dataframe
df_1. If I explode df_1, I get only the data column, but the result should
contain all the columns from df_1, with distinct rows, as shown below.

Results in

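*df:*
+-------+
|column1|
+-------+
|      a|
|      b|
|      d|
+-------+

*df_1:*
+---+---------+-----------+
| id|     data|      field|
+---+---------+-----------+
|  1|[a, b, c]|['company']|
|  3|[b, c, d]|  ['hello']|
|  4|[e, f, s]|  ['hello']|
+---+---------+-----------+

*Result:*
+---+---------+-----------+
| id|     data|      field|
+---+---------+-----------+
|  1|[a, b, c]|['company']|
|  3|[b, c, d]|  ['hello']|
+---+---------+-----------+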

Explanation: ids 1 and 3 are returned because a or b is present in both
records 1 and 3, so the join should return each of them only once.


I would like to get the result shown above: even if several elements of
the data column match, I need distinct rows with respect to the id field.
But when I use array_contains, it returns duplicate rows, because the data
column has multiple matching occurrences.

If you need more clarification, please let me know.

Thanks,






On Tue, May 14, 2024 at 6:12 PM Mich Talebzadeh wrote:

> You can use a combination of explode and distinct before joining.
>
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import explode
>
> # Create a SparkSession
> spark = SparkSession.builder \
>     .appName("JoinExample") \
>     .getOrCreate()
>
> sc = spark.sparkContext
> # Set the log level to ERROR to reduce verbosity
> sc.setLogLevel("ERROR")
>
> # Create DataFrame df
> data = [
>     ["a"],
>     ["b"],
>     ["d"],
> ]
> column_names = ["column1"]
> df = spark.createDataFrame(data, column_names)
> print("df:")
> df.show()
>
> # Create DataFrame df_1
> df_1 = spark.createDataFrame([(["a", "b", "c"],), ([],)], ['data'])
> print("df_1:")
> df_1.show()
>
> # Explode the array column in df_1
> exploded_df_1 = df_1.select(explode("data").alias("data"))
>
> # Join with df using the exploded column
> final_df = exploded_df_1.join(df, exploded_df_1.data == df.column1)
>
> # Distinct to ensure only unique rows are returned from df_1
> final_df = final_df.select("data").distinct()
>
> print("Result:")
> final_df.show()
>
>
> Results in
>
> df:
> +-------+
> |column1|
> +-------+
> |      a|
> |      b|
> |      d|
> +-------+
>
> df_1:
> +---------+
> |     data|
> +---------+
> |[a, b, c]|
> |       []|
> +---------+
>
> Result:
> +----+
> |data|
> +----+
> |   a|
> |   b|
> +----+
>
> HTH
>
> Mich Talebzadeh,
>
> Technologist | Architect | Data Engineer | Generative AI | FinCrime
> London
> United Kingdom
>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed. It is essential to note
> that, as with any advice, "one test result is worth one-thousand
> expert opinions" (Wernher von Braun).
>
>
> On Tue, 14 May 2024 at 13:19, Karthick Nk wrote:
>
>> Hi All,
>>
>> Does anyone have an idea or suggestion for an alternative way to achieve
>> this?
>>
>> Thanks.
>>
>> On Sat, May 11, 2024 at 6:55 AM Damien Hawes wrote:
>>
>>> Right now, with the structure of your data, it isn't possible.
>>>
>>> The rows aren't duplicates of each other. "a" and "b" both exist in the
>>> array. So Spark is correctly performing the join. It looks like you need to
>>> find another way to model this data to get what you want to achieve.
>>>
>>> Are the values of "a" and "b" related to each other in any way?
>>>
>>> - Damien
>>>
>>> On Fri, 10 May 2024 at 18:08, Karthick Nk wrote:
>>>
>>>> Hi Mich,
>>>>
>>>> Thanks for the solution, but I am getting duplicate results when using
>>>> array_contains. I have explained the scenario below. Could you help me
>>>> with how to achieve this? I have tried different ways, but I could not
>>>> achieve it.
>>>>
>>>> For example:
>>>>
>>>> data = [
>>>>     ["a"],
>>>>     ["b"],
>>>>     ["d"],
>>>> ]
>>>> column_names = ["column1"]
>>>> df = spark.createDataFrame(data, column_names)
>>>> df.display()
>>>>
>>>> [image: image.png]
>>>>
>>>> df_1 = spark.createDataFrame([(["a", "b", "c"],), ([],)], ['data'])
>>>> df_1.display()
>>>> [image: image.png]
>>>>
>>>>
>>>> final_df = df_1.join(df, expr("array_contains(data, column1)"))
>>>> final_df.display()
>>>>
>>>> Result:
>>>> [image: image.png]
>>>>
>>>> But I need the result like below:
>>>>
>>>> [image: image.png]
>>>>
>>>> The reason:
>>>>
>>>> df_1 has only two records, and only the first one has matching values.
>>>> But because both records from df, i.e. *a, b*, are present in that first
>>>> record, the join returns two records. My expectation is that it returns
>>>> only one: if any record from df is present in a df_1 record, that df_1
>>>> record should be returned only once.

>>>

How to provide a Zstd "training mode" dictionary object

2024-05-15 Thread Saha, Daniel

Hi,

I understand that Zstd compression can optionally be provided a dictionary 
object to improve performance. See “training mode” here 
https://facebook.github.io/zstd/

Does Spark surface a way to provide this dictionary object when writing/reading 
data? What about for intermediate shuffle results?

Thanks,
Daniel

Query Regarding UDF Support in Spark Connect with Kubernetes as Cluster Manager

2024-05-15 Thread Nagatomi Yasukazu

Hi Spark Community,

I have a question regarding the support for User-Defined Functions (UDFs)
in Spark Connect, specifically when using Kubernetes as the Cluster Manager.

According to the Spark documentation, UDFs are supported by default for the
shell and in standalone applications with additional setup requirements.

However, it is not clear whether this support extends to scenarios where
Kubernetes is used as the Cluster Manager.

cf.
https://spark.apache.org/docs/latest/spark-connect-overview.html#:~:text=User%2DDefined%20Functions%20(UDFs)%20are%20supported%2C%20by%20default%20for%20the%20shell%20and%20in%20standalone%20applications%20with%20additional%20set%2Dup%20requirements

Could someone please clarify:

1. Are UDFs supported in Spark Connect when using Kubernetes as the Cluster
Manager?

2. If they are supported, are there any specific setup requirements or
limitations we should be aware of?

3. If UDFs are not currently supported with Kubernetes as the Cluster
Manager, are there any plans to include this support in future releases?

Your insights and guidance on this matter would be greatly appreciated.

Thank you in advance for your help!

Best regards,
Yasukazu
