Re: how spark handle the abnormal values

2022-05-02 Thread wilson
Thanks, Mich. But in my experience, many original data sources include abnormal values. I have already used rlike and filter to implement the data cleaning, as described in this write-up: https://bigcount.xyz/calculate-urban-words-vote-in-spark.html What I am surprised by is that Spark does the string to

Re: how spark handle the abnormal values

2022-05-02 Thread Mich Talebzadeh
agg and avg are numeric functions that operate on numeric values. Why is the column number defined as String type? Do you perform data cleaning beforehand, by any chance? It is good practice. Alternatively, you can use the rlike() function to filter rows that have numeric values in a column..
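
A minimal sketch of the rlike() filtering suggested in this reply, assuming a DataFrame shaped like the one in wilson's test (string columns fruit and number). The pattern `^[0-9]+$` is an assumption that valid values are non-negative integers, and the local master and app name are illustrative only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, col}

// Assumption: a local session is enough for this illustration.
val spark = SparkSession.builder().master("local[*]").appName("rlike-filter").getOrCreate()
import spark.implicits._

// Sample data mirroring the thread: "number" is a string column with one bad value.
val df = Seq(("apple", "2"), ("orange", "5"), ("cherry", "7"), ("plum", "xyz"))
  .toDF("fruit", "number")

// Keep only rows whose "number" consists entirely of digits (assumed pattern).
val cleaned = df.filter(col("number").rlike("^[0-9]+$"))

// Cast explicitly so the aggregation runs on a numeric type.
val mean = cleaned.select(avg(col("number").cast("double"))).first().getDouble(0)
```

If decimals or signed values are legitimate in the source data, the pattern would need to be widened accordingly; filtering first makes the cleaning rule explicit rather than relying on how the cast treats bad input.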

Re: how spark handle the abnormal values

2022-05-01 Thread Artemis User
Your test result just gave the verdict, so #2 is the answer: Spark ignores those non-numeric rows completely when aggregating the average.

On 5/1/22 8:20 PM, wilson wrote:
I did a small test as follows.
scala> df.printSchema()
root
 |-- fruit: string (nullable = true)
 |-- number: string
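
The behaviour Artemis User describes can be checked with a short sketch. It assumes Spark 3.x semantics with ANSI mode disabled, where a string that cannot be parsed as a number casts to null and avg() skips nulls. The explicit cast stands in for the conversion Spark applies when avg() is given a string column, and the column name number_d is hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, col}

// Assumption: a local session is enough for this illustration.
val spark = SparkSession.builder().master("local[*]").appName("avg-nulls").getOrCreate()
// Assumption: ANSI mode off, so an invalid cast yields null instead of raising an error.
spark.conf.set("spark.sql.ansi.enabled", "false")
import spark.implicits._

// Same data as the test in the thread.
val df = Seq(("apple", "2"), ("orange", "5"), ("cherry", "7"), ("plum", "xyz"))
  .toDF("fruit", "number")

// "xyz" cannot be parsed as a double, so the cast produces null for that row.
val casted = df.withColumn("number_d", col("number").cast("double"))
val nullRows = casted.filter(col("number_d").isNull).count()

// avg() skips nulls: the result is (2 + 5 + 7) / 3, not the sum divided by 4.
val mean = casted.agg(avg(col("number_d"))).first().getDouble(0)
```

With ANSI mode enabled, the same cast would be expected to fail on "xyz" rather than return null, so the silent skipping depends on that setting.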

Re: how spark handle the abnormal values

2022-05-01 Thread wilson
I did a small test as follows.

scala> df.printSchema()
root
 |-- fruit: string (nullable = true)
 |-- number: string (nullable = true)

scala> df.show()
+------+------+
| fruit|number|
+------+------+
| apple|     2|
|orange|     5|
|cherry|     7|
|  plum|   xyz|
+------+------+

scala>

how spark handle the abnormal values

2022-05-01 Thread wilson
|65.18445431897453|
+-----------------+

So how does Spark handle abnormal values in a numeric column? Does it just ignore them? Thank you.