Thanks Mich.
But in my experience, many original data sources include abnormal values.
I have already used rlike and filter to implement data cleaning, as in
this writing of mine:
https://bigcount.xyz/calculate-urban-words-vote-in-spark.html
What surprises me is that Spark does the string-to-number conversion automatically.
Agg and avg are numeric functions dealing with numeric values. Why is
the column "number" defined as String type?
Do you perform data cleaning beforehand by any chance? It is good practice.
Alternatively, you can use the rlike() function to filter rows that have
numeric values in a column.
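In Spark Scala this would be something like df.filter(col("number").rlike("^[0-9]+$")). Below is a plain-Python sketch of the same filtering idea (no Spark required; the sample rows and the digits-only regex are illustrative assumptions):

```python
import re

# Sample rows mirroring the test DataFrame below: (fruit, number-as-string).
rows = [("apple", "2"), ("orange", "5"), ("cherry", "7"), ("plum", "xyz")]

# Equivalent of col("number").rlike("^[0-9]+$"):
# keep only rows whose "number" value is entirely digits.
numeric_rows = [r for r in rows if re.fullmatch(r"[0-9]+", r[1])]

print(numeric_rows)  # the ("plum", "xyz") row is dropped
```

Note that rlike() uses Java regex and matches anywhere in the string, so the ^ and $ anchors matter if you want whole-value matches.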
Your test result just gave the verdict, so #2 is the answer: Spark
ignores those non-numeric rows completely when aggregating the average.
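This follows from Spark implicitly casting the string column to a numeric type for avg(): values that fail the cast become null, and avg() skips nulls. A plain-Python sketch of those semantics (an illustration of the behavior, not Spark's actual implementation):

```python
def try_cast(s):
    """Mimic Spark's string-to-double cast: None on failure (like SQL null)."""
    try:
        return float(s)
    except ValueError:
        return None

values = ["2", "5", "7", "xyz"]          # the "number" column from the test
casted = [try_cast(v) for v in values]   # [2.0, 5.0, 7.0, None]
non_null = [v for v in casted if v is not None]

# avg() ignores nulls: sum of non-null values / count of non-null values.
average = sum(non_null) / len(non_null)
print(average)  # "xyz" contributes nothing to the average
```

So avg("number") over the four rows behaves as avg(2, 5, 7), not as an average over four rows.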
On 5/1/22 8:20 PM, wilson wrote:
I did a small test as follows.
scala> df.printSchema()
root
|-- fruit: string (nullable = true)
|-- number: string (nullable = true)
scala> df.show()
+------+------+
| fruit|number|
+------+------+
| apple|     2|
|orange|     5|
|cherry|     7|
|  plum|   xyz|
+------+------+
scala>
... (aggregation command and output header truncated) ...
|65.18445431897453|
+-----------------+
So how does Spark handle abnormal values in a numeric column? Does it
just ignore them?
Thank you.
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org