Your test result just gave the verdict, so #2 is the answer: Spark ignores those non-numeric rows completely when computing the average.
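
To see the mechanism for yourself, here is a minimal sketch for spark-shell. It rebuilds the sample data from your test; the column name as_double and the value names df and casted are just illustrative, and it assumes default non-ANSI settings (spark.sql.ansi.enabled=false), under which a failed string-to-double cast yields null instead of raising an error.

```scala
// Assumes a spark-shell session, where `spark` is already defined.
import org.apache.spark.sql.functions.{avg, col, count, lit}
import spark.implicits._

// Rebuild the sample data from the test; "number" stays a string column.
val df = Seq(("apple", "2"), ("orange", "5"), ("cherry", "7"), ("plum", "xyz"))
  .toDF("fruit", "number")

// Make the string-to-double cast explicit (as_double is an illustrative name).
// Assumption: ANSI mode is off, so "xyz" becomes null rather than an error.
val casted = df.withColumn("as_double", col("number").cast("double"))
casted.show()

// avg() and count(column) skip nulls; count(lit(1)) counts every row.
casted.agg(avg("as_double"), count("as_double"), count(lit(1))).show()
```

If the second output reports fewer counted values than rows, the difference is the number of rows that could not be parsed and were left out of the average.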

On 5/1/22 8:20 PM, wilson wrote:
I did a small test as follows.

scala> df.printSchema()
root
 |-- fruit: string (nullable = true)
 |-- number: string (nullable = true)

+------+------+
| fruit|number|
+------+------+
| apple|     2|
|orange|     5|
|cherry|     7|
|  plum|   xyz|
+------+------+

scala> df.agg(avg("number")).show()
|      avg(number)|

As you can see, the "number" column is of string type, and it contains an abnormal value.

But in both cases Spark still handles the result pretty well. So I guess:

1) Spark automatically converts strings to numeric values when aggregating.
2) Spark automatically ignores those abnormal values when computing the aggregate.

Am I right? Thank you.


wilson wrote:
My dataset has abnormal values in a column whose normal values are numeric. I can select them as:

To unsubscribe e-mail:
