I did a small test as follows.
scala> df.printSchema()
root
|-- fruit: string (nullable = true)
|-- number: string (nullable = true)
scala> df.show()
+------+------+
| fruit|number|
+------+------+
| apple| 2|
|orange| 5|
|cherry| 7|
| plum| xyz|
+------+------+
scala> df.agg(avg("number")).show()
+-----------------+
| avg(number)|
+-----------------+
|4.666666666666667|
+-----------------+
As you can see, the "number" column is of string type, and it contains an
abnormal (non-numeric) value.
But Spark still handles the result well despite both issues. So I
guess:
1) Spark automatically casts the string values to a numeric type when
aggregating.
2) Spark automatically ignores the abnormal (non-castable) values when
computing the aggregate.
Am I right? Thank you.
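If my guesses are right, the computation would be equivalent to the following plain-Scala sketch. This is not Spark code, just my understanding of the semantics: each string is cast to a double, values that fail to cast become null (None here), and the average is taken over the valid values only.

```scala
// A plain-Scala sketch (not actual Spark internals) of what I think
// avg("number") does with a string column.
val numbers = Seq("2", "5", "7", "xyz")

// toDoubleOption (Scala 2.13+) returns None for non-numeric strings,
// analogous to Spark's cast producing NULL for "xyz".
val valid = numbers.flatMap(_.toDoubleOption)   // Seq(2.0, 5.0, 7.0)

// Average over the values that cast successfully: 14.0 / 3.
val average = valid.sum / valid.size
println(average)                                // 4.666666666666667
```

This reproduces exactly the 4.666666666666667 that avg("number") returned, which is (2 + 5 + 7) / 3 with "xyz" excluded.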
wilson
wilson wrote:
my dataset has abnormal values in the column whose normal values are
numeric. I can select them as:
---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org