Hello,
My dataset has abnormal (non-numeric) values in a column whose normal
values are numeric. I can select them like this:
scala> df.select("up_votes").filter($"up_votes".rlike(regex)).show()
+---------------+
| up_votes|
+---------------+
| <|
| <|
| fx-|
| OP|
| \|
| v|
| :O|
| y|
| :O|
| ncurs|
| )|
| )|
| X|
| -1|
|':>?< ./ '[]\~`|
| enc3|
| X|
| -|
| X|
| N|
+---------------+
only showing top 20 rows
Even though there are abnormal values in the column, Spark can still
aggregate it, as you can see below:
scala> df.agg(avg("up_votes")).show()
+-----------------+
| avg(up_votes)|
+-----------------+
|65.18445431897453|
+-----------------+
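
My guess (a sketch, assuming Spark's standard cast semantics; the column
name is the same as above) is that avg() implicitly casts the string
column to double, strings that don't parse become null, and avg skips
nulls, so the average is computed over the numeric rows only. The cast
can be inspected directly:

scala> df.select($"up_votes".cast("double").as("up_votes_num"))
         .filter($"up_votes_num".isNull)
         .count()

Rows like "fx-" or ":O" should fail the cast and come back as null, so
they would be excluded from both the sum and the count that avg() uses.
Is that right?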
So how does Spark handle the abnormal values in a numeric column? Does
it just ignore them?
Thank you.
---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org