Hello

My dataset has abnormal (non-numeric) values in a column that should contain only numeric values. I can select them with a regex:

scala> df.select("up_votes").filter($"up_votes".rlike(regex)).show()
+---------------+
|       up_votes|
+---------------+
|              <|
|              <|
|            fx-|
|             OP|
|              \|
|              v|
|             :O|
|              y|
|             :O|
|          ncurs|
|              )|
|              )|
|              X|
|             -1|
|':>?< ./ '[]\~`|
|           enc3|
|              X|
|              -|
|              X|
|              N|
+---------------+
only showing top 20 rows


Even though those abnormal values are in the column, Spark can still aggregate it, as you can see below.


scala> df.agg(avg("up_votes")).show()
+-----------------+
|    avg(up_votes)|
+-----------------+
|65.18445431897453|
+-----------------+

So how does Spark handle the abnormal values in a numeric column? Does it just ignore them?
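My guess, which I have not confirmed in the Spark source, is that avg() implicitly casts the string column to double, that values which fail the cast become null, and that the aggregate skips nulls. A minimal plain-Scala sketch of that assumed behavior (no Spark involved; the sample values are made up for illustration):

```scala
object CastAvgSketch {
  // Mimic Spark's string-to-double cast: a non-numeric string becomes None (i.e. null).
  def castToDouble(s: String): Option[Double] = s.toDoubleOption

  // Mimic avg() under the assumption that it drops the nulls before averaging.
  def avgIgnoringBad(values: Seq[String]): Double = {
    val valid = values.flatMap(castToDouble)
    valid.sum / valid.size
  }

  def main(args: Array[String]): Unit = {
    // "<" and ":O" fail the cast and are dropped; only 10, 20, 30 are averaged.
    println(avgIgnoringBad(Seq("10", "20", "<", ":O", "30"))) // prints 20.0
  }
}
```

If that assumption is right, it would explain why the average above is computed without any error: the abnormal rows simply contribute nothing to the sum or the count.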


Thank you.
