
My dataset has abnormal values in a column whose normal values are numeric. I can select them with:

scala> df.select("up_votes").filter($"up_votes".rlike(regex)).show()
|       up_votes|
|              <|
|              <|
|            fx-|
|             OP|
|              \|
|              v|
|             :O|
|              y|
|             :O|
|          ncurs|
|              )|
|              )|
|              X|
|             -1|
|':>?< ./ '[]\~`|
|           enc3|
|              X|
|              -|
|              X|
|              N|
only showing top 20 rows

Even though there are abnormal values in the column, Spark can still aggregate it, as you can see below:

scala> df.agg(avg("up_votes")).show()
|    avg(up_votes)|

So how does Spark handle abnormal values in a numeric column? Does it just ignore them?

Thank you.
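
For what it is worth, here is a minimal sketch of what I believe is happening, not a definitive answer. It assumes the up_votes column was loaded as a string, and that ANSI mode is off (the default in Spark 2.x and 3.x). Under those assumptions avg casts the strings to double, values that cannot be parsed become null, and avg skips nulls. The session settings, the sample values, and the names sample, casted, viaCast, direct and numericRows are made up for illustration and are not from my real dataset.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, col, count}

// Hypothetical local session; inside spark-shell, `spark` already exists.
// ANSI mode is switched off explicitly (assumption: with ANSI on, bad casts raise errors instead).
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("avg-on-strings")
  .config("spark.sql.ansi.enabled", "false")
  .getOrCreate()
import spark.implicits._

// Made-up sample: a string column mixing numeric text and junk.
val sample = Seq("10", "20", "OP", ":O", "-1").toDF("up_votes")

// An explicit cast shows which rows turn into null.
val casted = sample.withColumn("as_double", col("up_votes").cast("double"))
casted.show()

// avg ignores nulls, so only 10, 20 and -1 contribute to the result.
val viaCast = casted.agg(avg("as_double")).first().getDouble(0)

// avg directly on the string column applies the same cast implicitly.
val direct = sample.agg(avg("up_votes")).first().getDouble(0)

// count(column) counts non-null values only, i.e. the rows that parsed as numbers.
val numericRows = casted.agg(count("as_double")).first().getLong(0)
```

If this is right, comparing count("up_votes") with the count of the casted column tells you how many rows were silently dropped from the average, which seems worth checking before trusting the aggregate.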
