Hello

My dataset has abnormal (non-numeric) values in a column that should contain only numeric values. I can select them with a regex:

scala> df.select("up_votes").filter($"up_votes".rlike(regex)).show()
+---------------+
|       up_votes|
+---------------+
|              <|
|              <|
|            fx-|
|             OP|
|              \|
|              v|
|             :O|
|              y|
|             :O|
|          ncurs|
|              )|
|              )|
|              X|
|             -1|
|':>?< ./ '[]\~`|
|           enc3|
|              X|
|              -|
|              X|
|              N|
+---------------+
only showing top 20 rows


Even though those abnormal values are in the column, Spark can still aggregate it, as you can see below.


scala> df.agg(avg("up_votes")).show()
+-----------------+
|    avg(up_votes)|
+-----------------+
|65.18445431897453|
+-----------------+

So how does Spark handle the abnormal values in a numeric column? Does it just ignore them?
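My guess, which I have not confirmed in the Spark source, is that avg() implicitly casts the string column to double, that values which fail the cast become null, and that the aggregate skips nulls. A minimal plain-Scala sketch of that assumed behavior (no Spark involved; the sample values are made up for illustration):

```scala
object CastAvgSketch {
  // Mimic Spark's string-to-double cast: a non-numeric string becomes None (i.e. null).
  def castToDouble(s: String): Option[Double] = s.toDoubleOption

  // Mimic avg() under the assumption that it drops the nulls before averaging.
  def avgIgnoringBad(values: Seq[String]): Double = {
    val valid = values.flatMap(castToDouble)
    valid.sum / valid.size
  }

  def main(args: Array[String]): Unit = {
    // "<" and ":O" fail the cast and are dropped; only 10, 20, 30 are averaged.
    println(avgIgnoringBad(Seq("10", "20", "<", ":O", "30"))) // prints 20.0
  }
}
```

If that assumption is right, it would explain why the average above is computed without any error: the abnormal rows simply contribute nothing to the sum or the count.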


Thank you.
