agg() and avg() are numeric functions dealing with numeric values. Why is
the column "number" defined as String type?

Do you perform data cleaning beforehand by any chance? It is good practice.
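
For example, a minimal sketch of cleaning the column up front (this assumes
a DataFrame df with your string column "number" and that
org.apache.spark.sql.functions._ is imported):

// Sketch only: cast the string column to double; non-numeric values such as
// "xyz" become null and can then be dropped before aggregating.
val cleaned = df
  .withColumn("number", col("number").cast("double"))
  .na.drop(Seq("number"))

cleaned.agg(avg("number")).show()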

Alternatively, you can use the rlike() function to filter rows that contain
only numeric values in a column.


scala> val data = Seq((1,"123456","123456"),
     |   (2,"3456234","ABCD12345"),(3,"48973456","ABCDEFGH"))
data: Seq[(Int, String, String)] = List((1,123456,123456), (2,3456234,ABCD12345), (3,48973456,ABCDEFGH))

scala> val df = data.toDF("id","all_numeric","alphanumeric")
df: org.apache.spark.sql.DataFrame = [id: int, all_numeric: string ... 1 more field]

scala> df.show()
+---+-----------+------------+
| id|all_numeric|alphanumeric|
+---+-----------+------------+
|  1|     123456|      123456|
|  2|    3456234|   ABCD12345|
|  3|   48973456|    ABCDEFGH|
+---+-----------+------------+

scala> df.filter(col("alphanumeric")
     |     .rlike("^[0-9]*$")
     |   ).show()
+---+-----------+------------+
| id|all_numeric|alphanumeric|
+---+-----------+------------+
|  1|     123456|      123456|
+---+-----------+------------+
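
To get back to your avg() use case, a rough sketch of the same idea applied
to your "number" column (untested, assuming org.apache.spark.sql.functions._
is imported) would be:

// Keep only rows whose "number" value is entirely digits, then cast and average.
df.filter(col("number").rlike("^[0-9]+$"))
  .agg(avg(col("number").cast("double")))
  .show()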


HTH


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 2 May 2022 at 01:21, wilson <wil...@4shield.net> wrote:

> I did a small test as follows.
>
> scala> df.printSchema()
> root
>   |-- fruit: string (nullable = true)
>   |-- number: string (nullable = true)
>
>
> scala> df.show()
> +------+------+
> | fruit|number|
> +------+------+
> | apple|     2|
> |orange|     5|
> |cherry|     7|
> |  plum|   xyz|
> +------+------+
>
>
> scala> df.agg(avg("number")).show()
> +-----------------+
> |      avg(number)|
> +-----------------+
> |4.666666666666667|
> +-----------------+
>
>
> As you see, the "number" column is string type, and there is an abnormal
> value in it.
>
> But for these two cases spark still handles the result pretty well. So I
> guess:
>
> 1) spark can make some auto translation from string to numeric when
> aggregating.
> 2) spark ignores those abnormal values automatically when calculating the
> relevant stuff.
>
> Am I right? thank you.
>
> wilson
>
>
>
>
> wilson wrote:
> > my dataset has abnormal values in the column whose normal values are
> > numeric. I can select them as:
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
