Parse Execution Plan from PySpark

2022-05-02 Thread Pablo Alcain
Hello all! I'm working with PySpark, trying to reproduce through streaming
processes some of the results we get in batch, just as a PoC for now. For
this, I'm thinking of interpreting the execution plan and eventually writing
it back to Python (I'm doing something similar with pandas as well, and I'd
like both approaches to be as similar as possible).

Let me clarify with an example: suppose that starting with a `geometry.csv`
file with `width` and `height` I want to calculate the `area` doing this:

>>> from pyspark.sql import functions as F
>>> geometry = spark.read.csv('geometry.csv', header=True)
>>> geometry = geometry.withColumn('area', F.col('width') * F.col('height'))

I would like to extract from the execution plan the fact that area is
calculated as the product of width * height. One possibility would be to
parse the execution plan:

>>> geometry.explain(True)

...
== Optimized Logical Plan ==
Project [width#45, height#46, (cast(width#45 as double) * cast(height#46 as double)) AS area#64]
+- Relation [width#45,height#46] csv
...

From the first line of the optimized logical plan we can parse the formula
"area = width * height" and then write the function back in any language.
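
Just to make that concrete, a rough sketch of what I mean by string parsing
(clearly fragile, and written only against the example plan above; the regex
is just an assumption about the text format, not something Spark guarantees):

import re

plan_line = ("Project [width#45, height#46, (cast(width#45 as double) * "
             "cast(height#46 as double)) AS area#64]")

# Strip the "#<exprId>" suffixes Spark appends to attribute names,
# then pull "(<expression>) AS <name>" pairs out of the Project list.
cleaned = re.sub(r"#\d+", "", plan_line)
for match in re.finditer(r"\((?P<expr>.+?)\) AS (?P<name>\w+)", cleaned):
    print(f"{match.group('name')} = {match.group('expr')}")
# prints: area = cast(width as double) * cast(height as double)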

However, even though I can get the logical plan as a string, there must be
some internal representation that I could leverage to avoid the string
parsing. Do you know if/how I can access that internal representation from
Python? I've been trying to navigate the Scala source code to find it, but
this is definitely beyond my area of expertise, so any pointers would be
more than welcome.
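
For reference, this is the direction I have in mind, going through the py4j
bridge (a minimal sketch; I'm assuming the private `_jdf` handle and the
JVM-side `queryExecution()` accessors are stable enough for a PoC, which may
not hold across versions):

# Reach into the JVM Dataset that backs the Python DataFrame (py4j proxies).
qe = geometry._jdf.queryExecution()

optimized = qe.optimizedPlan()          # Catalyst LogicalPlan as a JVM object
print(optimized.numberedTreeString())   # the same tree that explain() prints
print(optimized.prettyJson())           # JSON form of the plan, easier to walk

From there the idea would be to either walk the plan object through py4j or
load the JSON on the Python side, instead of parsing the explain() output.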

Thanks in advance,
Pablo


unsubscribe

2022-05-02 Thread Ahmed Kamal Abdelfatah




unsubscribe

2022-05-02 Thread Ray Qiu



Re: Vulnerabilities in htrace-core4-4.1.0-incubating.jar jar used in spark.

2022-05-02 Thread Artemis User
What scanner did you use? Looks like all CVEs you listed for 
jackson-databind-xxx.jar are for older versions (2.9.10.x).  A quick 
search on NVD revealed that there is only one CVE (CVE-2020-36518) that 
affects your Spark versions.  This CVE (not on your scanned CVE list) is 
on jackson-databind jar versions before 2.13.0, and Spark 3.2.1 uses 
version 2.12.x.  The other two Spark versions use version 2.10.x.


Surprisingly, Spark 3.2.0 uses version 2.13.0 of the jackson-databind
library (I don't know why 3.2.1 uses an older version), so Spark 3.2.0
shouldn't have any known CVEs related to jackson-databind. You may want
to either use Spark 3.2.0 or do your own Spark build with the latest
version of the jackson-databind lib (2.14.x).


On 5/2/22 1:46 AM, HARSH TAKKAR wrote:

We scanned 3 versions of Spark: 3.0.0, 3.1.3, 3.2.1.



On Tue, 26 Apr 2022 at 18:46, Bjørn Jørgensen wrote:


What version of spark is it that you have scanned?



On Tue, 26 Apr 2022 at 12:48, HARSH TAKKAR wrote:

Hello,

Please let me know if there is a fix available for the following
vulnerabilities in the htrace jar used in the Spark jars folder.
LIBRARY: com.fasterxml.jackson.core:jackson-databind

VULNERABILITY IDs:

CVE-2020-9548
CVE-2020-9547
CVE-2020-8840
CVE-2020-36179
CVE-2020-35491
CVE-2020-35490
CVE-2020-25649
CVE-2020-24750
CVE-2020-24616
CVE-2020-10673
CVE-2019-20330
CVE-2019-17531
CVE-2019-17267
CVE-2019-16943
CVE-2019-16942
CVE-2019-16335
CVE-2019-14893
CVE-2019-14892
CVE-2019-14540
CVE-2019-14439
CVE-2019-14379
CVE-2019-12086
CVE-2018-7489
CVE-2018-5968
CVE-2018-14719
CVE-2018-14718
CVE-2018-12022
CVE-2018-11307
CVE-2017-7525
CVE-2017-17485
CVE-2017-15095


Kind Regards

Harsh Takkar






Unsubscribe

2022-05-02 Thread Sahil Bali
Unsubscribe


Re: how spark handle the abnormal values

2022-05-02 Thread wilson

Thanks Mich.
But in my experience many original data sources have abnormal values included.
I already used rlike and filter to implement the data cleaning, as in this
write-up of mine:

https://bigcount.xyz/calculate-urban-words-vote-in-spark.html

What surprises me is that Spark does the string-to-numeric conversion
automatically and ignores the non-numeric values. Given this, my data
cleaning seems meaningless.
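
For what it's worth, here is a small PySpark sketch of the same test I ran
earlier (names are mine, just for illustration); it suggests the behaviour
comes from an implicit cast, where non-numeric strings become null and avg()
then skips the nulls:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("apple", "2"), ("orange", "5"), ("cherry", "7"), ("plum", "xyz")],
    ["fruit", "number"],
)

# The cast that avg() applies implicitly: "xyz" cannot be cast, so it is null.
df.select(F.col("number").cast("double").alias("number_as_double")).show()

# Aggregate functions skip nulls, so the abnormal row is silently dropped:
df.agg(F.avg("number")).show()   # (2 + 5 + 7) / 3 = 4.666...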


Thanks.

Mich Talebzadeh wrote:
agg() and avg() are numeric functions dealing with numeric values. Why is
the column "number" defined as String type?


Do you perform data cleaning beforehand by any chance? It is good practice.





Re: how spark handle the abnormal values

2022-05-02 Thread Mich Talebzadeh
agg() and avg() are numeric functions dealing with numeric values. Why is
the column "number" defined as String type?

Do you perform data cleaning beforehand by any chance? It is good practice.

Alternatively, you can use the rlike() function to filter rows that have
numeric values in a column:


scala> val data = Seq((1,"123456","123456"),
     |   (2,"3456234","ABCD12345"),(3,"48973456","ABCDEFGH"))
data: Seq[(Int, String, String)] = List((1,123456,123456), (2,3456234,ABCD12345), (3,48973456,ABCDEFGH))

scala> val df = data.toDF("id","all_numeric","alphanumeric")
df: org.apache.spark.sql.DataFrame = [id: int, all_numeric: string ... 1 more field]

scala> df.show()
+---+-----------+------------+
| id|all_numeric|alphanumeric|
+---+-----------+------------+
|  1|     123456|      123456|
|  2|    3456234|   ABCD12345|
|  3|   48973456|    ABCDEFGH|
+---+-----------+------------+

scala> df.filter(col("alphanumeric")
     |   .rlike("^[0-9]*$")
     | ).show()
+---+-----------+------------+
| id|all_numeric|alphanumeric|
+---+-----------+------------+
|  1|     123456|      123456|
+---+-----------+------------+


HTH






On Mon, 2 May 2022 at 01:21, wilson wrote:

> I did a small test as follows.
>
> scala> df.printSchema()
> root
>  |-- fruit: string (nullable = true)
>  |-- number: string (nullable = true)
>
>
> scala> df.show()
> +------+------+
> | fruit|number|
> +------+------+
> | apple|     2|
> |orange|     5|
> |cherry|     7|
> |  plum|   xyz|
> +------+------+
>
>
> scala> df.agg(avg("number")).show()
> +-----------+
> |avg(number)|
> +-----------+
> |      4.667|
> +-----------+
>
>
> As you see, the "number" column is string type, and there is an abnormal
> value in it.
>
> But in these two cases Spark still handles the result pretty well. So I
> guess:
>
> 1) Spark can do some automatic conversion from string to numeric when
> aggregating.
> 2) Spark ignores those abnormal values automatically when calculating the
> relevant stuff.
>
> Am I right? Thank you.
>
> wilson
>
>
>
>
> wilson wrote:
> > my dataset has abnormal values in the column whose normal values are
> > numeric. I can select them as:
>