Dataframe's storage size

2021-12-23 Thread bitfox
Hello. Is it possible to know a dataframe's total storage size in bytes? Something like df.size(), which fails with: Traceback (most recent call last): File "", line 1, in File "/opt/spark/python/pyspark/sql/dataframe.py", line 1660, in __getattr__ "'%s' object has no attribute '%s'" %
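A dataframe has no size() method in PySpark. One rough sketch goes through Spark's internal (non-public) JVM objects to read the Catalyst optimizer's size estimate; the method names below are Spark 3.x internals and may change between releases:

    # Not a public API: the optimizer's estimated plan size in bytes (Spark 3.x)
    size_bytes = df._jdf.queryExecution().optimizedPlan().stats().sizeInBytes()
    print(f"Estimated size: {size_bytes} bytes")

Alternatively, persisting the dataframe and checking the Storage tab of the WebUI shows the actual in-memory footprint after caching.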

Re: measure running time

2021-12-23 Thread bitfox
Hello list, I am running Spark 3.2.0. After starting pyspark with: $ pyspark --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17 I can't import from the sparkmeasure module: from sparkmeasure import StageMetrics Traceback (most recent call last): File "", line 1, in ModuleNotFoundError:
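If the traceback is read right, --packages only fetches the JVM jar; the sparkmeasure Python wrapper is a separate package and also has to be installed into the Python environment (e.g. with pip install sparkmeasure). A minimal sketch, assuming both pieces are in place:

    # assumes: pip install sparkmeasure, and pyspark started with
    #   --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17
    from sparkmeasure import StageMetrics
    stagemetrics = StageMetrics(spark)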

Re: measure running time

2021-12-23 Thread bitfox
Thanks Gourav and Luca. I will try the tools you provided on GitHub. On 2021-12-23 23:40, Luca Canali wrote: Hi, I agree with Gourav that just measuring execution time is a simplistic approach that may lead you to miss important details, in particular when running distributed

Re: measure running time

2021-12-23 Thread Mich Talebzadeh
Thanks Luca, I am still getting an error * pyspark --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17* Python 3.7.3 (default, Mar 27 2019, 22:11:17) [GCC 7.3.0] :: Anaconda, Inc. on linux Type "help", "copyright", "credits" or "license" for more information. :: loading settings ::

RE: measure running time

2021-12-23 Thread Luca Canali
Hi Mich, With Spark 3.1.1 you need to use spark-measure built with Scala 2.12: bin/pyspark --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17 Best, Luca From: Mich Talebzadeh Sent: Thursday, December 23, 2021 19:59 To: Luca Canali Cc: user Subject: Re: measure running
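Once the Scala 2.12 jar matches the Spark build and the sparkmeasure Python package is installed, a minimal usage sketch looks roughly like this; the SQL statement is only an illustrative workload:

    from sparkmeasure import StageMetrics

    stagemetrics = StageMetrics(spark)
    stagemetrics.begin()
    spark.sql("select count(*) from range(1000 * 1000)").show()  # workload to measure
    stagemetrics.end()
    stagemetrics.print_report()  # aggregated stage metrics, including elapsed time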

Re: measure running time

2021-12-23 Thread Mich Talebzadeh
Hi Luca, have you tested this link: https://github.com/LucaCanali/sparkMeasure ? With Spark 3.1.1/PySpark, I am getting this error: pyspark --packages ch.cern.sparkmeasure:spark-measure_2.11:0.17 :: problems summary :: ERRORS unknown resolver null SERVER ERROR: Bad

Re: About some Spark technical help

2021-12-23 Thread sam smith
Hi Andrew, Thanks, here's the GitHub repo with the code and the publication: https://github.com/SamSmithDevs10/paperReplicationForReview Kind regards On Thu, Dec 23, 2021 at 17:58, Andrew Davidson wrote: > Hi Sam > Can you tell us more? What is the algorithm? Can you send us the URL

Re: About some Spark technical help

2021-12-23 Thread Andrew Davidson
Hi Sam, can you tell us more? What is the algorithm? Can you send us the URL of the publication? Kind regards Andy From: sam smith Date: Wednesday, December 22, 2021 at 10:59 AM To: "user@spark.apache.org" Subject: About some Spark technical help Hello guys, I am replicating a paper's

Re: How to estimate the executor memory size according to the data

2021-12-23 Thread Gourav Sengupta
Hi, just trying to understand: 1. Are you using JDBC to consume data from HIVE? 2. Or are you reading data directly from S3, using the HIVE Metastore in SPARK just to find out where the table is stored and its metadata? Regards, Gourav Sengupta On Thu, Dec 23, 2021 at 2:13 PM Arthur Li
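For context, the two access paths being distinguished look roughly like this in PySpark; database, table, host, and port are placeholders:

    # 1. Hive Metastore path: Spark resolves the table location through the
    #    metastore and reads the underlying files directly.
    df = spark.table("mydb.my_table")

    # 2. JDBC path: every row is pulled through a HiveServer2 connection.
    df_jdbc = (spark.read.format("jdbc")
               .option("url", "jdbc:hive2://hive-host:10000/mydb")
               .option("dbtable", "my_table")
               .option("driver", "org.apache.hive.jdbc.HiveDriver")
               .load())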

RE: measure running time

2021-12-23 Thread Luca Canali
Hi, I agree with Gourav that just measuring execution time is a simplistic approach that may lead you to miss important details, in particular when running distributed computations. WebUI, REST API, and metrics instrumentation in Spark can be quite useful for further drill down. See
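As one example of the REST API drill-down mentioned above, stage-level metrics can be pulled from the driver; host, port, and application id below are placeholders:

    import json
    import urllib.request

    # placeholders: adjust driver host/port and the application id for your job
    url = ("http://driver-host:4040/api/v1/applications/"
           "app-20211223120000-0000/stages")
    with urllib.request.urlopen(url) as resp:
        stages = json.load(resp)
    for s in stages:
        print(s["stageId"], s["status"], s["executorRunTime"], s["name"])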

RE: How to estimate the executor memory size according to the data

2021-12-23 Thread Luca Canali
Hi Arthur, If you are using Spark 3.x you can use executor metrics for memory instrumentation. Metrics are available on the WebUI, see https://spark.apache.org/docs/latest/web-ui.html#stage-detail (search for Peak execution memory). Memory execution metrics are available also in the REST
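A sketch of reading those executor memory metrics from the REST API (Spark 3.x; host, port, and application id are placeholders):

    import json
    import urllib.request

    url = ("http://driver-host:4040/api/v1/applications/"
           "app-20211223120000-0000/executors")
    with urllib.request.urlopen(url) as resp:
        executors = json.load(resp)
    for e in executors:
        # peakMemoryMetrics is filled in once executor metrics have been reported
        print(e["id"], e.get("peakMemoryMetrics", {}))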

How to estimate the executor memory size according to the data

2021-12-23 Thread Arthur Li
Dear experts, Recently there have been some OOM issues in my demo jobs, which consume data from the Hive database, and I know I can increase the executor memory size to eliminate the OOM error. But I don't know how to do the executor memory assessment or how to automatically adapt the executor
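For reference, these are the settings usually involved; the values below are only illustrative, since the question in this thread is precisely how to derive them from observed peak usage rather than by trial and error:

    from pyspark.sql import SparkSession

    # illustrative values only; the app name is hypothetical
    spark = (SparkSession.builder
             .appName("hive-consumer-demo")
             .config("spark.executor.memory", "4g")
             .config("spark.executor.memoryOverhead", "1g")
             .config("spark.executor.cores", "2")
             .getOrCreate())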

Re: measure running time

2021-12-23 Thread Gourav Sengupta
Hi, I do not think that such time comparisons make any sense at all in distributed computation. Simply comparing an RDD operation and a DataFrame operation based on their start and stop times may not provide any valid information. You will have to look into the details of timing and the

dataset partitioning algorithm implementation help

2021-12-23 Thread sam smith
Hello All, I am replicating a paper's algorithm, a partitioning approach to anonymizing datasets, with Spark / Java, and want to ask you for some help reviewing my 150 lines of code. My GitHub repo, attached below, contains both my Java class and the related paper:

Re: measure running time

2021-12-23 Thread Mich Talebzadeh
Try this simple thing first:

    import time

    def main():
        start_time = time.time()
        print("\nStarted at"); uf.println(lst)
        # your code
        print("\nFinished at"); uf.println(lst)
        end_time = time.time()
        time_elapsed = (end_time - start_time)
        print(f"""Elapsed time in seconds is

measure running time

2021-12-23 Thread bitfox
Hello community, in pyspark how can I measure the running time of a command? I just want to compare the running time of the RDD API and the dataframe API, as in this blog of mine: https://bitfoxtop.wordpress.com/2021/12/23/count-email-addresses-using-sparks-rdd-and-dataframe/ I tried spark.time() it
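For what it is worth, spark.time() is a Scala-side helper and, as far as I know, is not exposed in PySpark. A minimal wall-clock timing sketch in Python follows; the input path is hypothetical, and an action such as count() is needed so that lazy evaluation does not make both APIs look instantaneous:

    import time
    from contextlib import contextmanager

    @contextmanager
    def timed(label):
        start = time.perf_counter()
        yield
        print(f"{label}: {time.perf_counter() - start:.3f}s")

    # hypothetical comparison: the same count through the RDD and DataFrame APIs
    with timed("RDD count"):
        spark.sparkContext.textFile("emails.txt").count()
    with timed("DataFrame count"):
        spark.read.text("emails.txt").count()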