Hello
Is it possible to get a DataFrame's total storage size in bytes? Such as:
df.size()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/spark/python/pyspark/sql/dataframe.py", line 1660, in __getattr__
    "'%s' object has no attribute '%s'" % (self.__class__.__name__, name))
AttributeError: 'DataFrame' object has no attribute 'size'
Hello list,
I run with Spark 3.2.0
After I started pyspark with:
$ pyspark --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17
I can't import from the sparkmeasure module:
from sparkmeasure import StageMetrics
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'sparkmeasure'
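The --packages option only fetches the JVM jar; the Python wrapper is a separate package that typically has to be installed with pip (e.g. pip install sparkmeasure). A minimal sketch of using it once the wrapper is installed, assuming pyspark was started with the --packages option above:

from sparkmeasure import StageMetrics

stagemetrics = StageMetrics(spark)
stagemetrics.begin()
spark.sql("select count(*) from range(1000 * 1000)").show()
stagemetrics.end()
stagemetrics.print_report()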
Thanks Gourav and Luca. I will try the tools you pointed to on GitHub.
On 2021-12-23 23:40, Luca Canali wrote:
Hi,
I agree with Gourav that just measuring execution time is a simplistic
approach that may lead you to miss important details, in particular
when running distributed computations.
Thanks Luca,
I am still getting an error:
$ pyspark --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17
Python 3.7.3 (default, Mar 27 2019, 22:11:17)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
:: loading settings :: url
Hi Mich,
With Spark 3.1.1 you need to use spark-measure built with Scala 2.12:
bin/pyspark --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17
Best,
Luca
From: Mich Talebzadeh
Sent: Thursday, December 23, 2021 19:59
To: Luca Canali
Cc: user
Subject: Re: measure running ti
Hi Luca,
Have you tested this link: https://github.com/LucaCanali/sparkMeasure ?
With Spark 3.1.1/PySpark, I am getting this error:
pyspark --packages ch.cern.sparkmeasure:spark-measure_2.11:0.17
:: problems summary ::
ERRORS
unknown resolver null
SERVER ERROR: Bad Gateway
Hi Andrew,
Thanks, here's the Github repo to the code and the publication :
https://github.com/SamSmithDevs10/paperReplicationForReview
Kind regards
On Thu, Dec 23, 2021 at 17:58, Andrew Davidson wrote:
> Hi Sam
>
>
>
> Can you tell us more? What is the algorithm? Can you send us the URL of the publication?
Hi Sam
Can you tell us more? What is the algorithm? Can you send us the URL of the
publication?
Kind regards
Andy
From: sam smith
Date: Wednesday, December 22, 2021 at 10:59 AM
To: "user@spark.apache.org"
Subject: About some Spark technical help
Hello guys,
I am replicating a paper's algorithm
Hi,
just trying to understand:
1. Are you using JDBC to consume data from Hive?
2. Or are you reading data directly from S3 and using the Hive Metastore in
Spark only to find out where the table is stored and its metadata?
Regards,
Gourav Sengupta
On Thu, Dec 23, 2021 at 2:13 PM Arthur Li wrote:
Hi,
I agree with Gourav that just measuring execution time is a simplistic approach
that may lead you to miss important details, in particular when running
distributed computations.
WebUI, REST API, and metrics instrumentation in Spark can be quite useful for
further drill down. See https:/
Hi Arthur,
If you are using Spark 3.x you can use executor metrics for memory
instrumentation.
Metrics are available on the WebUI, see
https://spark.apache.org/docs/latest/web-ui.html#stage-detail (search for Peak
execution memory).
Memory execution metrics are also available in the REST API.
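A minimal sketch of reading those executor metrics from the monitoring REST API, assuming the driver UI is reachable on localhost:4040 (host, port, and field handling below are placeholders to adapt):

# Read executor peak memory metrics from the Spark monitoring REST API.
import json
import urllib.request

base = "http://localhost:4040/api/v1"

with urllib.request.urlopen(f"{base}/applications") as resp:
    app_id = json.load(resp)[0]["id"]

with urllib.request.urlopen(f"{base}/applications/{app_id}/executors") as resp:
    for executor in json.load(resp):
        # peakMemoryMetrics is reported per executor in Spark 3.x
        print(executor["id"], executor.get("peakMemoryMetrics", {}))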
Dear experts,
Recently there have been some OOM issues in my demo jobs, which consume data from
the Hive database. I know I can increase the executor memory size to eliminate
the OOM error, but I don't know how to assess the executor memory requirement or
how to automatically adapt the executor memory.
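For reference, a minimal sketch of setting executor memory explicitly when building a session; the values below are placeholders, to be chosen after inspecting the peak memory metrics mentioned in the replies:

from pyspark.sql import SparkSession

# Example values only; pick them after inspecting the peak memory metrics.
spark = (
    SparkSession.builder
    .appName("demo")
    .config("spark.executor.memory", "4g")
    .config("spark.executor.memoryOverhead", "1g")
    .config("spark.executor.cores", "2")
    .getOrCreate()
)

Note that these settings take effect when the application (and its SparkContext) is created; they cannot be changed for an already running session.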
Hi,
I do not think that such time comparisons make any sense at all in
distributed computation. Just saying that an operation on an RDD and a DataFrame
can be compared based on their start and stop times may not provide any
valid information.
You will have to look into the details of timing and the ste
Hello All,
I am replicating a paper's algorithm, a partitioning approach to
anonymizing datasets, with Spark / Java, and want to ask you for some help
reviewing my 150 lines of code. My GitHub repo, linked below, contains both
my Java class and the related paper:
https://github.com/SamSmithDevs10/paperReplicationForReview
Try this simple thing first:

import time

def main():
    start_time = time.time()
    print("\nStarted at", time.ctime(start_time))
    # your code goes here
    end_time = time.time()
    print("\nFinished at", time.ctime(end_time))
    time_elapsed = end_time - start_time
    print(f"Elapsed time in seconds is {time_elapsed}")
Hello community,
In PySpark, how can I measure the running time of a command?
I just want to compare the running time of the RDD API and the DataFrame
API, as in this blog post of mine:
https://bitfoxtop.wordpress.com/2021/12/23/count-email-addresses-using-sparks-rdd-and-dataframe/
I tried spark.time() but it doesn't work.
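A minimal sketch of such a comparison, assuming a SparkSession `spark` and a text file of your own (the path below is a placeholder), is to wrap each call with a wall-clock timer; keep in mind the caveats raised elsewhere in this thread about single-run timings on a distributed system:

import time

path = "emails.txt"  # placeholder; substitute your own data file

# Time the RDD API
start = time.perf_counter()
rdd_count = spark.sparkContext.textFile(path).count()
print(f"RDD count: {rdd_count}, took {time.perf_counter() - start:.3f}s")

# Time the DataFrame API
start = time.perf_counter()
df_count = spark.read.text(path).count()
print(f"DataFrame count: {df_count}, took {time.perf_counter() - start:.3f}s")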