Hey Enrico, that does help me understand it, thanks for explaining.
Regarding this comment
> PySpark and Scala should behave identically here
Is it OK that Scala and PySpark optimization work differently in this case?
Tue, Dec 5, 2023 at 20:08, Enrico Minack:
> Hi Michail,
>
> with
--
Hi all,
Running on Dataproc 2.0/1.3/1.4, we use the INSERT OVERWRITE command to insert
new (time) partitions into existing Hive tables. But we see too many failures
coming from org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles. This is where
the driver moves the successful files from
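For context, a minimal sketch of the kind of statement involved, as one would pass to `spark.sql(...)`. This is an assumption about the setup, not the poster's code; the table, partition column, and view names are hypothetical:

```python
# Sketch (hypothetical names): build the INSERT OVERWRITE statement that
# replaces a single Hive time partition. When it runs, Hive moves the job's
# staged output files into the partition directory (Hive.replaceFiles),
# which is where the reported failures surface.
def insert_overwrite_partition(table: str, partition_col: str,
                               partition_val: str, source_view: str) -> str:
    """Return a statement that overwrites exactly one partition of `table`."""
    return (
        f"INSERT OVERWRITE TABLE {table} "
        f"PARTITION ({partition_col} = '{partition_val}') "
        f"SELECT * FROM {source_view}"
    )

stmt = insert_overwrite_partition("events", "dt", "2023-12-05", "staged_events")
print(stmt)
```

With dynamic partition overwrite (`spark.sql.sources.partitionOverwriteMode=dynamic`), only the partitions present in the query output are replaced rather than the whole table.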
Hey Mich,
Thanks for the detailed response. I get most of these options.
However, what we are trying to do is avoid having to upload the source
configs and pyspark.zip files to the cluster every time we execute the job
using spark-submit. Here is the code that does it:
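The snippet is cut off here. For reference, one common way to avoid re-uploading `pyspark.zip` and the configs on every submit is to stage them once on a filesystem the cluster can read and reference them by URI. This is a sketch under assumed paths, not the poster's code:

```shell
# One-time: stage the archives on HDFS (bucket/paths are hypothetical).
hadoop fs -mkdir -p hdfs:///libs
hadoop fs -put -f "$SPARK_HOME/python/lib/pyspark.zip" hdfs:///libs/pyspark.zip
hadoop fs -put -f conf/app.conf hdfs:///libs/app.conf

# Every run: reference the staged files by URI instead of shipping local
# copies, so spark-submit does not upload them again each time.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --py-files hdfs:///libs/pyspark.zip \
  --files hdfs:///libs/app.conf \
  job.py
```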
Hi Eugene,
With regard to your points:
> What are the PYTHONPATH and SPARK_HOME env variables in your script?
OK, let us look at a typical Spark project structure of mine:
- project_root
|-- README.md
|-- __init__.py
|-- conf
| |-- (configuration files for Spark)
|-- deployment
| |--
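The tree above is truncated, but for a layout like this the two environment variables would typically be set along these lines. A minimal sketch, assuming Spark is installed under /opt/spark (both paths and the py4j version are assumptions that vary by installation):

```shell
# SPARK_HOME points at the Spark installation (path is an assumption).
export SPARK_HOME=/opt/spark

# PYTHONPATH must expose Spark's Python libs; the py4j zip name varies
# by Spark release, so check $SPARK_HOME/python/lib for the exact file.
export PYTHONPATH="$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.9.7-src.zip:$PYTHONPATH"

# Also put the project root on PYTHONPATH so `import` finds the package.
export PYTHONPATH="/path/to/project_root:$PYTHONPATH"
```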
Hello Spark experts - I’m running Spark jobs in cluster mode using a
dedicated cluster for each job. Is there a way to see how much compute time
each job takes via Spark APIs, metrics, etc.? In case it makes a
difference, I’m using AWS EMR - I’d ultimately like to be able to say this
job costs $X
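One place to get this is the Spark monitoring REST API: `/api/v1/applications` on the UI/history server reports each attempt's wall-clock `duration` in milliseconds, and with one dedicated cluster per job, duration times the cluster's hourly price gives the cost. A sketch under assumptions (the application id, duration, and hourly rate below are invented examples, not real EMR prices):

```python
# Hedged sketch: turn an application's wall-clock duration into dollars.
# `sample` mimics one record from the Spark REST API's /api/v1/applications
# response; in practice you would fetch it with an HTTP GET.
HOURLY_RATE_USD = 4.80  # assumed whole-cluster $/hour for the dedicated cluster

def job_cost_usd(app: dict, hourly_rate: float = HOURLY_RATE_USD) -> float:
    """Cost of all attempts of one application, from their durations in ms."""
    total_ms = sum(attempt["duration"] for attempt in app["attempts"])
    return total_ms / 3_600_000 * hourly_rate  # ms -> hours -> dollars

# Invented sample record shaped like the REST API response:
sample = {
    "id": "application_1700000000000_0001",
    "attempts": [{"duration": 1_800_000}],  # 30 minutes of wall-clock time
}
print(f"${job_cost_usd(sample):.2f}")  # → $2.40
```

On EMR specifically, note that the history server keeps applications after the cluster terminates only if you persist event logs (e.g. to S3), so collecting durations there is more robust than querying live clusters.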