Hey Enrico, that does help me understand it, thanks for explaining.
Regarding this comment
> PySpark and Scala should behave identically here
Is it OK that Scala and PySpark optimization work differently in this case?
Tue, Dec 5, 2023 at 20:08, Enrico Minack:
> Hi Michail,
>
> with
--
Hi all,
Running on Dataproc 2.0/1.3/1.4, we use the INSERT OVERWRITE command to insert
new (time) partitions into existing Hive tables. But we see too many failures
coming from org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles. This is where
the driver moves the successful files from
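For context, a minimal sketch of the kind of statement involved, as one would pass to `spark.sql(...)`. This is an assumption about the setup, not the poster's code; the table, partition column, and view names are hypothetical:

```python
# Sketch (hypothetical names): build the INSERT OVERWRITE statement that
# replaces a single Hive time partition. When it runs, Hive moves the job's
# staged output files into the partition directory (Hive.replaceFiles),
# which is where the reported failures surface.
def insert_overwrite_partition(table: str, partition_col: str,
                               partition_val: str, source_view: str) -> str:
    """Return a statement that overwrites exactly one partition of `table`."""
    return (
        f"INSERT OVERWRITE TABLE {table} "
        f"PARTITION ({partition_col} = '{partition_val}') "
        f"SELECT * FROM {source_view}"
    )

stmt = insert_overwrite_partition("events", "dt", "2023-12-05", "staged_events")
print(stmt)
```

With dynamic partition overwrite (`spark.sql.sources.partitionOverwriteMode=dynamic`), only the partitions present in the query output are replaced rather than the whole table.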
Hey Mich,
Thanks for the detailed response. I get most of these options.
However, what we are trying to do is avoid having to upload the source
configs and pyspark.zip files to the cluster every time we execute the job
using spark-submit. Here is the code that does it:
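The snippet is cut off here. For reference, one common way to avoid re-uploading `pyspark.zip` and the configs on every submit is to stage them once on a filesystem the cluster can read and reference them by URI. This is a sketch under assumed paths, not the poster's code:

```shell
# One-time: stage the archives on HDFS (bucket/paths are hypothetical).
hadoop fs -mkdir -p hdfs:///libs
hadoop fs -put -f "$SPARK_HOME/python/lib/pyspark.zip" hdfs:///libs/pyspark.zip
hadoop fs -put -f conf/app.conf hdfs:///libs/app.conf

# Every run: reference the staged files by URI instead of shipping local
# copies, so spark-submit does not upload them again each time.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --py-files hdfs:///libs/pyspark.zip \
  --files hdfs:///libs/app.conf \
  job.py
```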
Hi Eugene,
With regard to your points:
> What are the PYTHONPATH and SPARK_HOME env variables in your script?
OK, let us look at a typical Spark project structure of mine:
- project_root
|-- README.md
|-- __init__.py
|-- conf
| |-- (configuration files for Spark)
|-- deployment
| |--
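The tree above is truncated, but for a layout like this the two environment variables would typically be set along these lines. A minimal sketch, assuming Spark is installed under /opt/spark (both paths and the py4j version are assumptions that vary by installation):

```shell
# SPARK_HOME points at the Spark installation (path is an assumption).
export SPARK_HOME=/opt/spark

# PYTHONPATH must expose Spark's Python libs; the py4j zip name varies
# by Spark release, so check $SPARK_HOME/python/lib for the exact file.
export PYTHONPATH="$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.9.7-src.zip:$PYTHONPATH"

# Also put the project root on PYTHONPATH so `import` finds the package.
export PYTHONPATH="/path/to/project_root:$PYTHONPATH"
```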
Hello Spark experts - I’m running Spark jobs in cluster mode using a
dedicated cluster for each job. Is there a way to see how much compute time
each job takes via Spark APIs, metrics, etc.? In case it makes a
difference, I’m using AWS EMR - I’d ultimately like to be able to say this
job costs $X
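One place to get this is the Spark monitoring REST API: `/api/v1/applications` on the UI/history server reports each attempt's wall-clock `duration` in milliseconds, and with one dedicated cluster per job, duration times the cluster's hourly price gives the cost. A sketch under assumptions (the application id, duration, and hourly rate below are invented examples, not real EMR prices):

```python
# Hedged sketch: turn an application's wall-clock duration into dollars.
# `sample` mimics one record from the Spark REST API's /api/v1/applications
# response; in practice you would fetch it with an HTTP GET.
HOURLY_RATE_USD = 4.80  # assumed whole-cluster $/hour for the dedicated cluster

def job_cost_usd(app: dict, hourly_rate: float = HOURLY_RATE_USD) -> float:
    """Cost of all attempts of one application, from their durations in ms."""
    total_ms = sum(attempt["duration"] for attempt in app["attempts"])
    return total_ms / 3_600_000 * hourly_rate  # ms -> hours -> dollars

# Invented sample record shaped like the REST API response:
sample = {
    "id": "application_1700000000000_0001",
    "attempts": [{"duration": 1_800_000}],  # 30 minutes of wall-clock time
}
print(f"${job_cost_usd(sample):.2f}")  # → $2.40
```

On EMR specifically, note that the history server keeps applications after the cluster terminates only if you persist event logs (e.g. to S3), so collecting durations there is more robust than querying live clusters.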