Re: Spark on Kubernetes scheduler variety

2021-06-23 Thread Klaus Ma
Hi team, I'm the kube-batch/Volcano founder, and I'm excited to hear that the Spark community also has such requirements :) Volcano provides several features for batch workloads, e.g. fair-share, queue, reservation, preemption/reclaim and so on. It has been used in several production environments with

Re: Performance Problems Migrating to S3A Committers

2021-06-23 Thread Artemis User
Thanks, Johnny, for sharing your experience. Have you tried using the S3A committer? It looks like this one was introduced in the latest Hadoop to solve problems with other committers. https://hadoop.apache.org/docs/r3.1.1/hadoop-aws/tools/hadoop-aws/committers.html - ND On 6/22/21 6:41 PM,

Re: Distributing a FlatMap across a Spark Cluster

2021-06-23 Thread Tom Barber
Looks like repartitioning was my friend; the work seems to be distributed across the cluster now. All good. Thanks! On Wed, Jun 23, 2021 at 2:18 PM Tom Barber wrote: > Okay so I tried another idea which was to use a real simple class to drive > a mapPartitions... because logic in my head seems to

Re: Parquet Metadata

2021-06-23 Thread Sam
Hi, I only know of the comments you can add to each column, where you could put these key-value pairs. Thanks. On Wed, Jun 23, 2021 at 11:31 AM Bode, Meikel, NMA-CFD < meikel.b...@bertelsmann.de> wrote: > Hi folks, > > > > Maybe not the right audience but maybe you came across such a requirement.

Parquet Metadata

2021-06-23 Thread Bode, Meikel, NMA-CFD
Hi folks, Maybe not the right audience, but maybe you have come across such a requirement. Is it possible to define a Parquet schema that contains technical column names and a list of translations of a given column name into different languages? To give an example: Technical: "custnr" would

Re: Spark on Kubernetes scheduler variety

2021-06-23 Thread Mich Talebzadeh
Please allow me to be diverse and express a different point of view on this roadmap. I believe that, from a technical point of view, spending time, effort and talent on batch scheduling on Kubernetes could be rewarding. However, if I may say so, I doubt whether such an approach and the so-called

Re: Distributing a FlatMap across a Spark Cluster

2021-06-23 Thread Tom Barber
Okay so I tried another idea which was to use a real simple class to drive a mapPartitions... because logic in my head seems to suggest that I want to map my partitions... @SerialVersionUID(100L) class RunCrawl extends Serializable{ def mapCrawl(x: Iterator[(String, Iterable[Resource])], job:

Re: Distributing a FlatMap across a Spark Cluster

2021-06-23 Thread Tom Barber
(I should point out that I'm diagnosing this by looking at the active tasks https://pasteboard.co/K7VryDJ.png; if I'm reading it incorrectly, let me know) On Wed, Jun 23, 2021 at 11:38 AM Tom Barber wrote: > Uff hello fine people. > > So the cause of the above issue was, unsurprisingly,

Re: Distributing a FlatMap across a Spark Cluster

2021-06-23 Thread Tom Barber
Uff hello fine people. So the cause of the above issue was, unsurprisingly, human error. I found a local[*] Spark master config which gazumped my own one, so mystery solved. But I have another question, which is still the crux of this problem: Here's a bit of trimmed code, that I'm