Dependency injection for spark executors

2023-04-20 Thread Deepak Patankar
I am writing a Spark application that uses Java and Spring Boot to process rows. For every row, it performs some logic and saves data into the database. The logic is performed using some services defined in my application and some external

Re: Partition by on dataframe causing a Sort

2023-04-20 Thread Nikhil Goyal
Is it possible to use MultipleOutputs with a custom OutputFormat and then use `saveAsHadoopFile` to achieve this?
On Thu, Apr 20, 2023 at 1:29 PM Nikhil Goyal wrote:
> Hi folks,
>
> We are writing a dataframe and doing a partitionBy() on it.
>

Partition by on dataframe causing a Sort

2023-04-20 Thread Nikhil Goyal
Hi folks,
We are writing a dataframe and doing a partitionBy() on it:
`df.write.partitionBy('col').parquet('output')`
The job is running very slowly because, internally, it performs a sort on each partition before it starts writing output to the final location. This sort isn't useful in any way since # of files