Reads can't be parallelized by default in a Spark job, and doing your own multi-threaded programming inside a Spark program isn't a good idea. Adding fast disk I/O and more RAM may speed things up, but it won't help with parallelization. You may have to be more creative here. One option: if each file or group of files can be processed independently, you can write a script or program on the client side that spawns multiple jobs and achieves the parallelism that way. A rough sketch of that idea follows.
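For example, a minimal client-side sketch in Python (the input path, the group count, and the job.py application name are all placeholders for illustration, not anything from your setup) might look like this:

    # Split the input files into groups and launch one spark-submit per
    # group, running the submissions concurrently from the client machine.
    import glob
    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    files = sorted(glob.glob("/data/input/*.parquet"))  # placeholder path
    n_groups = 4
    groups = [files[i::n_groups] for i in range(n_groups)]

    def run_group(file_group):
        # job.py stands in for your own Spark application, assumed here to
        # take a comma-separated list of input paths as its first argument.
        subprocess.run(["spark-submit", "job.py", ",".join(file_group)],
                       check=True)

    with ThreadPoolExecutor(max_workers=n_groups) as pool:
        list(pool.map(run_group, groups))

Each submission is an independent Spark application, so the cluster manager (YARN, Kubernetes, or standalone) decides how many of them actually run concurrently based on the resources available.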

On 10/3/22 7:29 PM, Henrik Pang wrote:
You may need a large amount of cluster memory and fast disk I/O.


Sachit Murarka wrote:
Can anyone please suggest whether there is any property to improve parallel reads? I am reading more than 25,000 files.


