Reads by default can't be parallelized any further within a single Spark job, and
doing your own multi-threaded programming inside a Spark program isn't a
good idea. Adding fast disk I/O and more RAM may speed things up, but
won't help with parallelization. You may have to be more creative here.
One option: if each file or group of files can be processed
independently, you can create a script or program on the client side that
spawns multiple jobs and achieves parallel processing that way...
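A minimal sketch of that client-side fan-out idea, assuming a hypothetical
process_batch.py Spark application that accepts a comma-separated file list,
and spark-submit available on the PATH (both are illustrative, not from the
original thread):

```python
# Client-side fan-out sketch: split the input files into disjoint batches
# and launch one Spark job per batch concurrently.
# Assumptions (hypothetical): process_batch.py exists and takes a
# comma-separated list of input paths; spark-submit is on the PATH.
import subprocess
from concurrent.futures import ThreadPoolExecutor

def chunk(files, n_jobs):
    """Split the file list into n_jobs roughly equal, disjoint batches."""
    return [files[i::n_jobs] for i in range(n_jobs)]

def submit_batch(batch):
    """Run one batch as its own Spark job; blocks until spark-submit exits."""
    cmd = ["spark-submit", "process_batch.py", ",".join(batch)]
    return subprocess.run(cmd).returncode

def submit_all(files, n_jobs=8):
    """Launch n_jobs Spark jobs concurrently from the client machine."""
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        return list(pool.map(submit_batch, chunk(files, n_jobs)))

# Demonstration of just the batching step (no jobs are launched here):
batches = chunk([f"/data/part-{i:05d}" for i in range(25000)], n_jobs=8)
print(len(batches), sum(len(b) for b in batches))  # 8 25000
```

How many jobs to run at once depends on what the cluster's scheduler and
queue can absorb; the batches are disjoint, so each file is processed by
exactly one job.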
On 10/3/22 7:29 PM, Henrik Pang wrote:
You may need a cluster with large memory and fast disk I/O.
Sachit Murarka wrote:
Can anyone please suggest whether there is any property to improve
parallel reads? I am reading more than 25,000 files.
---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org