Spark splits a 10 GB CSV into multiple partitions by default, unless it's gzipped (gzip is not a splittable compression format, so the whole file has to be read by a single task). Something else is going on here.
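As a rough sanity check on "it would split into multiple partitions", the number of input splits for a splittable file can be approximated as the file size divided by `spark.sql.files.maxPartitionBytes` (128 MB by default). This is a simplified model that ignores `spark.sql.files.openCostInBytes` and the cluster's default parallelism, but it shows the order of magnitude to expect:

```python
import math

def estimated_partitions(file_size_bytes: int,
                         max_partition_bytes: int = 128 * 1024 * 1024,
                         splittable: bool = True) -> int:
    """Rough estimate of Spark input partitions for a single file.

    A file compressed with a non-splittable codec (e.g. .csv.gz) always
    yields a single partition, regardless of its size.
    """
    if not splittable:
        return 1
    return math.ceil(file_size_bytes / max_partition_bytes)

ten_gb = 10 * 1024 ** 3
print(estimated_partitions(ten_gb))                    # 80
print(estimated_partitions(ten_gb, splittable=False))  # 1
```

So an uncompressed 10 GB CSV should come in as roughly 80 partitions with default settings; seeing exactly one partition is the signature of a non-splittable (e.g. gzipped) file.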
On Tue, Mar 23, 2021 at 10:04 PM Yuri Oleynikov (יורי אולייניקוב) <yur...@gmail.com> wrote:

> I'm not a Spark core developer and don't want to confuse you, but it seems
> logical to me that just reading from a single file (no matter what format
> the file uses) gives no parallelism unless you repartition by some column
> right after the CSV load, but you're saying you've already tried
> repartition with no luck...
>
> On 24 Mar 2021, at 03:47, KhajaAsmath Mohammed <mdkhajaasm...@gmail.com> wrote:
>
> > So Spark by default doesn't split the large 10 GB file when loaded?
> >
> > Sent from my iPhone
> >
> >> On Mar 23, 2021, at 8:44 PM, Yuri Oleynikov (יורי אולייניקוב) <yur...@gmail.com> wrote:
> >>
> >> Hi, Mohammed
> >> I think the reason that only one executor is running with a single
> >> partition is that you have a single file that might be read/loaded
> >> into memory.
> >>
> >> In order to achieve better parallelism, I'd suggest splitting the CSV file.
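The "split the CSV file" suggestion above can be sketched with standard `gunzip` and `split`; the filenames and chunk size here are made up for illustration, and a small generated file stands in for the real 10 GB one:

```shell
# Gzip is not splittable, so decompress and split into line-based chunks
# that Spark can read in parallel. (Sample data stands in for big.csv.gz.)
seq 1 100 | gzip > /tmp/big.csv.gz

rm -rf /tmp/parts
mkdir -p /tmp/parts

# Split into chunks of 25 lines each (pick a much larger -l for real data).
gunzip -c /tmp/big.csv.gz | split -l 25 - /tmp/parts/part_

ls /tmp/parts | wc -l   # 4 chunks
```

Spark can then load the whole directory as one DataFrame, e.g. `spark.read.csv("/tmp/parts/")`, and each chunk becomes an independent input split. Note this only helps if a header line is handled (or absent); `split` does not replicate a CSV header into each chunk.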