Spark splits a 10 GB CSV into multiple partitions by default, unless it's gzipped (gzip is not a splittable compression format, so the whole file has to be read by a single task). Something else is going on here.
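As a rough sanity check on "it would split into multiple partitions", the number of input splits for a splittable file can be approximated as the file size divided by `spark.sql.files.maxPartitionBytes` (128 MB by default). This is a simplified model that ignores `spark.sql.files.openCostInBytes` and the cluster's default parallelism, but it shows the order of magnitude to expect:

```python
import math

def estimated_partitions(file_size_bytes: int,
                         max_partition_bytes: int = 128 * 1024 * 1024,
                         splittable: bool = True) -> int:
    """Rough estimate of Spark input partitions for a single file.

    A file compressed with a non-splittable codec (e.g. .csv.gz) always
    yields a single partition, regardless of its size.
    """
    if not splittable:
        return 1
    return math.ceil(file_size_bytes / max_partition_bytes)

ten_gb = 10 * 1024 ** 3
print(estimated_partitions(ten_gb))                    # 80
print(estimated_partitions(ten_gb, splittable=False))  # 1
```

So an uncompressed 10 GB CSV should come in as roughly 80 partitions with default settings; seeing exactly one partition is the signature of a non-splittable (e.g. gzipped) file.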
On Tue, Mar 23, 2021 at 10:04 PM Yuri Oleynikov (יורי אולייניקוב) <yur...@gmail.com> wrote:

> I'm not a Spark core developer and don't want to confuse you, but it seems
> logical to me that just reading from a single file (no matter what format
> the file uses) gives no parallelism unless you repartition by some column
> right after the CSV load, but you're saying you've already tried
> repartition with no luck...
>
> On 24 Mar 2021, at 03:47, KhajaAsmath Mohammed <mdkhajaasm...@gmail.com> wrote:
>
> > So Spark by default doesn't split the large 10 GB file when loaded?
> >
> > Sent from my iPhone
> >
> >> On Mar 23, 2021, at 8:44 PM, Yuri Oleynikov (יורי אולייניקוב) <yur...@gmail.com> wrote:
> >>
> >> Hi, Mohammed
> >> I think the reason that only one executor is running with a single
> >> partition is that you have a single file that might be read/loaded
> >> into memory.
> >>
> >> In order to achieve better parallelism, I'd suggest splitting the CSV file.
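The "split the CSV file" suggestion above can be sketched with standard `gunzip` and `split`; the filenames and chunk size here are made up for illustration, and a small generated file stands in for the real 10 GB one:

```shell
# Gzip is not splittable, so decompress and split into line-based chunks
# that Spark can read in parallel. (Sample data stands in for big.csv.gz.)
seq 1 100 | gzip > /tmp/big.csv.gz

rm -rf /tmp/parts
mkdir -p /tmp/parts

# Split into chunks of 25 lines each (pick a much larger -l for real data).
gunzip -c /tmp/big.csv.gz | split -l 25 - /tmp/parts/part_

ls /tmp/parts | wc -l   # 4 chunks
```

Spark can then load the whole directory as one DataFrame, e.g. `spark.read.csv("/tmp/parts/")`, and each chunk becomes an independent input split. Note this only helps if a header line is handled (or absent); `split` does not replicate a CSV header into each chunk.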