Stop using collect for this purpose. Either continue the further
processing in Spark (maybe you need to use streaming), or sink the data to
something that can accept it (GCS/S3/Azure
storage/Redshift/Elasticsearch/whatever), and have the further processing
read from that sink.
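
Something like this, a rough sketch in Scala (the bucket paths and the
DataFrame are placeholders, and S3 is just one of the sinks mentioned
above; GCS, Azure storage, etc. work the same way):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("sink-instead-of-collect")
      .getOrCreate()

    // Hypothetical input path; substitute however you build your dataset.
    val df = spark.read.parquet("s3a://my-bucket/in")

    // Write the full dataset out in parallel from the executors,
    // instead of pulling it through the driver with collect().
    df.write
      .mode("overwrite")
      .parquet("s3a://my-bucket/out")

    // Downstream processing then reads from the sink:
    // val next = spark.read.parquet("s3a://my-bucket/out")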
On Thu, Sep 5,
Hi.
I have been trying to collect a large dataset (about 2 GB in size, 30
columns, more than a million rows) onto the driver side. I am aware that
collecting such a huge dataset isn't recommended; however, the application
within which the Spark driver is running requires that data.
While collecting