Re: Collecting large dataset

2019-09-05 Thread Marcin Tustin
Stop using collect for this purpose. Either continue your further processing in Spark (you may need to use streaming), or sink the data to something that can accept it (GCS/S3/Azure Storage/Redshift/Elasticsearch/whatever), and have the further processing read from that sink.
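
For concreteness, a minimal Scala sketch of the sink-and-re-read pattern described above, assuming Parquet as the format and a hypothetical s3a:// bucket path (any of the stores listed would work the same way):

    import org.apache.spark.sql.SparkSession

    object SinkInsteadOfCollect {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("sink-instead-of-collect")
          .getOrCreate()

        // Hypothetical source, standing in for however the result is produced.
        val df = spark.read.parquet("s3a://my-bucket/input/")

        // Write the result to durable storage instead of collecting it
        // onto the driver.
        df.write
          .mode("overwrite")
          .parquet("s3a://my-bucket/exports/run-2019-09-05/")

        // A downstream job (or the driver application itself) then reads
        // from the sink without ever holding the full dataset in driver memory.
        val exported = spark.read.parquet("s3a://my-bucket/exports/run-2019-09-05/")
        println(s"exported rows: ${exported.count()}")

        spark.stop()
      }
    }
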

Collecting large dataset

2019-09-05 Thread Rishikesh Gawade
Hi. I have been trying to collect a large dataset (about 2 GB in size, 30 columns, more than a million rows) onto the driver side. I am aware that collecting such a huge dataset isn't recommended; however, the application within which the Spark driver is running requires that data. While collecting
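
If the driver application genuinely has to iterate over the rows, one driver-memory-friendly alternative to collect() is Dataset.toLocalIterator(), which fetches one partition at a time instead of the whole result at once. A minimal sketch, with a hypothetical input path standing in for the real dataset:

    import org.apache.spark.sql.{Row, SparkSession}

    object CollectAlternative {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("collect-alternative")
          .getOrCreate()

        // Hypothetical source, standing in for the ~2 GB, 30-column dataset.
        val df = spark.read.parquet("s3a://my-bucket/large-dataset/")

        // df.collect() would materialize every row in driver memory at once,
        // and is capped by spark.driver.maxResultSize (1 GB by default), so a
        // 2 GB collect fails out of the box. toLocalIterator() pulls one
        // partition at a time, so the driver only holds a single partition's
        // rows in memory at any moment.
        val it: java.util.Iterator[Row] = df.toLocalIterator()
        while (it.hasNext) {
          val row = it.next()
          // process the row on the driver side
        }

        spark.stop()
      }
    }
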