Hi,

I have a specific problem where I have to fetch data from REST APIs,
store it, run some transformations on it, and then write it to an RDBMS
table.

I am wondering if Spark will help in this regard.

I am confused about how to store the data while I am actually acquiring
it on the driver node.

Is there any way I can partition the data "on the fly" (i.e. during
acquisition)?

Here are two ways I think this could be done:

Approach 1:

Run a loop on the driver node to collect all the data via HTTP requests
and then create a DataFrame from it.

Problem: this will result in an OOM on the driver node, since the data
is too large to fit on one machine and needs to be spread out. How do I
do that? My rough attempt at this is sketched below.
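
Here is how I imagine spreading the acquisition out instead (a rough
sketch; the endpoint, page count, and JSON response shape are all made
up, and it assumes the API is pageable so the page URLs can be
enumerated up front):

    import requests  # must be installed on the executors
    from pyspark.sql import Row, SparkSession

    spark = SparkSession.builder.appName("rest-ingest").getOrCreate()

    # Hypothetical paged endpoint: enumerate the page URLs up front.
    urls = ["https://api.example.com/items?page=%d" % p for p in range(1000)]

    def fetch_pages(url_iter):
        # Runs on the executors: each task fetches its own slice of
        # URLs, so no record ever passes through the driver.
        with requests.Session() as session:
            for url in url_iter:
                for item in session.get(url, timeout=30).json():
                    yield Row(**item)

    # numSlices spreads the fetch work (and hence the data) "on the fly".
    rdd = spark.sparkContext.parallelize(urls, numSlices=100) \
               .mapPartitions(fetch_pages)
    df = spark.createDataFrame(rdd)

Is that the right way to avoid the driver bottleneck?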

Approach 2:

Build an app that pushes the data received from the REST APIs to a Kafka topic.

Use Spark Structured Streaming to read from that topic.

Problem: how will Spark know how to partition the data from the Kafka
topic? Here is roughly what I had in mind for the read side.
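
A sketch of that (the broker address and topic name are placeholders,
and it assumes the spark-sql-kafka package is on the classpath):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("kafka-ingest").getOrCreate()

    # The Kafka source creates one Spark partition per Kafka topic
    # partition, so the topic's partition count (and the producer's
    # record keys) decide the initial layout across the cluster.
    stream = (spark.readStream
                   .format("kafka")
                   .option("kafka.bootstrap.servers", "broker:9092")
                   .option("subscribe", "rest-data")
                   .load())

    # Kafka delivers the payload as bytes; cast it, then repartition if
    # the Kafka-driven layout doesn't suit the transformations.
    values = stream.select(col("value").cast("string").alias("json")) \
                   .repartition(100)

From there I suppose foreachBatch could hand each micro-batch to a
plain JDBC write into the RDBMS table, but I'm not sure that's the
idiomatic way.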

*Basically, my problem boils down to sending each piece of data to a
worker node as I receive it. Can that be done somehow?*
-- 
Regards,
Sreyan Chakravarty
