pyspark - Use Spark to generate a large dataset on the fly
Hi,

I have a specific problem: I have to fetch data from REST APIs and store it, do some transformations on it, and then write it to an RDBMS table. I am wondering whether Spark will help in this regard. I am confused about how to store the data while I am actually acquiring it on the driver node. Is there any way I can partition the data "on the fly" (i.e., during acquisition)?

Here are two ways I think this could be done:

Approach 1: Run a loop on the driver node to collect all the data via HTTP requests, then create a DataFrame from it.
Problem: This will result in an OOM on the driver node, since the data is so large that it needs to be spread out. How do I do that?

Approach 2: Build an app that pushes the data received from the REST APIs to a Kafka topic, then use Spark's Structured Streaming to read from that topic.
Problem: How will Spark know how to partition the data from the Kafka topic?

Basically, my problem boils down to sending each piece of data to a worker node as I receive it. Can that be done somehow?

--
Regards,
Sreyan Chakravarty
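One pattern worth considering for Approach 1: instead of fetching everything on the driver, parallelize the list of work items (e.g., API page numbers) and let each executor fetch its own slice via `mapPartitions`, so no single node ever holds the full dataset. The sketch below is a minimal illustration of that idea; `BASE_URL`, `fetch_partition`, and the paged-API shape are assumptions for illustration, not a real endpoint, and the Spark calls are shown commented since they depend on your cluster setup.

```python
import json
import urllib.request

# Hypothetical placeholder endpoint; replace with your real API.
BASE_URL = "https://example.invalid/api"

def split_pages(total_pages, num_partitions):
    """Divide page numbers into roughly equal chunks, one per partition.

    This mirrors what Spark does when you call
    sparkContext.parallelize(range(total_pages), num_partitions).
    """
    chunk = (total_pages + num_partitions - 1) // num_partitions
    return [list(range(i, min(i + chunk, total_pages)))
            for i in range(0, total_pages, chunk)]

def fetch_partition(pages):
    """Runs on an executor: each task fetches only its own pages,
    so the data is partitioned "on the fly" during acquisition."""
    for p in pages:
        with urllib.request.urlopen(f"{BASE_URL}?page={p}") as resp:
            yield from json.load(resp)  # assumes each page is a JSON array of rows

# Driver side, with a SparkSession `spark` available, something like:
#
#   rdd = spark.sparkContext.parallelize(range(TOTAL_PAGES), NUM_PARTITIONS)
#   rows = rdd.mapPartitions(fetch_partition)   # HTTP calls happen on executors
#   df = spark.createDataFrame(rows, schema)
#   df.write.jdbc(jdbc_url, "target_table", properties=jdbc_props)
```

The key point is that only the small list of page numbers lives on the driver; the actual records are produced and held partition-by-partition on the executors.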