pyspark - Use Spark to generate a large dataset on the fly
Hi,

I have a specific problem: I have to fetch data from REST APIs and store it, do some transformations on it, and then write it to an RDBMS table. I am wondering whether Spark will help in this regard. I am confused about how to store the data while I am actually acquiring it on the driver node. Is there any way I can partition the data "on the fly" (i.e., during acquisition)?

Here are two ways I think this could be done:

Approach 1: Run a loop on the driver node to collect all the data via HTTP requests, then create a DataFrame from it.
Problem: This will result in an OOM on the driver node, since the data is so large that it needs to be spread out. How do I do that?

Approach 2: Build an app that pushes the data received from the REST APIs to a Kafka topic, then use Spark's Structured Streaming to read from that topic.
Problem: How will Spark know how to partition the data from the Kafka topic?

Basically, my problem boils down to sending each piece of data to a worker node as I receive it. Can that be done somehow?

--
Regards,
Sreyan Chakravarty
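One pattern worth considering for Approach 1: instead of fetching everything on the driver, parallelize the list of work items (e.g., API page numbers) and let each executor fetch its own slice via `mapPartitions`, so no single node ever holds the full dataset. The sketch below is a minimal illustration of that idea; `BASE_URL`, `fetch_partition`, and the paged-API shape are assumptions for illustration, not a real endpoint, and the Spark calls are shown commented since they depend on your cluster setup.

```python
import json
import urllib.request

# Hypothetical placeholder endpoint; replace with your real API.
BASE_URL = "https://example.invalid/api"

def split_pages(total_pages, num_partitions):
    """Divide page numbers into roughly equal chunks, one per partition.

    This mirrors what Spark does when you call
    sparkContext.parallelize(range(total_pages), num_partitions).
    """
    chunk = (total_pages + num_partitions - 1) // num_partitions
    return [list(range(i, min(i + chunk, total_pages)))
            for i in range(0, total_pages, chunk)]

def fetch_partition(pages):
    """Runs on an executor: each task fetches only its own pages,
    so the data is partitioned "on the fly" during acquisition."""
    for p in pages:
        with urllib.request.urlopen(f"{BASE_URL}?page={p}") as resp:
            yield from json.load(resp)  # assumes each page is a JSON array of rows

# Driver side, with a SparkSession `spark` available, something like:
#
#   rdd = spark.sparkContext.parallelize(range(TOTAL_PAGES), NUM_PARTITIONS)
#   rows = rdd.mapPartitions(fetch_partition)   # HTTP calls happen on executors
#   df = spark.createDataFrame(rows, schema)
#   df.write.jdbc(jdbc_url, "target_table", properties=jdbc_props)
```

The key point is that only the small list of page numbers lives on the driver; the actual records are produced and held partition-by-partition on the executors.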