Hi Harry,

Ideally you should not be fetching a URL inside your transformation job; do the API calls separately (outside the cluster if possible). Ingesting data should be treated as a separate step from transformation / cleaning / join operations.

One way to structure this: create another dataframe containing just the URLs, dedupe it if required, and write it to a file. A plain Python function can then read that file, fetch the content for each URL, and after every X API calls build a dataframe of the results and union it with the previous one. Finally, write the fetched content out and, if required, join it back to the original dataframe on the URL column.
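A minimal sketch of that plain-Python ingestion loop (names like `ingest`, `max_calls_per_batch`, and the stub `fake_fetch` are my own for illustration; a real fetcher would wrap something like urllib.request.urlopen, and the resulting rows would be written to JSON/Parquet and joined back in Spark):

```python
import time

def ingest(urls, fetch, max_calls_per_batch=100, pause_s=1.0):
    """Fetch each URL with a plain Python call, pausing between batches.

    `fetch` is any callable url -> str; a stub stands in below so this
    sketch runs without network access.
    """
    rows = []
    for i, url in enumerate(urls, 1):
        rows.append({"url": url, "content": fetch(url)})
        if i % max_calls_per_batch == 0:
            time.sleep(pause_s)  # crude throttle between batches of calls
    return rows

# Stub fetcher standing in for a real HTTP call (assumption for the demo).
fake_fetch = lambda url: "<html>" + url + "</html>"

rows = ingest(["http://a", "http://b"], fake_fetch, max_calls_per_batch=10)
# rows can now be written out (e.g. as JSON) and joined back in Spark:
#   spark.read.json("fetched.json").join(original_df, "url")
```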
If fetching inside the job is absolutely necessary, here are a couple of ways to achieve it.

Approach 1: Use Spark's foreachPartition <https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.foreachPartition.html>, which applies an ordinary Python function to each partition (it takes an iterator of rows, not a SQL UDF). Inside that function you can open a connection and limit the API calls made per partition. This works if you add logic that checks the current number of partitions and distributes max_api_calls across them. E.g. if no_of_partitions = 4 and total_max_api_calls = 4, pass the function a parameter max_partition_api_calls = 1. The limitation is that the total allowed API calls must be at least the number of partitions.

Approach 2: An alternative is to create the connection with a rate limiter outside the per-partition function (link <https://stackoverflow.com/questions/40748687/python-api-rate-limiting-how-to-limit-api-calls-globally>) and use that connection variable inside the function in each partition, invoking time.sleep as needed. This will still run into trouble when many partitions try to invoke the API at once, since each executor process gets its own copy of the limiter rather than a truly global one.

I found this Medium article <https://medium.com/geekculture/how-to-execute-a-rest-api-call-on-apache-spark-the-right-way-in-python-4367f2740e78> which discusses the issue you are facing, though it does not offer a solution. Do check the comments as well.

Regards,
Varun

On Sat, Aug 26, 2023 at 10:32 AM Harry Jamison <harryjamiso...@yahoo.com.invalid> wrote:

> I am using python 3.7 and Spark 2.4.7
>
> I am not sure what the best way to do this is.
>
> I have a dataframe with a url in one of the columns, and I want to
> download the contents of that url and put it in a new column.
>
> Can someone point me in the right direction on how to do this?
> I looked at the UDFs and they seem confusing to me.
>
> Also, is there a good way to rate limit the number of calls I make per
> second?
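P.S. The rate-limiter idea from Approach 2 can be sketched in plain Python (the `RateLimiter` class below is my own illustration, not a library API; inside Spark it would only limit calls within one executor process, which is exactly the caveat mentioned above):

```python
import threading
import time

class RateLimiter:
    """Allow at most `max_calls` calls per `period` seconds.

    Thread-safe within one process. In Spark each executor gets its own
    copy, so this caps the rate per executor, not per cluster.
    """
    def __init__(self, max_calls, period=1.0):
        self.min_interval = period / max_calls
        self._lock = threading.Lock()
        self._next_allowed = 0.0  # monotonic timestamp of next free slot

    def wait(self):
        """Block until the next call slot is available."""
        with self._lock:
            now = time.monotonic()
            delay = self._next_allowed - now
            self._next_allowed = max(now, self._next_allowed) + self.min_interval
        if delay > 0:
            time.sleep(delay)

limiter = RateLimiter(max_calls=4, period=1.0)  # at most 4 calls/second
start = time.monotonic()
for _ in range(5):
    limiter.wait()  # the 5th call has to wait for a free slot
elapsed = time.monotonic() - start
```

Each partition's function would call `limiter.wait()` before every API request; the fifth call above blocks until roughly one second has passed since the first.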