Hi Harry,

Ideally, you should not fetch URLs inside your transformation job; do the
API calls separately (outside the cluster if possible). Ingesting data
should be treated separately from transformation / cleaning / join
operations. You can create another dataframe of URLs, dedup it if required,
and store it in a file. A normal Python function can then ingest the
content for each URL; after every X API calls, create a dataframe from the
results and union it with the previous one. Finally, write the content out
and, if required, join it back to the original df on the URL column.
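
For example, a minimal sketch of that pipeline, assuming your dataframe df
has a url column; the requests library, the batch size, and the
/tmp/url_content path are all illustrative choices, not requirements:

import requests
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.getOrCreate()

# 1. Collect the deduplicated URLs out of the main dataframe (df assumed).
urls = [r.url for r in df.select("url").distinct().collect()]

# 2. Fetch them with a plain Python loop, outside any Spark job,
#    flushing every X calls so partial progress becomes a dataframe.
BATCH_SIZE = 100
fetched = None
batch = []
for i, url in enumerate(urls, start=1):
    batch.append(Row(url=url, content=requests.get(url, timeout=10).text))
    if i % BATCH_SIZE == 0 or i == len(urls):
        part = spark.createDataFrame(batch)
        fetched = part if fetched is None else fetched.union(part)
        batch = []

# 3. Persist the fetched content, then join it back onto the original df.
if fetched is not None:
    fetched.write.mode("overwrite").parquet("/tmp/url_content")
    result = df.join(spark.read.parquet("/tmp/url_content"),
                     on="url", how="left")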

If this is absolutely necessary, here are a few ways to achieve this:

Approach-1:
You can use Spark's foreachPartition
<https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.foreachPartition.html>
which takes a function that receives an iterator over the rows of each
partition. Inside it, you can create one connection per partition and cap
the number of API calls made from it.

This can work if you introduce logic that checks the current number of
partitions and then distributes the max API calls across them.
e.g. if no_of_partitions = 4 and total_max_api_calls = 4, then you can pass
the function a parameter max_partition_api_calls = 1.

This approach has a limitation: the maximum allowed API calls must be at
least the number of partitions, otherwise some partitions get a budget of
zero. A sketch follows below.
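
Here is a sketch, again assuming a dataframe df with a url column, and a
hypothetical save_result() sink (foreachPartition is side-effect only and
returns nothing, so fetched content has to be written out, not returned):

import requests

def save_result(url, content):
    # hypothetical sink -- replace with your DB / file / queue writer
    pass

def fetch_partition(rows, max_partition_api_calls):
    session = requests.Session()  # one connection per partition
    calls = 0
    for row in rows:
        if calls >= max_partition_api_calls:
            break  # this partition's budget is exhausted
        save_result(row.url, session.get(row.url, timeout=10).text)
        calls += 1

no_of_partitions = df.rdd.getNumPartitions()
total_max_api_calls = 4  # illustrative global budget
max_partition_api_calls = total_max_api_calls // no_of_partitions
# if this is 0 (more partitions than allowed calls), the approach
# breaks down -- the limitation described above

df.foreachPartition(
    lambda rows: fetch_partition(rows, max_partition_api_calls))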

Approach-2
An alternative approach is to create a rate limiter
(link
<https://stackoverflow.com/questions/40748687/python-api-rate-limiting-how-to-limit-api-calls-globally>)
and use it inside the function run for each partition, invoking time.sleep
between calls. Be aware that anything created on the driver is pickled and
copied into each task rather than shared, so the limit is only enforced per
partition; many partitions invoking the API concurrently can still exceed
the overall rate.
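
A minimal sketch under those caveats, again assuming a df with a url column
and the same hypothetical save_result() sink; the limiter is created inside
the partition function (rather than on the driver), since each task would
get its own copy of driver-side state anyway:

import time
import requests

CALLS_PER_SECOND = 2  # illustrative per-partition rate, not global

def save_result(url, content):
    # hypothetical sink, as in the Approach-1 sketch
    pass

def rate_limited_fetch(rows):
    session = requests.Session()
    min_interval = 1.0 / CALLS_PER_SECOND
    last_call = 0.0
    for row in rows:
        elapsed = time.monotonic() - last_call
        if elapsed < min_interval:
            time.sleep(min_interval - elapsed)  # throttle this partition
        last_call = time.monotonic()
        save_result(row.url, session.get(row.url, timeout=10).text)

df.foreachPartition(rate_limited_fetch)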

I found this Medium article
<https://medium.com/geekculture/how-to-execute-a-rest-api-call-on-apache-spark-the-right-way-in-python-4367f2740e78>
which discusses the issue you are facing, but does not offer a solution.
Do check the comments as well.

Regards,
Varun


On Sat, Aug 26, 2023 at 10:32 AM Harry Jamison
<harryjamiso...@yahoo.com.invalid> wrote:

> I am using python 3.7 and Spark 2.4.7
>
> I am not sure what the best way to do this is.
>
> I have a dataframe with a url in one of the columns, and I want to
> download the contents of that url and put it in a new column.
>
> Can someone point me in the right direction on how to do this?
> I looked at the UDFs and they seem confusing to me.
>
> Also, is there a good way to rate limit the number of calls I make per
> second?
>
