Re: Redesign approach for hitting the APIs using PySpark

2022-06-13 Thread Sid
What do you mean by overkill here? I tried the below way to iterate over 4k records under a while loop; however, it runs only for the first record. What could be wrong here? I am going through a few SO posts where a user found the below approach faster than the withColumn approach: finalDF = finalDF.sel
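Since the quoted snippet above is cut off, here is a minimal sketch (not Sid's actual code) of doing the batched POSTs per partition with foreachPartition instead of a driver-side while loop; the endpoint URL, payload shape, and stand-in data are all assumptions, and the requests library has to be available on the executors.

import json
import requests
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Stand-in for the real finalDF; two illustrative columns only.
finalDF = spark.range(100000).selectExpr("id", "concat('name_', id) AS name")

API_URL = "https://example.com/ingest"   # hypothetical endpoint
BATCH_SIZE = 4000                        # the API accepts 4k records per call

def post_partition(rows):
    """Accumulate the rows of one partition and POST them in 4k-record chunks."""
    buffer = []
    for row in rows:
        buffer.append(row.asDict())
        if len(buffer) == BATCH_SIZE:
            requests.post(API_URL, data=json.dumps(buffer),
                          headers={"Content-Type": "application/json"})
            buffer = []
    if buffer:  # flush the remainder of the partition
        requests.post(API_URL, data=json.dumps(buffer),
                      headers={"Content-Type": "application/json"})

# Each executor posts its own slice, so the loop is not restricted to the first record.
finalDF.foreachPartition(post_partition)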

Re: Redesign approach for hitting the APIs using PySpark

2022-06-13 Thread Gourav Sengupta
Hi,
>> spark.range(1).createOrReplaceTempView("test")
>> maximum_records_per_api_call = 40
>> batch_count = spark.sql("SELECT * FROM test").count() / maximum_records_per_api_call
>> spark.sql("SELECT id, mod(monotonically_increasing_id() / batch_count) batch_id FROM test).repartitionByRange("b
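A hedged reconstruction (not Gourav's exact code, which is truncated above) of the mod-plus-repartitionByRange idea in DataFrame API form; the 4,000-record batch size comes from the thread, everything else is an assumption. Note that mod() takes two arguments, and that monotonically_increasing_id() is not contiguous, so the resulting buckets are only roughly even in size.

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(1000000)                     # stand-in for the real data
maximum_records_per_api_call = 4000
batch_count = max(1, int(df.count() / maximum_records_per_api_call))

batched = (
    df.withColumn("batch_id", F.monotonically_increasing_id() % batch_count)
      .repartitionByRange(batch_count, "batch_id")  # roughly one API batch per partition
)
# Each partition can then be posted with foreachPartition, as in the sketch further up.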

Re: Redesign approach for hitting the APIs using PySpark

2022-06-13 Thread Sid
Hi Gourav, Could you please provide me with some examples? On Mon, Jun 13, 2022 at 2:23 PM Gourav Sengupta wrote: > Hi, > > try to use mod of a monotonically increasing field and then use > repartitionbyrange function, and see whether SPARK automatically serialises > it based on the number of e

Re: Redesign approach for hitting the APIs using PySpark

2022-06-13 Thread Gourav Sengupta
Hi, try to use mod of a monotonically increasing field and then use the repartitionByRange function, and see whether Spark automatically serialises it based on the number of executors that you put in the job. But once again, this is kind of an overkill for fetching data from an API, creating a simple
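A minimal plain-Python sketch of the "why not use normal Python" suggestion, assuming the records are already available as a list of dicts and assuming a hypothetical endpoint; nothing here comes from the original thread.

import json
import requests

API_URL = "https://example.com/ingest"   # hypothetical endpoint
BATCH_SIZE = 4000                        # the API accepts 4k records per call

records = [{"id": i, "name": f"name_{i}"} for i in range(10000)]  # stand-in data

# Slice the list into consecutive 4k-record chunks and POST each chunk.
for start in range(0, len(records), BATCH_SIZE):
    batch = records[start:start + BATCH_SIZE]
    resp = requests.post(API_URL, data=json.dumps(batch),
                         headers={"Content-Type": "application/json"})
    resp.raise_for_status()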

Re: Redesign approach for hitting the APIs using PySpark

2022-06-13 Thread Sid
Hi Gourav, Do you have any examples or links, please? That would help me to understand. Thanks, Sid On Mon, Jun 13, 2022 at 1:42 PM Gourav Sengupta wrote: > Hi, > I think that serialising data using spark is an overkill, why not use > normal python. > > Also have you tried repartition by range

Re: Redesign approach for hitting the APIs using PySpark

2022-06-13 Thread Gourav Sengupta
Hi, I think that serialising data using Spark is an overkill; why not use normal Python? Also, have you tried repartition by range? That way you can use the modulus operator to batch things up. Regards, Gourav On Mon, Jun 13, 2022 at 8:37 AM Sid wrote: > Hi Team, > > I am trying to hit the POST AP

Redesign approach for hitting the APIs using PySpark

2022-06-13 Thread Sid
Hi Team, I am trying to hit POST APIs for the very first time using PySpark. My end goal is to achieve something like the below:
1. Generate the data
2. Send the data in batches of 4k records, since the API can accept 4k records at once.
3. The record would
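Since the original message is truncated, here is a hedged end-to-end sketch of the flow described above (generate the data, then POST it in 4k-record batches); the generated columns and the endpoint are assumptions. toLocalIterator() streams rows to the driver one partition at a time, so the whole DataFrame never has to be collected at once.

import requests
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# 1. Generate the data (stand-in for the real generation step).
df = spark.range(100000).selectExpr("id", "concat('name_', id) AS name")

# 2. Send the data in batches of 4,000 records, since the API accepts 4k at once.
API_URL = "https://example.com/ingest"   # hypothetical endpoint
BATCH_SIZE = 4000

buffer = []
for row in df.toLocalIterator():
    buffer.append(row.asDict())
    if len(buffer) == BATCH_SIZE:
        requests.post(API_URL, json=buffer)
        buffer = []
if buffer:                                # flush the final partial batch
    requests.post(API_URL, json=buffer)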