Marco Costantini <marco.costant...@rocketfncl.com>
I have two tables: {users, orders}. In this example, let's say that for
each User in the users table, there are 100,000 Orders in the orders table.

I have to use pyspark to generate a statement of Orders for each User. So,
a single user will need his/her own list of Orders. Additionally, I need to
send this statement to the real-world user via email (for example).

My first intuition was to apply a DataFrame.foreach() on the users
DataFrame. That way, I can rely on the Spark workers to handle the email
sending individually. However, I do not yet know the best way to get each
User's Orders.

I will soon try the following (pseudo-code):

```
users_df = <my entire users DataFrame>
orders_df = <my entire orders DataFrame>

# this is poorly named for max understandability in this context
def foreach_function(row):
    user_id = row.user_id
    # filter (not select) the orders down to this user's rows
    user_orders_df = orders_df.where(f'user_id = {user_id}')

    # here, I'd get any User info from 'row'
    # then, I'd convert all 'user_orders' to JSON
    # then, I'd prepare the email and send it

users_df.foreach(foreach_function)
```
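
To make the last comments concrete, the "prepare the email and send it" step
I have in mind is roughly the following (a sketch only; the SMTP host, the
sender address, and the email field on the user row are placeholders, not my
actual setup):

```
import json
import smtplib
from email.message import EmailMessage

def send_statement(user_row, user_orders):
    # assumes each order is a pyspark Row; serialize them into the statement body
    body = json.dumps([order.asDict() for order in user_orders], default=str)

    msg = EmailMessage()
    msg["Subject"] = "Your order statement"
    msg["From"] = "statements@example.com"   # placeholder sender
    msg["To"] = user_row.email               # assumes users carry an email column
    msg.set_content(body)

    # hand the message to a plain SMTP relay (placeholder host)
    with smtplib.SMTP("smtp.example.com") as smtp:
        smtp.send_message(msg)
```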

It is my understanding that if I do my user-specific work in the foreach
function, I will capitalize on Spark's scalability when doing that work.
However, I am worried about two things:

If I take all Orders up front (roughly as sketched below)...

Will that work?
Will I be taking too much? Will I be pulling Orders onto partitions that
won't handle them (because they belong to a different User)?
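
By "taking all Orders up front" I mean something like the following (a
sketch; collect_list would pack all ~100,000 of a user's Orders into a single
row, which is part of what worries me):

```
from pyspark.sql import functions as F

# gather each user's orders into one list column, up front
orders_per_user = (
    orders_df
    .groupBy("user_id")
    .agg(F.collect_list(F.struct(*orders_df.columns)).alias("orders"))
)

# attach the user attributes needed for the email
statements_df = users_df.join(orders_per_user, on="user_id", how="inner")

def send_statement_row(row):
    # row.orders holds this user's full list of orders;
    # the remaining columns on row are the user's own attributes
    ...  # build the JSON statement and send the email here

statements_df.foreach(send_statement_row)
```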

If I create the orders_df (filtered) within the foreach function (roughly as
sketched below)...

Will it work?
Will that be too much I/O against the DB?
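
By "creating the filtered orders within the foreach function" I mean roughly
one lookup per user, issued directly from the worker (a sketch;
get_db_connection() and the SQL are hypothetical placeholders for whatever
driver the orders table lives behind, since I doubt the DataFrame API itself
is usable inside foreach):

```
# one DB connection per partition, one query per user (hypothetical driver)
def process_partition(user_rows):
    conn = get_db_connection()  # hypothetical helper
    try:
        for row in user_rows:
            cursor = conn.cursor()
            cursor.execute(
                "SELECT * FROM orders WHERE user_id = %s", (row.user_id,)
            )
            user_orders = cursor.fetchall()
            # build the JSON statement and send the email here
    finally:
        conn.close()

users_df.foreachPartition(process_partition)
```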

The question ultimately is: How can I achieve this goal efficiently?

I have not yet tried anything here. I am doing so as we speak, but am
suffering from choice-paralysis.

Please and thank you.
