Marco Costantini <[email protected]>
5:55 PM (5 minutes ago)
to user
I have two tables: {users, orders}. In this example, let's say that for
each User in the users table, there are roughly 100,000 Orders in the orders
table. I have to use PySpark to generate a statement of Orders for each User,
so a single user will need his/her own list of Orders. Additionally, I need to
send this statement to the real-world user via email (for example).
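For reference, the data I'm picturing is shaped roughly like the toy
DataFrames below; the column names are simplified stand-ins for illustration,
not my real schema:
```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("order-statements").getOrCreate()

# Toy stand-ins for the real tables; columns are illustrative only.
users_df = spark.createDataFrame(
    [(1, "[email protected]"), (2, "[email protected]")],
    ["user_id", "email"],
)
orders_df = spark.createDataFrame(
    [(101, 1, 9.99), (102, 1, 24.50), (201, 2, 3.75)],
    ["order_id", "user_id", "amount"],
)
```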
My first intuition was to apply DataFrame.foreach() to the users
DataFrame. This way, I can rely on the Spark workers to handle the email
sending individually. However, I do not know the best way to get each
User's Orders from inside that function.
I will soon try the following (pseudo-code):
```
users_df = <my entire users DataFrame>
orders_df = <my entire orders DataFrame>

# the function is named this plainly purely for understandability in this context
def foreach_function(row):
    user_id = row.user_id
    user_orders_df = orders_df.filter(f'user_id = {user_id}')
    # here, I'd get any User info from 'row'
    # then, I'd convert all 'user_orders' to JSON
    # then, I'd prepare the email and send it

users_df.foreach(foreach_function)
```
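For completeness, the "convert to JSON and send it" part I have in mind looks
roughly like the sketch below. The SMTP host, sender address, and the idea
that each users Row carries an email column are placeholders for illustration,
not my real setup:
```
import json
import smtplib
from email.message import EmailMessage

def send_statement(user_row, user_orders):
    # user_orders is assumed to be a list of plain dicts, one per Order.
    statement_json = json.dumps({"user_id": user_row.user_id, "orders": user_orders})

    msg = EmailMessage()
    msg["Subject"] = "Your order statement"
    msg["From"] = "[email protected]"          # placeholder sender
    msg["To"] = user_row.email                      # assumes an 'email' column on users
    msg.set_content(statement_json)

    with smtplib.SMTP("smtp.example.com") as smtp:  # placeholder SMTP host
        smtp.send_message(msg)
```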
It is my understanding that if I do my user-specific work in the foreach
function, I will capitalize on Spark's scalability when doing that work.
However, I am worried about two things:
1. If I take all Orders up front (sketch below)...
   - Will that work?
   - Will I be taking too much?
   - Will I be taking Orders on partitions that won't handle them (a
     different User)?
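By "up front" I mean something like the following: join the Orders to the
Users and collapse each user's Orders into one row before doing any per-user
work. This reuses the toy column names from above; whether collect_list holds
up at ~100,000 Orders per User is exactly what I'm unsure about:
```
from pyspark.sql import functions as F

# One row per user, carrying that user's entire list of Orders.
statements_df = (
    orders_df
    .join(users_df, on="user_id", how="inner")
    .groupBy("user_id", "email")
    .agg(F.collect_list(F.struct("order_id", "amount")).alias("orders"))
)

# Stand-in for the real "build JSON and email it" step.
statements_df.foreach(lambda row: print(row.user_id, len(row.orders)))
```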
2. If I create the filtered orders_df within the foreach function, as in the
   pseudo-code above (fallback sketch below)...
   - Will it work?
   - Will that be too much IO to the DB?
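I am not even sure that referencing orders_df inside the foreach function is
allowed. If it is not, the fallback I picture is each worker querying the
source database directly, one connection per partition. A rough sketch,
assuming (purely for illustration) a Postgres source and psycopg2; the DSN,
table, and column names are placeholders:
```
import psycopg2  # assuming a Postgres source, purely for illustration

def email_partition(rows):
    # One DB connection per partition rather than one per user.
    conn = psycopg2.connect("dbname=shop user=reporting")  # placeholder DSN
    cur = conn.cursor()
    try:
        for row in rows:
            cur.execute(
                "SELECT order_id, amount FROM orders WHERE user_id = %s",
                (row.user_id,),
            )
            user_orders = cur.fetchall()
            # ...convert user_orders to JSON, build the email, send it...
    finally:
        cur.close()
        conn.close()

users_df.foreachPartition(email_partition)
```
That at least bounds the number of connections, but it is still one query per
User, which is the "too much IO" part that worries me.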
The question ultimately is: how can I achieve this goal efficiently?
I have not tried anything yet. I am starting to as we speak, but I am
suffering from choice-paralysis.
Please and thank you.