Thanks Mich, I have not, but I will certainly read up on this today. To your point that all of the essential data is in the 'orders' table: I agree! That distills the problem nicely. Still, I have some questions that someone may be able to shed light on.
1) If my 'orders' table is very large and needs to be aggregated by 'user_id', how will Spark optimize for that constraint (only reading data for the relevant 'user_id's)? Is that something I have to instruct Spark to do?

2) Without #1, even with windowing, am I asking each partition to search through too much?

If you have any links to documentation on *how* Spark works under the hood for these operations, I would appreciate them. Spark has become a pillar on my team, and knowing it in more detail is warranted.

Slightly pivoting the subject: I have tried something. It was a suggestion from an AI chatbot and it seemed reasonable. In my main Spark script I now have the line:

```
grouped_orders_df = (
    orders_df
    .groupBy('user_id')
    .agg(collect_list(to_json(struct('user_id', 'timestamp', 'total', 'description'))).alias('orders'))
)
```

(JSON is ultimately needed.)

This actually achieves my goal by putting all of a user's 'orders' into a single array column. My worry now is whether this column will become too large if there are a great many orders. Is there a limit? I have searched for documentation on such a limit but could not find any. (I have sketched the full end-to-end flow I am experimenting with at the bottom of this message, below the quoted thread.)

I truly appreciate your help, Mich and team.

Marco

On Tue, Apr 25, 2023 at 5:40 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Have you thought of using windowing functions
> <https://sparkbyexamples.com/spark/spark-sql-window-functions/> to
> achieve this?
>
> Effectively all your information is in the orders table.
>
> HTH
>
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead
> Palantir Technologies Limited
> London
> United Kingdom
>
>
> On Tue, 25 Apr 2023 at 00:15, Marco Costantini <
> marco.costant...@rocketfncl.com> wrote:
>
>> I have two tables: {users, orders}. In this example, let's say that for
>> each User in the users table, there are 100,000 Orders in the orders table.
>>
>> I have to use pyspark to generate a statement of Orders for each User.
>> So, a single user will need his/her own list of Orders. Additionally, I
>> need to send this statement to the real-world user via email (for example).
>>
>> My first intuition was to apply a DataFrame.foreach() on the users
>> DataFrame. This way, I can rely on the Spark workers to handle the email
>> sending individually. However, I now do not know the best way to get each
>> User's Orders.
>>
>> I will soon try the following (pseudo-code):
>>
>> ```
>> users_df = <my entire users DataFrame>
>> orders_df = <my entire orders DataFrame>
>>
>> # this is poorly named for max understandability in this context
>> def foreach_function(row):
>>     user_id = row.user_id
>>     user_orders_df = orders_df.where(f'user_id = {user_id}')
>>
>>     # here, I'd get any User info from 'row'
>>     # then, I'd convert all 'user_orders' to JSON
>>     # then, I'd prepare the email and send it
>>
>> users_df.foreach(foreach_function)
>> ```
>>
>> It is my understanding that if I do my user-specific work in the foreach
>> function, I will capitalize on Spark's scalability when doing that work.
>> However, I am worried about two things:
>>
>> If I take all Orders up front...
>>
>> Will that work? Will I be taking too much? Will I be taking Orders on
>> partitions that won't handle them (a different User)?
>>
>> If I create the orders_df (filtered) within the foreach function...
>>
>> Will it work? Will that be too much IO to the DB?
>>
>> The question ultimately is: how can I achieve this goal efficiently?
>>
>> I have not yet tried anything here. I am doing so as we speak, but am
>> suffering from choice-paralysis.
>>
>> Please and thank you.
>
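
P.S. For concreteness, here is a minimal sketch of the end-to-end flow I am experimenting with, building on the groupBy/collect_list line above. The table names, the 'email' column on users, and the send_email() helper are placeholders I have made up for illustration; I have not run this at any real scale, so please treat it as a sketch rather than a working solution.

```
from pyspark.sql import SparkSession
from pyspark.sql.functions import collect_list, struct, to_json

spark = SparkSession.builder.appName("order-statements").getOrCreate()

# Placeholder sources; in reality these come from our DB.
users_df = spark.table("users")    # assumed columns: user_id, email, ...
orders_df = spark.table("orders")  # assumed columns: user_id, timestamp, total, description

# One row per user, with all of that user's orders packed into an array of JSON strings.
grouped_orders_df = (
    orders_df
    .groupBy("user_id")
    .agg(collect_list(to_json(struct("user_id", "timestamp", "total", "description"))).alias("orders"))
)

# Attach the user-level fields needed to address the statement.
statements_df = users_df.join(grouped_orders_df, on="user_id", how="inner")

def send_email(to, body):
    # Placeholder only; the real version would call our email service.
    print(f"would send {len(body)} bytes to {to}")

def send_statements(rows):
    # foreachPartition hands each worker an iterator of rows, so any
    # connection setup can be done once per partition instead of once per row.
    for row in rows:
        statement_json = "[" + ",".join(row.orders) + "]"  # row.orders is a Python list of JSON strings
        send_email(to=row.email, body=statement_json)

statements_df.foreachPartition(send_statements)
```

If a single user's array turns out to be too big, I suppose I could repartition by user_id and build the JSON inside the partition instead, but that is exactly the part I am unsure about.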