Thanks Mich, I have not, but I will certainly read up on this today. To your point that all of the essential data is in the 'orders' table: I agree! That distills the problem nicely. Still, I have some questions that someone may be able to shed light on.
1) If my 'orders' table is very large and needs to be aggregated by 'user_id', how will Spark optimize for that constraint (only reading data for the relevant 'user_id's)? Is that something I have to instruct Spark to do?

2) Without #1, even with windowing, am I asking each partition to search through too much?

If you have any links to documentation on *how* Spark works under the hood for these operations, I would appreciate them. Spark has become a pillar on my team, and knowing it in more detail is warranted.

Slightly pivoting the subject: I have tried something. It was a suggestion from an AI chatbot and it seemed reasonable. In my main Spark script I now have the line:

```
grouped_orders_df = (
    orders_df
    .groupBy('user_id')
    .agg(collect_list(to_json(struct('user_id', 'timestamp', 'total', 'description'))).alias('orders'))
)
```

(JSON is ultimately needed.)

This actually achieves my goal by putting all of a user's 'orders' into a single array column. My worry now is whether this column will become too large if there are a great many orders. Is there a limit? I have searched for documentation on such a limit but could not find any. (I have sketched the full end-to-end flow I am experimenting with at the bottom of this message, below the quoted thread.)

I truly appreciate your help, Mich and team.

Marco

On Tue, Apr 25, 2023 at 5:40 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Have you thought of using windowing functions
> <https://sparkbyexamples.com/spark/spark-sql-window-functions/> to
> achieve this?
>
> Effectively all your information is in the orders table.
>
> HTH
>
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead
> Palantir Technologies Limited
> London
> United Kingdom
>
>
> On Tue, 25 Apr 2023 at 00:15, Marco Costantini <
> marco.costant...@rocketfncl.com> wrote:
>
>> I have two tables: {users, orders}. In this example, let's say that for
>> each User in the users table, there are 100,000 Orders in the orders table.
>>
>> I have to use pyspark to generate a statement of Orders for each User.
>> So, a single user will need his/her own list of Orders. Additionally, I
>> need to send this statement to the real-world user via email (for example).
>>
>> My first intuition was to apply a DataFrame.foreach() on the users
>> DataFrame. This way, I can rely on the Spark workers to handle the email
>> sending individually. However, I now do not know the best way to get each
>> User's Orders.
>>
>> I will soon try the following (pseudo-code):
>>
>> ```
>> users_df = <my entire users DataFrame>
>> orders_df = <my entire orders DataFrame>
>>
>> # this is poorly named for max understandability in this context
>> def foreach_function(row):
>>     user_id = row.user_id
>>     user_orders_df = orders_df.where(f'user_id = {user_id}')
>>
>>     # here, I'd get any User info from 'row'
>>     # then, I'd convert all 'user_orders' to JSON
>>     # then, I'd prepare the email and send it
>>
>> users_df.foreach(foreach_function)
>> ```
>>
>> It is my understanding that if I do my user-specific work in the foreach
>> function, I will capitalize on Spark's scalability when doing that work.
>> However, I am worried about two things:
>>
>> If I take all Orders up front...
>>
>> Will that work? Will I be taking too much? Will I be taking Orders on
>> partitions that won't handle them (a different User)?
>>
>> If I create the orders_df (filtered) within the foreach function...
>>
>> Will it work? Will that be too much IO to the DB?
>>
>> The question ultimately is: how can I achieve this goal efficiently?
>>
>> I have not yet tried anything here. I am doing so as we speak, but am
>> suffering from choice-paralysis.
>>
>> Please and thank you.
>
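
P.S. For concreteness, here is a minimal sketch of the end-to-end flow I am experimenting with, building on the groupBy/collect_list line above. The table names, the 'email' column on users, and the send_email() helper are placeholders I have made up for illustration; I have not run this at any real scale, so please treat it as a sketch rather than a working solution.

```
from pyspark.sql import SparkSession
from pyspark.sql.functions import collect_list, struct, to_json

spark = SparkSession.builder.appName("order-statements").getOrCreate()

# Placeholder sources; in reality these come from our DB.
users_df = spark.table("users")    # assumed columns: user_id, email, ...
orders_df = spark.table("orders")  # assumed columns: user_id, timestamp, total, description

# One row per user, with all of that user's orders packed into an array of JSON strings.
grouped_orders_df = (
    orders_df
    .groupBy("user_id")
    .agg(collect_list(to_json(struct("user_id", "timestamp", "total", "description"))).alias("orders"))
)

# Attach the user-level fields needed to address the statement.
statements_df = users_df.join(grouped_orders_df, on="user_id", how="inner")

def send_email(to, body):
    # Placeholder only; the real version would call our email service.
    print(f"would send {len(body)} bytes to {to}")

def send_statements(rows):
    # foreachPartition hands each worker an iterator of rows, so any
    # connection setup can be done once per partition instead of once per row.
    for row in rows:
        statement_json = "[" + ",".join(row.orders) + "]"  # row.orders is a Python list of JSON strings
        send_email(to=row.email, body=statement_json)

statements_df.foreachPartition(send_statements)
```

If a single user's array turns out to be too big, I suppose I could repartition by user_id and build the JSON inside the partition instead, but that is exactly the part I am unsure about.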