Re: What is the best way to organize a join within a foreach?

2023-04-27 Thread Amit Joshi
Hi Marco, I am not sure you will have access to a DataFrame inside foreach, since the Spark context is not serializable, if I remember correctly. One thing you can do is use the cogroup operation on both datasets. This gives you (key, (iter(v1), iter(v2))). Then use foreachPartition
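
A minimal sketch of the cogroup-then-foreachPartition idea described above, assuming both datasets are keyed RDDs of (user_id, dict). The sample data, the field names (email, order_id, amount), and the helpers build_statement and handle_partition are hypothetical and not from the thread; the print call stands in for whatever sink is actually used. pyspark is imported lazily so the formatting helper can be exercised without a Spark installation.

```python
from typing import Iterable, Iterator, Tuple


def build_statement(user_id, users: Iterable[dict], orders: Iterable[dict]) -> str:
    """Format one user's statement from the two cogrouped iterables."""
    user = next(iter(users), None)
    email = user["email"] if user else "unknown"
    lines = [f"Statement for user {user_id} <{email}>"]
    for order in sorted(orders, key=lambda o: o["order_id"]):
        lines.append(f"  order {order['order_id']}: {order['amount']:.2f}")
    return "\n".join(lines)


def handle_partition(
    rows: Iterator[Tuple[int, Tuple[Iterable[dict], Iterable[dict]]]]
) -> None:
    """Runs on an executor; it sees plain Python objects, not DataFrames."""
    for user_id, (users, orders) in rows:
        statement = build_statement(user_id, users, orders)
        # Hypothetical sink: replace with the real per-user side effect.
        print(statement)


def run_with_spark() -> None:
    # Imported lazily so the helpers above do not require Spark.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[2]")
        .appName("cogroup-sketch")
        .getOrCreate()
    )
    sc = spark.sparkContext
    users = sc.parallelize(
        [(1, {"email": "a@example.com"}), (2, {"email": "b@example.com"})]
    )
    orders = sc.parallelize(
        [
            (1, {"order_id": 10, "amount": 5.0}),
            (1, {"order_id": 11, "amount": 7.5}),
            (2, {"order_id": 20, "amount": 3.0}),
        ]
    )
    # cogroup yields (key, (iterable_of_users, iterable_of_orders)).
    users.cogroup(orders).foreachPartition(handle_partition)
    spark.stop()
```

Calling run_with_spark() needs a local Spark installation. Because each partition handler receives only plain iterables, no SparkSession or DataFrame has to be serialized to the executors.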

Re: What is the best way to organize a join within a foreach?

2023-04-26 Thread Mich Talebzadeh
Again, one try is worth many opinions. Try it, gather metrics from the Spark UI, and see how it performs. On Wed, 26 Apr 2023 at 14:57, Marco Costantini <marco.costant...@rocketfncl.com> wrote: > Thanks team, > Email was just an example. The point was to illustrate that some actions > could be

Re: What is the best way to organize a join within a foreach?

2023-04-26 Thread Marco Costantini
Thanks team, Email was just an example. The point was to illustrate that some actions could be chained using Spark's foreach. In reality, this is an S3 write and a Kafka message production, which I think is quite reasonable for Spark to do. To answer Ayan's first question: yes, all a users

Re: What is the best way to organize a join within a foreach?

2023-04-26 Thread Mich Talebzadeh
Indeed, very valid points by Ayan. How is email going to handle 1000s of records? As a solution architect, I tend to replace users with customers, and for each order there must be products, a sort of many-to-many relationship. If I were a customer, I would also be interested in product details as

Re: What is the best way to organize a join within a foreach?

2023-04-26 Thread ayan guha
Adding to what Mich said: 1. Are you trying to send statements of all orders to all users, or the latest order only? 2. Sending email is not a good use of Spark. Instead, I suggest using a notification service or function. Spark should write to a queue (Kafka, SQS... pick your choice here).

Re: What is the best way to organize a join within a foreach?

2023-04-26 Thread Mich Talebzadeh
Well, OK. In a nutshell, you want the result set for every user prepared and emailed to that user, right? This is a form of ETL where those result sets need to be posted somewhere. Say you create a table based on the result set prepared for each user. You may have many raw target tables at the end of

Re: What is the best way to organize a join within a foreach?

2023-04-25 Thread Marco Costantini
Hi Mich, First, thank you for that. Great effort put into helping. Second, I don't think this tackles the technical challenge here. I understand the windowing as it serves those ranks you created, but I don't see how the ranks contribute to the solution. Third, the core of the challenge is about

Re: What is the best way to organize a join within a foreach?

2023-04-25 Thread Mich Talebzadeh
Hi Marco, First thoughts. foreach() is an action that iterates over each element in the dataset, meaning it is cursor-based. That is different from operating over the dataset as a set, which is far more efficient. So in your case, as I understand it, you want to get order
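
To illustrate the set-based alternative to a per-row foreach, here is a rough sketch that joins the two tables once and collects each user's orders into a single row. The column names (user_id, email, order_id, amount), the sample rows, and the helpers format_statement and build_statements are assumptions for illustration only. pyspark is imported lazily so the formatting helper can be used on its own.

```python
def format_statement(user_id, email, orders) -> str:
    """Summarize one user's orders; `orders` is a list of dicts."""
    total = sum(o["amount"] for o in orders)
    return f"user={user_id} email={email} orders={len(orders)} total={total:.2f}"


def build_statements(users_df, orders_df):
    """One set-based join and aggregation instead of a lookup per user."""
    from pyspark.sql import functions as F

    return (
        orders_df.join(users_df, on="user_id", how="inner")
        .groupBy("user_id", "email")
        .agg(F.collect_list(F.struct("order_id", "amount")).alias("orders"))
    )


def main() -> None:
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[2]")
        .appName("set-based-sketch")
        .getOrCreate()
    )
    users_df = spark.createDataFrame(
        [(1, "a@example.com"), (2, "b@example.com")], ["user_id", "email"]
    )
    orders_df = spark.createDataFrame(
        [(10, 1, 5.0), (11, 1, 7.5), (20, 2, 3.0)],
        ["order_id", "user_id", "amount"],
    )
    # toLocalIterator streams rows to the driver; this is fine for a small
    # demo, but large outputs are better handled on the executors.
    for row in build_statements(users_df, orders_df).toLocalIterator():
        orders = [o.asDict() for o in row["orders"]]
        print(format_statement(row["user_id"], row["email"], orders))
    spark.stop()
```

The join and aggregation run as one distributed job, so the per-user work that remains is only the final formatting or output step.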

Re: What is the best way to organize a join within a foreach?

2023-04-25 Thread Marco Costantini
Thanks Mich, Great idea. I have done it. Those files are attached. I'm interested to know your thoughts. Let's imagine this same structure, but with huge amounts of data as well. Please and thank you, Marco. On Tue, Apr 25, 2023 at 12:12 PM Mich Talebzadeh wrote: > Hi Marco, > > Let us start

Re: What is the best way to organize a join within a foreach?

2023-04-25 Thread Mich Talebzadeh
Hi Marco, Let us start simple. Provide a CSV file of 5 rows for the users table. Each row has a unique user_id and one or two other columns, such as a fictitious email. Also, for each user_id, provide 10 rows in the orders table, meaning that the orders table has 5 x 10 rows in total. both as
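
One possible way to produce the sample data requested above (5 users, 10 orders each) using only the Python standard library. The column names, the amount formula, and the helper names make_sample_rows, to_csv, and write_csv are made up for this sketch; the actual files exchanged in the thread are not shown here.

```python
import csv
import io


def make_sample_rows(n_users: int = 5, orders_per_user: int = 10):
    """Build fictitious users and orders as lists of dicts."""
    users = [
        {"user_id": u, "email": f"user{u}@example.com"}
        for u in range(1, n_users + 1)
    ]
    orders = []
    order_id = 1
    for user in users:
        for _ in range(orders_per_user):
            orders.append(
                {
                    "order_id": order_id,
                    "user_id": user["user_id"],
                    # Arbitrary but deterministic amount.
                    "amount": round(10.0 + order_id * 1.5, 2),
                }
            )
            order_id += 1
    return users, orders


def to_csv(rows) -> str:
    """Serialize a non-empty list of dicts to CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def write_csv(path: str, rows) -> None:
    """Write the CSV text to disk; newline='' avoids doubled line endings."""
    with open(path, "w", newline="") as f:
        f.write(to_csv(rows))
```

For example, calling write_csv("users.csv", users) and write_csv("orders.csv", orders) on the output of make_sample_rows() gives two files that Spark can read with spark.read.csv(path, header=True, inferSchema=True).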

Re: What is the best way to organize a join within a foreach?

2023-04-25 Thread Marco Costantini
Thanks Mich, I have not but I will certainly read up on this today. To your point that all of the essential data is in the 'orders' table; I agree! That distills the problem nicely. Yet, I still have some questions on which someone may be able to shed some light. 1) If my 'orders' table is very

Re: What is the best way to organize a join within a foreach?

2023-04-25 Thread Mich Talebzadeh
Have you thought of using window functions to achieve this? Effectively all your information is in the orders table. HTH Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir Technologies Limited London United
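
As a hedged illustration of the window-function suggestion, the sketch below numbers each user's orders from newest to oldest using only the orders table. The column names (user_id, order_id, order_date), the sample rows, and the helper names are assumptions; the thread does not show the actual query. latest_per_user is a plain-Python reference for what filtering on rn == 1 is expected to select, and pyspark is imported lazily.

```python
def latest_per_user(orders):
    """Plain-Python reference: newest order per user (ISO date strings)."""
    latest = {}
    for o in orders:
        cur = latest.get(o["user_id"])
        if cur is None or o["order_date"] > cur["order_date"]:
            latest[o["user_id"]] = o
    return latest


def with_order_rank(orders_df):
    """Add rn = 1, 2, ... per user, ordered from newest to oldest order."""
    from pyspark.sql import Window
    from pyspark.sql import functions as F

    w = Window.partitionBy("user_id").orderBy(F.col("order_date").desc())
    return orders_df.withColumn("rn", F.row_number().over(w))


def main() -> None:
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (
        SparkSession.builder.master("local[2]")
        .appName("window-sketch")
        .getOrCreate()
    )
    orders_df = spark.createDataFrame(
        [
            (10, 1, "2023-04-01"),
            (11, 1, "2023-04-20"),
            (20, 2, "2023-04-05"),
        ],
        ["order_id", "user_id", "order_date"],
    )
    ranked = with_order_rank(orders_df)
    # Keep only the most recent order for each user.
    ranked.filter(F.col("rn") == 1).show()
    spark.stop()
```

The window partitions by user_id, so ranking happens within each user's orders without a separate join against the users table.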

What is the best way to organize a join within a foreach?

2023-04-24 Thread Marco Costantini
I have two tables: {users, orders}. In this example, let's say that for each 1 User in the users table, there are 10 Orders in the orders table. I have to use pyspark to generate a statement of Orders for each User. So, a single user will need his/her own list of Orders. Additionally, I need

What is the best way to organize a join within a foreach?

2023-04-24 Thread Marco Costantini
I have two tables: {users, orders}. In this example, let's say that for each 1 User in the users table, there are 10 Orders in the orders table. I have to use pyspark to generate a statement of Orders for each User. So, a single user will need