Hi, I have a use case where we have 1,000 CSV files with a user_Id column covering 8 million unique users. The data contains userid, date, transaction, and we run some queries over it.
We have a case where we need to iterate over each transaction on a particular date for each user, i.e. three nested loops:

    for (user) {
      for (date) {
        for (transaction) {
          // Do something
        }
      }
    }

In other words, we do the same work for every (date, transaction) tuple of a particular user. To get rid of this loop structure and reduce the processing time, we are converting the CSV files to Parquet and partitioning them by user id:

    df.write.format("parquet").partitionBy("useridcol").save("hdfs://path")

The idea is that while reading the Parquet files we read a particular user from a particular partition, build the Cartesian product of (date × transaction), and work on each tuple within that partition, achieving the same level of nesting as above. A minimal sketch of that read path is at the end of this post.

Is partitioning on 8 million unique users a bad option? What could be a better way to achieve this? Thanks.
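For clarity, here is the read-side sketch I have in mind, in Scala. The user id value someUserId is hypothetical, and I'm assuming an explicit crossJoin is the right way to form the (date × transaction) product:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("per-user-tuples").getOrCreate()
    import spark.implicits._

    // Filtering on the partition column should prune the scan down to the
    // single directory written for that user.
    val user = spark.read.parquet("hdfs://path")
      .filter($"useridcol" === "someUserId")

    // Cartesian product of (date x transaction) for this one user.
    val tuples = user.select("date").distinct()
      .crossJoin(user.select("transaction"))

    // Equivalent of the two inner loops of the original nesting;
    // the body runs on the executors.
    tuples.foreach { row =>
      // Do something with each (date, transaction) tuple
    }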