Hi,

I have a use case where we have 1000 CSV files with a user_Id column
covering 8 million unique users. Each row contains: userid, date,
transaction, and we run some queries over this data.

We have a case where we need to iterate over every transaction on a
particular date for each user, which means three nested loops:

for (user) {
    for (date) {
        for (transaction) {
            // do something
        }
    }
}

i.e. we do a similar thing for every (date, transaction) tuple for a
particular user. To get away from this loop structure and reduce the
processing time, we are converting the CSV files to Parquet and
partitioning them by userid:

df.write.format("parquet").partitionBy("useridcol").save("hdfs://path")
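
For concreteness, here is a minimal sketch of the conversion step; the
paths, the header option, and the exact column names are just
illustrative:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("csv-to-parquet").getOrCreate()

    // Read the CSV files; assumes they share a header with
    // userid, date, transaction columns.
    val df = spark.read
      .option("header", "true")
      .csv("hdfs://path/to/csv")

    // Write Parquet partitioned by the user id column, which creates
    // one sub-directory per distinct userid value.
    df.write
      .format("parquet")
      .partitionBy("userid")
      .save("hdfs://path/to/parquet")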

The idea is that when reading the Parquet files back, each user's data
sits in its own partition directory; within that partition we build the
Cartesian product of (date x transaction) and work on each tuple, which
reproduces the nesting above (a rough sketch of this is below). Is
partitioning on 8 million distinct users a bad option? What would be a
better way to achieve this?
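
A rough sketch of the read-back step described above; again, the paths,
column names and the example user id are illustrative:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().appName("per-user-tuples").getOrCreate()

    val someUserId = "12345"  // example user id

    // Filtering on the partition column should prune the read to
    // that user's directory only.
    val userDf = spark.read
      .parquet("hdfs://path/to/parquet")
      .where(col("userid") === someUserId)

    // Cartesian product of (date x transaction) for this user.
    val dates = userDf.select("date").distinct()
    val transactions = userDf.select("transaction").distinct()
    val tuples = dates.crossJoin(transactions)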

Thanks
