That would help, but within a particular partition I would still need to iterate over the customers whose user IDs share those first n letters. I want to get rid of the nested iterations.
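For illustration, the per-partition work could collapse the three nested loops into a single grouped pass. This is a minimal plain-Python sketch over hypothetical (userid, date, transaction) rows; in a real Spark job the equivalent would be a groupBy over those columns rather than explicit loops.

```python
from itertools import groupby

# Hypothetical sample rows (userid, date, transaction) standing in for one
# partition's data; a real job would read these from the Parquet files.
rows = [
    ("u1", "2016-11-01", "t1"),
    ("u1", "2016-11-01", "t2"),
    ("u1", "2016-11-02", "t3"),
    ("u2", "2016-11-01", "t4"),
]

def process_partition(rows):
    """One sorted pass replaces the user -> date -> transaction nesting."""
    results = []
    # Sorting by (userid, date) lets groupby emit each (user, date) bucket once.
    for (user, date), group in groupby(sorted(rows), key=lambda r: (r[0], r[1])):
        transactions = [tx for _, _, tx in group]
        results.append((user, date, transactions))  # "Do Something" per bucket
    return results
```

The point of the sketch is that the nesting lives in the grouping key, not in loop structure, so the work is one linear scan per partition.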
Thanks

On Thu, Nov 17, 2016 at 10:28 PM, Xiaomeng Wan <shawn...@gmail.com> wrote:

> You can partition on the first n letters of userid.
>
> On 17 November 2016 at 08:25, titli batali <titlibat...@gmail.com> wrote:
>
>> Hi,
>>
>> I have a use case where we have 1000 CSV files with a column user_Id,
>> having 8 million unique users. The data contains: userid, date,
>> transaction, over which we run some queries.
>>
>> We have a case where we need to iterate over each transaction on a
>> particular date for each user. There are three nested loops:
>>
>> for(user){
>>   for(date){
>>     for(transactions){
>>       //Do Something
>>     }
>>   }
>> }
>>
>> i.e. we do a similar thing for every (date, transaction) tuple for a
>> particular user. In order to get away from the loop structure and
>> decrease the processing time, we are converting the CSV files to Parquet
>> and partitioning them by userid:
>> df.write.format("parquet").partitionBy("useridcol").save("hdfs://path").
>>
>> So while reading the Parquet files, we read a particular user in a
>> particular partition, create a Cartesian product of (date x transaction),
>> and work on the tuples in each partition to achieve the above level of
>> nesting. Is partitioning on 8 million users a bad option? What could be
>> a better way to achieve this?
>>
>> Thanks
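The prefix idea from the quoted reply can be sketched without Spark: instead of one partition per user (8 million partitions), derive a coarser partition key from the first n letters of the user ID. The names below are illustrative, not from the thread; in Spark the same key could be added with something like `withColumn` plus `substring` before `partitionBy`.

```python
# Minimal sketch, assuming string user IDs. With hex-like IDs, n=2 gives
# at most 256 buckets instead of 8 million user-level partitions.
def partition_key(userid: str, n: int = 2) -> str:
    """Coarse partition key: the first n characters of the user ID."""
    return userid[:n]

# Hypothetical IDs to show how users fall into shared buckets.
users = ["ab1234", "ab9999", "cd5678"]
buckets = {}
for u in users:
    buckets.setdefault(partition_key(u), []).append(u)
```

Each bucket still holds many users, which is why the reply at the top notes that some iteration (or a groupBy) is still needed inside each partition.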