You can partition on the first n letters of userid.

On 17 November 2016 at 08:25, titli batali <titlibat...@gmail.com> wrote:
> Hi,
>
> I have a use case where we have 1000 CSV files with a column user_Id,
> containing 8 million unique users. The data contains: userid, date,
> transaction, and we run some queries on it.
>
> We have a case where we need to iterate over each transaction on a
> particular date for each user. There are three nested loops:
>
> for (user) {
>   for (date) {
>     for (transactions) {
>       // Do Something
>     }
>   }
> }
>
> i.e. we do a similar thing for every (date, transaction) tuple for a
> particular user. To get away from the loop structure and decrease the
> processing time, we are converting the CSV files to Parquet and
> partitioning by userid:
>
> df.write.format("parquet").partitionBy("useridcol").save("hdfs://path")
>
> So while reading the Parquet files, we read a particular user from a
> particular partition, create a Cartesian product of (date x transaction),
> and work on the tuples in each partition to achieve the above level of
> nesting. Is partitioning on 8 million users a bad option? What could be
> a better way to achieve this?
>
> Thanks
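To make the suggestion concrete: instead of one partition directory per userid (8 million directories, which overwhelms HDFS and the Hive metastore), derive a coarser partition key from the first n characters of the id. A minimal sketch in plain Python; the function name `partition_key`, the column name `userid`, and n = 2 are illustrative assumptions:

```python
def partition_key(userid: str, n: int = 2) -> str:
    """Return the first n characters of userid, used as the partition value.

    All rows for a given user map to the same key, so per-user processing
    still only has to read a single partition, but the total number of
    partitions is bounded by the number of distinct n-character prefixes
    rather than by the number of users.
    """
    return userid[:n]

# Example: "ab1234" and "ab9999" land in the same partition "ab".
users = ["ab1234", "ab9999", "cd0001", "ce7777"]
keys = {u: partition_key(u) for u in users}
```

In Spark this would mean adding a derived column before writing, for example with `df.withColumn("prefix", substring(col("userid"), 1, 2)).write.format("parquet").partitionBy("prefix").save(...)`, and then filtering on both `prefix` and `userid` when reading so partition pruning still applies.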