Hi, I think you can use the map-reduce paradigm here. Create a key from the user ID and date, with the record as the value. Then express your "do something" part as a function. If the function is associative and commutative (say, addition or multiplication), you can use reduceByKey; otherwise use groupByKey.
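A minimal sketch of what I mean, assuming each CSV line is "userid,date,amount" and an existing SparkContext `sc` (e.g. in spark-shell); the field positions and the amount column are illustrative, not taken from your actual schema:

    // Key each record by (userId, date); the value here is an
    // assumed numeric amount parsed from the third CSV field.
    val records = sc.textFile("hdfs://path")
      .map(_.split(","))
      .map(f => ((f(0), f(1)), f(2).toDouble))

    // Associative and commutative op (e.g. sum): reduceByKey combines
    // values map-side before the shuffle, so it scales well.
    val totals = records.reduceByKey(_ + _)

    // Arbitrary per-(user, date) logic: groupByKey ships all values
    // for a key to one task, then you apply your logic per group.
    val perGroup = records.groupByKey().mapValues { txns =>
      txns.size // stand-in for your real "do something"
    }

The practical difference is shuffle volume: reduceByKey pre-aggregates on each mapper, while groupByKey moves every value across the network, so prefer reduceByKey whenever your operation allows it.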
HTH

On 18 Nov 2016 06:45, "titli batali" <titlibat...@gmail.com> wrote:

> That would help, but again, in a particular partition I would need to
> iterate over the customers having the first n letters of user id in that
> partition. I want to get rid of nested iterations.
>
> Thanks
>
> On Thu, Nov 17, 2016 at 10:28 PM, Xiaomeng Wan <shawn...@gmail.com> wrote:
>
>> You can partition on the first n letters of userid.
>>
>> On 17 November 2016 at 08:25, titli batali <titlibat...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I have a use case where we have 1000 csv files with a column user_Id,
>>> having 8 million unique users. The data contains: userid, date,
>>> transaction, on which we run some queries.
>>>
>>> We have a case where we need to iterate over each transaction on a
>>> particular date for each user. There are three nested loops:
>>>
>>> for(user){
>>>   for(date){
>>>     for(transactions){
>>>       //Do Something
>>>     }
>>>   }
>>> }
>>>
>>> i.e. we do a similar thing for every (date, transaction) tuple for a
>>> particular user. In order to get rid of the loop structure and decrease
>>> the processing time, we are converting the csv files to parquet and
>>> partitioning by userid:
>>> df.write.format("parquet").partitionBy("useridcol").save("hdfs://path").
>>>
>>> So while reading the parquet files, we read a particular user in a
>>> particular partition, create a Cartesian product of (date X transaction),
>>> and work on the tuples in each partition to achieve the above level of
>>> nesting. Is partitioning on 8 million users a bad option? What could be
>>> a better way to achieve this?
>>>
>>> Thanks
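To make the prefix-partitioning idea from the thread concrete, a rough Scala sketch, assuming the userid/date/transaction schema described above; the prefix length, column name "uid_prefix", and paths are illustrative assumptions:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, substring}

    val spark = SparkSession.builder.appName("PrefixBuckets").getOrCreate()

    // Assumed schema: userid, date, transaction (per the thread).
    val df = spark.read.option("header", "true").csv("hdfs://path/csv")

    // Partition on a short prefix of userid instead of the full value:
    // a 2-character prefix yields at most a few thousand directories,
    // rather than 8 million (one per user), which would overwhelm HDFS
    // and the driver's partition metadata handling.
    df.withColumn("uid_prefix", substring(col("userid"), 1, 2))
      .write
      .format("parquet")
      .partitionBy("uid_prefix")
      .save("hdfs://path/parquet")

A reader can then load a single bucket and group by (userid, date) within it, which replaces the triple nested loop with one grouped pass over a much smaller slice of the data.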