You can partitioned on the first n letters of userid
On 17 November 2016 at 08:25, titli batali <[email protected]> wrote:
> Hi,
>
> I have a use case, where we have 1000 csv files with a column user_Id,
> having 8 million unique users. The data contains: userid,date,transaction,
> where we run some queries.
>
> We have a case where we need to iterate for each transaction in a
> particular date for each user. There is three nesting loops
>
> for(user){
> for(date){
> for(transactions){
> //Do Something
> }
> }
> }
>
> i.e we do similar thing for every (date,transaction) tuple for a
> particular user. In order to get away with loop structure and decrease the
> processing time We are converting converting the csv files to parquet and
> partioning it with userid, df.write.format("parquet").
> partitionBy("useridcol").save("hdfs://path").
>
> So that while reading the parquet files, we read a particular user in a
> particular partition and create a Cartesian product of (date X transaction)
> and work on the tuple in each partition, to achieve the above level of
> nesting. Partitioning on 8 million users is it a bad option. What could be
> a better way to achieve this?
>
> Thanks
>
>
>