That would help but again in a particular partitions i would need to a
iterate over the customers having first n letters of user id in that
partition. I want to get rid of nested iterations.

Thanks

On Thu, Nov 17, 2016 at 10:28 PM, Xiaomeng Wan <shawn...@gmail.com> wrote:

> You can partitioned on the first n letters of userid
>
> On 17 November 2016 at 08:25, titli batali <titlibat...@gmail.com> wrote:
>
>> Hi,
>>
>> I have a use case, where we have 1000 csv files with a column user_Id,
>> having 8 million unique users. The data contains: userid,date,transaction,
>> where we run some queries.
>>
>> We have a case where we need to iterate for each transaction in a
>> particular date for each user. There is three nesting loops
>>
>> for(user){
>>               for(date){
>>                             for(transactions){
>>                               //Do Something
>>                               }
>>                }
>>     }
>>
>> i.e we do similar thing for every (date,transaction) tuple for a
>> particular user. In order to get away with loop structure and decrease the
>> processing time We are converting converting the csv files to parquet and
>> partioning it with userid, df.write.format("parquet").par
>> titionBy("useridcol").save("hdfs://path").
>>
>> So that while reading the parquet files, we read a particular user in a
>> particular partition and create a Cartesian product of (date X transaction)
>> and work on the tuple in each partition, to achieve the above level of
>> nesting. Partitioning on 8 million users is it a bad option. What could be
>> a better way to achieve this?
>>
>> Thanks
>>
>>
>>
>
>

Reply via email to