Hi, I think you can use the map-reduce paradigm here. Create a key from the user ID and date, with the record as the value. Then express your "do something" part as a function. If the function is associative and commutative (say, addition or multiplication), you can use reduceByKey; otherwise use groupByKey.
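A minimal sketch of what I mean, assuming each CSV line is "userid,date,amount" and an existing SparkContext `sc` (e.g. in spark-shell); the field positions and the amount column are illustrative, not taken from your actual schema:

    // Key each record by (userId, date); the value here is an
    // assumed numeric amount parsed from the third CSV field.
    val records = sc.textFile("hdfs://path")
      .map(_.split(","))
      .map(f => ((f(0), f(1)), f(2).toDouble))

    // Associative and commutative op (e.g. sum): reduceByKey combines
    // values map-side before the shuffle, so it scales well.
    val totals = records.reduceByKey(_ + _)

    // Arbitrary per-(user, date) logic: groupByKey ships all values
    // for a key to one task, then you apply your logic per group.
    val perGroup = records.groupByKey().mapValues { txns =>
      txns.size // stand-in for your real "do something"
    }

The practical difference is shuffle volume: reduceByKey pre-aggregates on each mapper, while groupByKey moves every value across the network, so prefer reduceByKey whenever your operation allows it.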
HTH

On 18 Nov 2016 06:45, "titli batali" <titlibat...@gmail.com> wrote:

> That would help, but again, in a particular partition I would need to
> iterate over the customers having the first n letters of user id in that
> partition. I want to get rid of nested iterations.
>
> Thanks
>
> On Thu, Nov 17, 2016 at 10:28 PM, Xiaomeng Wan <shawn...@gmail.com> wrote:
>
>> You can partition on the first n letters of userid.
>>
>> On 17 November 2016 at 08:25, titli batali <titlibat...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I have a use case where we have 1000 csv files with a column user_Id,
>>> having 8 million unique users. The data contains: userid, date,
>>> transaction, on which we run some queries.
>>>
>>> We have a case where we need to iterate over each transaction on a
>>> particular date for each user. There are three nested loops:
>>>
>>> for(user){
>>>   for(date){
>>>     for(transactions){
>>>       //Do Something
>>>     }
>>>   }
>>> }
>>>
>>> i.e. we do a similar thing for every (date, transaction) tuple for a
>>> particular user. In order to get rid of the loop structure and decrease
>>> the processing time, we are converting the csv files to parquet and
>>> partitioning by userid:
>>> df.write.format("parquet").partitionBy("useridcol").save("hdfs://path").
>>>
>>> So while reading the parquet files, we read a particular user in a
>>> particular partition, create a Cartesian product of (date X transaction),
>>> and work on the tuples in each partition to achieve the above level of
>>> nesting. Is partitioning on 8 million users a bad option? What could be
>>> a better way to achieve this?
>>>
>>> Thanks
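To make the prefix-partitioning idea from the thread concrete, a rough Scala sketch, assuming the userid/date/transaction schema described above; the prefix length, column name "uid_prefix", and paths are illustrative assumptions:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, substring}

    val spark = SparkSession.builder.appName("PrefixBuckets").getOrCreate()

    // Assumed schema: userid, date, transaction (per the thread).
    val df = spark.read.option("header", "true").csv("hdfs://path/csv")

    // Partition on a short prefix of userid instead of the full value:
    // a 2-character prefix yields at most a few thousand directories,
    // rather than 8 million (one per user), which would overwhelm HDFS
    // and the driver's partition metadata handling.
    df.withColumn("uid_prefix", substring(col("userid"), 1, 2))
      .write
      .format("parquet")
      .partitionBy("uid_prefix")
      .save("hdfs://path/parquet")

A reader can then load a single bucket and group by (userid, date) within it, which replaces the triple nested loop with one grouped pass over a much smaller slice of the data.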