Re: Partitioning strategy

2017-04-02 Thread Jörn Franke
You can always repartition, but for your use case it may make sense to keep 
several RDDs with the same data but different partitioning strategies. It may 
also make sense to choose an appropriate on-disk format (ORC, Parquet). The 
choice should also be based on the users' non-functional requirements.
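
For illustration, a minimal sketch of the on-disk part of this advice, assuming 
Spark 2.x DataFrames; the column names and paths are made up for the example, 
not taken from this thread:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, month, year}

val spark = SparkSession.builder().appName("time-partitioned-copy").getOrCreate()

// Hypothetical input: four years of events with an "eventDate" column.
val events = spark.read.parquet("hdfs:///data/events_raw")

// Keep one copy partitioned by year and month; later time-range filters can
// then prune whole directories instead of scanning all partitions.
events
  .withColumn("year", year(col("eventDate")))
  .withColumn("month", month(col("eventDate")))
  .write
  .partitionBy("year", "month")
  .parquet("hdfs:///data/events_by_month")

// Reading back with a time filter touches only the matching directories.
val firstQuarter2016 = spark.read
  .parquet("hdfs:///data/events_by_month")
  .where(col("year") === 2016 && col("month").between(1, 3))

Keeping a second copy keyed differently (for example by user) is the same 
pattern with a different partitionBy column; the cost is the extra storage and 
a second write.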

> On 2. Apr 2017, at 12:32, <jasbir.s...@accenture.com> wrote:
> 
> Hi,
>  
> I have an RDD holding four years' worth of data in, say, 20 partitions. At 
> runtime, the user can select a few months or years of that data. Based on 
> the user's time selection, the RDD is filtered, and further transformations 
> and actions are performed on the filtered RDD. And, as Spark documents, a 
> child RDD inherits its partitions from its parent RDD.
> 
> Therefore, is there any way to decide the partitioning strategy after the 
> filter operation?
>  
> Regards,
> Jasbir Singh
> 
> 


Partitioning strategy

2017-04-02 Thread jasbir.sing
Hi,

I have an RDD holding four years' worth of data in, say, 20 partitions. At 
runtime, the user can select a few months or years of that data. Based on the 
user's time selection, the RDD is filtered, and further transformations and 
actions are performed on the filtered RDD. And, as Spark documents, a child 
RDD inherits its partitions from its parent RDD.

Therefore, is there any way to decide the partitioning strategy after the 
filter operation?
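
For concreteness, a hedged sketch of the behaviour described above and two ways 
to change the layout after the filter; the record type, field names, and 
numbers are illustrative, not from the original post:

import java.time.LocalDate
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Hypothetical record type; the original post does not show the schema.
case class Record(userId: String, date: LocalDate, amount: Double)

val sc = new SparkContext(new SparkConf().setAppName("filter-partitions").setMaster("local[*]"))

// Stand-in for "four years of data in 20 partitions".
val rdd = sc.parallelize(
  (0 until 1000).map(i => Record(s"user$i", LocalDate.of(2014 + i % 4, 1 + i % 12, 1), i.toDouble)),
  numSlices = 20)

// filter() is a narrow transformation: the child RDD keeps the parent's
// 20 partitions even if most of them end up nearly empty.
val filtered = rdd.filter(_.date.getYear == 2016)
println(filtered.getNumPartitions)  // still 20

// After the filter you can still change the layout explicitly:
val shrunk  = filtered.coalesce(4)  // fewer partitions, no shuffle
val rekeyed = filtered.keyBy(_.userId).partitionBy(new HashPartitioner(8))  // new partitioner, with a shuffle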

Regards,
Jasbir Singh





Re: Spark Partitioning Strategy with Parquet

2016-12-30 Thread titli batali
Yeah, it works for me.

Thanks

On Fri, Nov 18, 2016 at 3:08 AM, ayan guha  wrote:

> Hi
>
> I think you can use the map-reduce paradigm here. Create a key from the user
> ID and date, and use the record as the value. Then express your operation
> (the "do something" part) as a function. If the function meets certain
> criteria, such as being associative and commutative (say, addition or
> multiplication), you can use reduceByKey; otherwise you may use groupByKey.
>
> HTH
> On 18 Nov 2016 06:45, "titli batali"  wrote:
>
>>
>> That would help, but within a particular partition I would still need to
>> iterate over the customers whose user IDs share those first n letters. I
>> want to get rid of nested iterations.
>>
>> Thanks
>>
>> On Thu, Nov 17, 2016 at 10:28 PM, Xiaomeng Wan 
>> wrote:
>>
>>> You can partition on the first n letters of the user ID.
>>>
>>> On 17 November 2016 at 08:25, titli batali 
>>> wrote:
>>>
 Hi,

 I have a use case where we have 1000 CSV files with a user_Id column and
 8 million unique users. The data contains userid, date, and transaction
 columns, over which we run some queries.

 We have a case where we need to iterate over each transaction on a
 particular date for each user. There are three nested loops:

 for (user) {
   for (date) {
     for (transaction) {
       // Do something
     }
   }
 }

 i.e. we do a similar thing for every (date, transaction) tuple for a
 particular user. To get rid of the loop structure and decrease the
 processing time, we are converting the CSV files to Parquet and partitioning
 them by user ID:
 df.write.format("parquet").partitionBy("useridcol").save("hdfs://path").

 That way, while reading the Parquet files, we read a particular user from a
 particular partition, create a Cartesian product of (date x transaction), and
 work on each tuple within that partition to achieve the above level of
 nesting. Is partitioning on 8 million users a bad option? What could be a
 better way to achieve this?

 Thanks



>>>
>>>
>>


Re: Spark Partitioning Strategy with Parquet

2016-11-17 Thread ayan guha
Hi

I think you can use the map-reduce paradigm here. Create a key from the user
ID and date, and use the record as the value. Then express your operation
(the "do something" part) as a function. If the function meets certain
criteria, such as being associative and commutative (say, addition or
multiplication), you can use reduceByKey; otherwise you may use groupByKey.
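
A minimal sketch of this suggestion against the thread's userid/date/transaction
data, assuming the per-key operation is summing transaction amounts (the
operation, field names, and sample values are assumptions):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("key-by-user-date").setMaster("local[*]"))

// Stand-in for the parsed (userid, date, transaction amount) records.
val records = sc.parallelize(Seq(
  ("u001", "2016-11-01", 10.0),
  ("u001", "2016-11-01", 5.0),
  ("u002", "2016-11-02", 7.5)))

// Key by (user ID, date); the record (here just the amount) is the value.
val keyed = records.map { case (user, date, amount) => ((user, date), amount) }

// Addition is associative and commutative, so reduceByKey can combine values
// map-side before the shuffle.
val totals = keyed.reduceByKey(_ + _)

// If the per-(user, date) logic cannot be expressed as such a reduction,
// groupByKey hands you all values for a key, at the cost of shuffling them all.
val perKey = keyed.groupByKey().mapValues(_.toList)

totals.collect().foreach(println)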

HTH
On 18 Nov 2016 06:45, "titli batali"  wrote:

>
> That would help, but within a particular partition I would still need to
> iterate over the customers whose user IDs share those first n letters. I
> want to get rid of nested iterations.
>
> Thanks
>
> On Thu, Nov 17, 2016 at 10:28 PM, Xiaomeng Wan  wrote:
>
>> You can partition on the first n letters of the user ID.
>>
>> On 17 November 2016 at 08:25, titli batali  wrote:
>>
>>> Hi,
>>>
>>> I have a use case where we have 1000 CSV files with a user_Id column and
>>> 8 million unique users. The data contains userid, date, and transaction
>>> columns, over which we run some queries.
>>>
>>> We have a case where we need to iterate over each transaction on a
>>> particular date for each user. There are three nested loops:
>>>
>>> for (user) {
>>>   for (date) {
>>>     for (transaction) {
>>>       // Do something
>>>     }
>>>   }
>>> }
>>>
>>> i.e. we do a similar thing for every (date, transaction) tuple for a
>>> particular user. To get rid of the loop structure and decrease the
>>> processing time, we are converting the CSV files to Parquet and
>>> partitioning them by user ID:
>>> df.write.format("parquet").partitionBy("useridcol").save("hdfs://path").
>>>
>>> That way, while reading the Parquet files, we read a particular user from
>>> a particular partition, create a Cartesian product of (date x transaction),
>>> and work on each tuple within that partition to achieve the above level of
>>> nesting. Is partitioning on 8 million users a bad option? What could be a
>>> better way to achieve this?
>>>
>>> Thanks
>>>
>>>
>>>
>>
>>
>


Re: Spark Partitioning Strategy with Parquet

2016-11-17 Thread titli batali
That would help, but within a particular partition I would still need to
iterate over the customers whose user IDs share those first n letters. I want
to get rid of nested iterations.
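
One hedged way to drop the explicit iteration even inside a prefix partition is
to express the per-(user, date) work as a grouped aggregation and let Spark
drive the loop; a sketch with made-up rows and column meanings (userid, date,
amount), where the per-group body is only a placeholder:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("per-user-work").master("local[*]").getOrCreate()
import spark.implicits._

// Stand-in for the parsed CSV rows: (userid, date, amount).
val txns = Seq(
  ("abc123", "2016-11-01", 10.0),
  ("abc456", "2016-11-01", 4.0),
  ("abd789", "2016-11-02", 2.5)).toDS()

// Group by the full (userid, date) key; the "do something" per group becomes a
// function over that group's rows, so no hand-written nested loops remain.
val perUserDate = txns
  .groupByKey { case (user, date, _) => (user, date) }
  .mapGroups { (key, rows) =>
    val (user, date) = key
    (user, date, rows.map(_._3).sum)  // placeholder for the real per-group work
  }

perUserDate.show()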

Thanks

On Thu, Nov 17, 2016 at 10:28 PM, Xiaomeng Wan  wrote:

> You can partition on the first n letters of the user ID.
>
> On 17 November 2016 at 08:25, titli batali  wrote:
>
>> Hi,
>>
>> I have a use case where we have 1000 CSV files with a user_Id column and
>> 8 million unique users. The data contains userid, date, and transaction
>> columns, over which we run some queries.
>>
>> We have a case where we need to iterate over each transaction on a
>> particular date for each user. There are three nested loops:
>>
>> for (user) {
>>   for (date) {
>>     for (transaction) {
>>       // Do something
>>     }
>>   }
>> }
>>
>> i.e. we do a similar thing for every (date, transaction) tuple for a
>> particular user. To get rid of the loop structure and decrease the
>> processing time, we are converting the CSV files to Parquet and
>> partitioning them by user ID:
>> df.write.format("parquet").partitionBy("useridcol").save("hdfs://path").
>>
>> That way, while reading the Parquet files, we read a particular user from a
>> particular partition, create a Cartesian product of (date x transaction),
>> and work on each tuple within that partition to achieve the above level of
>> nesting. Is partitioning on 8 million users a bad option? What could be a
>> better way to achieve this?
>>
>> Thanks
>>
>>
>>
>
>


Re: Spark Partitioning Strategy with Parquet

2016-11-17 Thread Xiaomeng Wan
You can partition on the first n letters of the user ID.
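
A minimal sketch of this idea, assuming the DataFrame/Parquet route from the
original question: derive a prefix column and partition the Parquet output on
that instead of on the raw user ID. The column names, prefix length, and paths
are illustrative:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, substring}

val spark = SparkSession.builder().appName("prefix-partitioning").getOrCreate()

// Hypothetical input with userid, date, transaction columns.
val df = spark.read.option("header", "true").csv("hdfs:///data/transactions")

// Bucket 8 million user IDs into at most a few thousand prefix directories
// instead of one directory per user.
val withPrefix = df.withColumn("uid_prefix", substring(col("userid"), 1, 3))

withPrefix.write
  .partitionBy("uid_prefix")
  .parquet("hdfs:///data/transactions_by_prefix")

// Reading one bucket later scans only that directory; the full userid is
// still available inside it for grouping.
val bucket = spark.read.parquet("hdfs:///data/transactions_by_prefix")
  .where(col("uid_prefix") === "abc")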

On 17 November 2016 at 08:25, titli batali  wrote:

> Hi,
>
> I have a use case where we have 1000 CSV files with a user_Id column and
> 8 million unique users. The data contains userid, date, and transaction
> columns, over which we run some queries.
>
> We have a case where we need to iterate over each transaction on a
> particular date for each user. There are three nested loops:
>
> for (user) {
>   for (date) {
>     for (transaction) {
>       // Do something
>     }
>   }
> }
>
> i.e. we do a similar thing for every (date, transaction) tuple for a
> particular user. To get rid of the loop structure and decrease the
> processing time, we are converting the CSV files to Parquet and
> partitioning them by user ID:
> df.write.format("parquet").partitionBy("useridcol").save("hdfs://path").
>
> That way, while reading the Parquet files, we read a particular user from a
> particular partition, create a Cartesian product of (date x transaction),
> and work on each tuple within that partition to achieve the above level of
> nesting. Is partitioning on 8 million users a bad option? What could be a
> better way to achieve this?
>
> Thanks
>
>
>


Fwd: Spark Partitioning Strategy with Parquet

2016-11-17 Thread titli batali
Hi,

I have a use case where we have 1000 CSV files with a user_Id column and
8 million unique users. The data contains userid, date, and transaction
columns, over which we run some queries.

We have a case where we need to iterate over each transaction on a
particular date for each user. There are three nested loops:

for (user) {
  for (date) {
    for (transaction) {
      // Do something
    }
  }
}

i.e. we do a similar thing for every (date, transaction) tuple for a
particular user. To get rid of the loop structure and decrease the
processing time, we are converting the CSV files to Parquet and partitioning
them by user ID:
df.write.format("parquet").partitionBy("useridcol").save("hdfs://path").

That way, while reading the Parquet files, we read a particular user from a
particular partition, create a Cartesian product of (date x transaction), and
work on each tuple within that partition to achieve the above level of
nesting. Is partitioning on 8 million users a bad option? What could be a
better way to achieve this?
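
For reference, a hedged sketch of the write/read pattern described above, using
the Spark 2.x DataFrame API; the CSV path, the user ID value, and the
assumption that the partition column in the header is named "useridcol" are
placeholders:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("per-user-parquet").getOrCreate()

// Parse the ~1000 CSV files, assuming a header row that includes a
// "useridcol" column (as in the snippet above) plus date and transaction.
val df = spark.read.option("header", "true").csv("hdfs:///data/transaction_csvs")

// The write described above: one Parquet directory per distinct useridcol value.
df.write.format("parquet").partitionBy("useridcol").save("hdfs://path")

// Reading back with an equality filter on the partition column only lists and
// scans that user's directory (partition pruning).
val oneUser = spark.read.parquet("hdfs://path")
  .where(col("useridcol") === "someUserId")

// Note: with 8 million distinct user IDs this layout means 8 million
// directories of small files, which is why the replies in this thread suggest
// a coarser key such as a user-ID prefix.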

Thanks