Re: Partitioning strategy

2017-04-02 Thread Jörn Franke
> …RDD further transformations and actions are performed. And, as Spark says, a child RDD gets its partitions from the parent RDD.
>
> Therefore, is there any way to decide the partitioning strategy after filter operations?
>
> Regards,
> Jasbir Singh

Partitioning strategy

2017-04-02 Thread jasbir.sing
…a child RDD gets its partitions from the parent RDD. Therefore, is there any way to decide the partitioning strategy after filter operations? Regards, Jasbir Singh
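
For illustration, a minimal standalone sketch of re-deciding the strategy once the filter has run (app name, sizes, and partition counts are made up); filter() itself just reuses the parent RDD's partitioner:

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

    object PartitionAfterFilter {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("partition-after-filter").setMaster("local[*]"))

        // A pair RDD hash-partitioned into 8 partitions.
        val rdd = sc.parallelize(1 to 100000)
          .map(i => (i % 1000, i))
          .partitionBy(new HashPartitioner(8))

        // filter() keeps the parent's partitioner, so the 8 partitions stay
        // in place but may be skewed or nearly empty after dropping rows.
        val filtered = rdd.filter { case (_, v) => v > 99000 }

        // Re-apply a partitioning strategy explicitly after the filter:
        val rebalanced = filtered.partitionBy(new HashPartitioner(2))
        // or shrink without a full shuffle: filtered.coalesce(2)

        println(rebalanced.partitions.length) // 2
        sc.stop()
      }
    }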

Re: Spark Partitioning Strategy with Parquet

2016-12-30 Thread titli batali
Yeah, it works for me. Thanks!

On Fri, Nov 18, 2016 at 3:08 AM, ayan guha wrote:
> Hi,
>
> I think you can use the map-reduce paradigm here. Create a key using user ID and date, with the record as the value. Then you can express your operation (the "do something" part) as a function. If the function meets certain criteria…

Re: Spark Partitioning Strategy with Parquet

2016-11-17 Thread ayan guha
Hi, I think you can use the map-reduce paradigm here. Create a key using user ID and date, with the record as the value. Then you can express your operation (the "do something" part) as a function. If the function meets certain criteria, such as being associative and commutative (like, say, addition or multiplication), you can use…
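
The truncated sentence most plausibly continues with reduceByKey, Spark's combiner-style reduction for associative and commutative functions; a minimal sketch with made-up sample data:

    import org.apache.spark.{SparkConf, SparkContext}

    object PerUserPerDayTotals {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("per-user-per-day").setMaster("local[*]"))

        // (userId, date, transaction amount) records.
        val txns = sc.parallelize(Seq(
          ("u1", "2016-11-17", 10.0),
          ("u1", "2016-11-17", 5.0),
          ("u2", "2016-11-17", 7.5)))

        // Key on (userId, date). Because + is associative and commutative,
        // reduceByKey can pre-combine values map-side before the shuffle.
        val totals = txns
          .map { case (user, date, amount) => ((user, date), amount) }
          .reduceByKey(_ + _)

        totals.collect().foreach(println)
        // ((u1,2016-11-17),15.0)  ((u2,2016-11-17),7.5)
        sc.stop()
      }
    }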

Re: Spark Partitioning Strategy with Parquet

2016-11-17 Thread titli batali
That would help, but then within a particular partition I would still need to iterate over the customers whose user IDs share that partition's first n letters. I want to get rid of the nested iterations. Thanks!

On Thu, Nov 17, 2016 at 10:28 PM, Xiaomeng Wan wrote:
> You can partition on the first n letters…
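
One way to drop the hand-written nesting, sketched with made-up sample data: key each record by (userid, date) and let groupByKey build the per-user, per-date groups:

    import org.apache.spark.{SparkConf, SparkContext}

    object GroupInsteadOfNestedLoops {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("group-instead-of-loops").setMaster("local[*]"))

        // (userid, date, transaction) records, matching the CSV layout
        // described in the original question below.
        val records = sc.parallelize(Seq(
          ("alice01", "2016-11-01", "txn-1"),
          ("alice01", "2016-11-01", "txn-2"),
          ("bob42",   "2016-11-02", "txn-3")))

        // Each group holds exactly one user's transactions for one date,
        // so no manual loop over users inside a partition is needed.
        val grouped = records
          .map { case (user, date, txn) => ((user, date), txn) }
          .groupByKey()

        grouped.foreach { case ((user, date), txns) =>
          txns.foreach(txn => println(s"$user $date $txn")) // the "do something" step
        }
        sc.stop()
      }
    }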

Re: Spark Partitioning Strategy with Parquet

2016-11-17 Thread Xiaomeng Wan
You can partition on the first n letters of userid.

On 17 November 2016 at 08:25, titli batali wrote:
> Hi,
>
> I have a use case where we have 1000 CSV files with a column user_Id, covering 8 million unique users. The data contains userid,date,transaction, and we run some queries over it.
>
> We…
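
A sketch of that suggestion against Parquet (input/output paths and the choice of n = 2 are hypothetical): derive a prefix column from user_Id and use it as the Parquet partition column, so every record for a given user lands in the same output directory:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, substring}

    object PrefixPartitionedParquet {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder
          .appName("prefix-parquet").master("local[*]").getOrCreate()

        // The CSVs carry user_Id, date, transaction columns.
        val df = spark.read.option("header", "true").csv("/data/transactions/*.csv")

        // Partition the Parquet output by the first 2 letters of user_Id.
        df.withColumn("uid_prefix", substring(col("user_Id"), 1, 2))
          .write
          .partitionBy("uid_prefix")
          .parquet("/data/transactions_parquet")

        spark.stop()
      }
    }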

Fwd: Spark Partitioning Strategy with Parquet

2016-11-17 Thread titli batali
Hi, I have a use case where we have 1000 CSV files with a column user_Id, covering 8 million unique users. The data contains userid,date,transaction, and we run some queries over it. We have a case where we need to iterate over each transaction on a particular date for each user. There are three levels of nesting…
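
The same use case expressed with DataFrames, as a rough sketch (paths hypothetical): a single groupBy over userid and date collapses the outer two levels of the nesting:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.collect_list

    object PerUserDateGroups {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder
          .appName("per-user-date").master("local[*]").getOrCreate()

        // Columns: userid, date, transaction.
        val df = spark.read.option("header", "true").csv("/data/transactions/*.csv")

        // One row per (userid, date) carrying all of that user's
        // transactions for the date.
        val perUserDate = df.groupBy("userid", "date")
          .agg(collect_list("transaction").as("transactions"))

        perUserDate.show(truncate = false)
        spark.stop()
      }
    }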