Re: Pyspark Partitioning

2018-10-04 Thread Vitaliy Pisarev
Groupby is an operator you would use if you wanted to *aggregate* the values that are grouped by the specified key. In your case you want to retain access to the values. You need to do df.partitionBy and then you can map the partitions. Of course you need to be careful of potential skews in the
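
A minimal sketch of the approach described above, using the columns from the question (Group_Id, Object_Id, Trajectory); partitionBy and mapPartitions live on the RDD API, so the DataFrame is first turned into a keyed RDD. The per-partition processing here is only a placeholder.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# example data mirroring the question
data_df = spark.createDataFrame(
    [(1, "obj1", "Traj1"), (2, "obj2", "Traj2"), (1, "obj3", "Traj3")],
    ["Group_Id", "Object_Id", "Trajectory"],
)

# key each row by Group_Id so partitionBy can distribute by key
keyed = data_df.rdd.map(lambda row: (row["Group_Id"], row))

# one partition per distinct key is a simple starting point; skewed groups
# will still produce unevenly sized partitions
num_parts = keyed.keys().distinct().count()
partitioned = keyed.partitionBy(num_parts)

def process_partition(pairs):
    # pairs is an iterator of (Group_Id, Row) tuples for this partition
    for key, row in pairs:
        yield (key, row["Trajectory"])

result = partitioned.mapPartitions(process_partition).collect()
print(result)
```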

Pyspark Partitioning

2018-10-04 Thread dimitris plakas
Hello everyone, here is an issue that I am facing in partitioning a dataframe. I have a dataframe called data_df. It looks like:

Group_Id | Object_Id | Trajectory
1        | obj1      | Traj1
2        | obj2      | Traj2
1        | obj3      | Traj3
3        |

Re: Pyspark Partitioning

2018-10-01 Thread Gourav Sengupta
Hi, the simplest option is to create UDFs of these different functions and then use a case statement (or similar) in SQL and pass it on. But this is low tech; in case you have conditions based on record values which are even more granular, why not use a single UDF, and then let conditions handle
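
A minimal sketch of that idea, assuming two hypothetical per-group functions (process_a, process_b) registered as SQL UDFs and dispatched with a CASE statement on Group_Id:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, "Traj1"), (2, "Traj2")],
    ["Group_Id", "Trajectory"],
)

# two placeholder per-group functions registered as SQL UDFs
spark.udf.register("process_a", lambda t: t.upper(), StringType())
spark.udf.register("process_b", lambda t: t.lower(), StringType())

df.createOrReplaceTempView("data")
result = spark.sql("""
    SELECT Group_Id,
           CASE WHEN Group_Id = 1 THEN process_a(Trajectory)
                ELSE process_b(Trajectory)
           END AS processed
    FROM data
""")
result.show()
```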

Re: Pyspark Partitioning

2018-09-30 Thread ayan guha
Hi, there is a set of functions which can be used with the construct OVER (PARTITION BY col ORDER BY col). Search for rank and window functions in the Spark documentation. On Mon, 1 Oct 2018 at 5:29 am, Riccardo Ferrari wrote: > Hi Dimitris, > > I believe the methods partitionBy >
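
A minimal sketch of a window function over the question's data, assuming ordering within each Group_Id by Object_Id (the ordering column is an assumption):

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import rank

spark = SparkSession.builder.getOrCreate()

data_df = spark.createDataFrame(
    [(1, "obj1", "Traj1"), (2, "obj2", "Traj2"), (1, "obj3", "Traj3")],
    ["Group_Id", "Object_Id", "Trajectory"],
)

# rank rows within each Group_Id; ordering by Object_Id is an assumption
w = Window.partitionBy("Group_Id").orderBy("Object_Id")
data_df.withColumn("rank", rank().over(w)).show()
```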

Re: Pyspark Partitioning

2018-09-30 Thread Riccardo Ferrari
Hi Dimitris, I believe the methods partitionBy and mapPartitions are specific to RDDs while you're talking about DataFrames
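
A minimal sketch of that distinction, using a hypothetical data_df with a Group_Id column: repartition works directly on the DataFrame, while mapPartitions requires dropping down to the underlying RDD via .rdd.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data_df = spark.createDataFrame(
    [(1, "a"), (2, "b"), (1, "c")],
    ["Group_Id", "Value"],
)

# DataFrame route: repartition by the grouping column
repartitioned_df = data_df.repartition("Group_Id")

# RDD route: drop down to the RDD API to use mapPartitions
def rows_per_partition(rows):
    yield sum(1 for _ in rows)

counts = repartitioned_df.rdd.mapPartitions(rows_per_partition).collect()
print(counts)
```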

Pyspark Partitioning

2018-09-30 Thread dimitris plakas
Hello everyone, I am trying to split a dataframe into partitions and I want to apply a custom function on every partition. More precisely, I have a dataframe like the one below:

Group_Id | Id  | Points
1        | id1 | Point1
2        | id2 | Point2

I want to have a partition for every
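
A minimal sketch of one way to apply a custom function per Group_Id, assuming the dataframe from the question; custom_function is only a placeholder for the real per-group logic.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, "id1", "Point1"), (2, "id2", "Point2")],
    ["Group_Id", "Id", "Points"],
)

def custom_function(rows):
    # placeholder for the per-group processing; rows is a list of Row objects
    return [r["Points"] for r in rows]

result = (
    df.rdd
      .groupBy(lambda row: row["Group_Id"])                # one group per Group_Id
      .mapValues(lambda rows: custom_function(list(rows)))
      .collect()
)
print(result)
```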