Re: adding a column to a groupBy (dataframe)

Bruno Nassivet Thu, 06 Jun 2019 12:58:33 -0700

Hi Marcelo,

Maybe the `explode` function (in `org.apache.spark.sql.functions`, or `pyspark.sql.functions` from Python) gives you what you need?


// Bruno


> On 6 June 2019, at 16:02, Marcelo Valle <marcelo.va...@ktech.com> wrote:
> 
> Generating the city ID (child) is easy; `monotonically_increasing_id` worked 
> for me. 
> 
> The problem is the country ID (parent), which has to appear in both the 
> countries and cities dataframes.
> 
> 
> 
> On Thu, 6 Jun 2019 at 14:57, Magnus Nilsson <ma...@kth.se> wrote:
> Well, you could repartition on cityname/nrOfCities and use the 
> `spark_partition_id` function or the RDD method `mapPartitionsWithIndex` to 
> add a city ID column. Then just split the dataframe into two subsets. Be 
> careful of hash collisions on the repartition key though, or more than one 
> city might end up in the same partition (you can use a custom partitioner).
> 
> It all depends on what kind of ID you want/need for the city value, i.e. will 
> you later need to append new city IDs or not? Do you always process the 
> entire dataset when you make this change, or not?
> 
> On the other hand, getting a distinct list of city names is a fast operation 
> that shuffles only the per-partition distinct values rather than the whole 
> dataset. Add a `row_number` column to it, do a broadcast join with the 
> original dataset, and then split into two subsets. That is probably a bit 
> faster than reshuffling the entire dataframe. As always, the proof is in the 
> pudding.
> 
> //Magnus
> 
> On Thu, Jun 6, 2019 at 2:53 PM Marcelo Valle <marcelo.va...@ktech.com> wrote:
> Akshay, 
> 
> First of all, thanks for the answer. I *am* using 
> `monotonically_increasing_id`, but that's not my problem. 
> My problem is that I want to output two tables from one dataframe: a parent 
> table with an ID for each group-by key, and a child table that keeps every 
> original (ungrouped) row and carries the parent ID.
> 
> I was able to solve this problem by grouping by, generating a parent data 
> frame with an id, then joining the parent dataframe with the original one to 
> get a child dataframe with a parent id. 
> 
> I would like to find a solution without this second join, though.
> 
> Thanks,
> Marcelo.
> 
> 
> On Thu, 6 Jun 2019 at 10:49, Akshay Bhardwaj <akshay.bhardwaj1...@gmail.com> wrote:
> Hi Marcelo,
> 
> If you are using Spark 2.3+ and the Dataset API/Spark SQL, you can use the 
> built-in function `monotonically_increasing_id`.
> A little tweaking with Spark SQL's built-in functions should let you achieve 
> this without having to write custom code or define RDDs with map/reduce 
> functions.
> 
> Akshay Bhardwaj
> +91-97111-33849
> 
> 
> On Thu, May 30, 2019 at 4:05 AM Marcelo Valle <marcelo.va...@ktech.com> wrote:
> Hi all, 
> 
> I am new to Spark and I am trying to write an application using dataframes 
> that normalizes data. 
> 
> So I have a dataframe `denormalized_cities` with 3 columns:  COUNTRY, CITY, 
> CITY_NICKNAME
> 
> Here is what I want to do: 
> 
> 1. Map by country, then for each country generate a new ID and write to a new 
>    dataframe `countries`, which would have COUNTRY_ID, COUNTRY. The country ID 
>    would be generated, probably using `monotonically_increasing_id`.
> 2. For each country, write several rows to a new dataframe `cities`, which 
>    would have COUNTRY_ID, ID, CITY, CITY_NICKNAME. COUNTRY_ID would be the same 
>    one generated for the countries table, and ID would be another ID I generate. 
> 
> What's the best way to do this, hopefully using only dataframes (no low-level 
> RDDs) unless that's not possible?
> 
> I clearly see a Map/Reduce process where, for each key mapped, I generate a 
> row in the countries table with COUNTRY_ID, and for every value I write a row 
> in the cities table. But how can I implement this in an easy and efficient way? 
> 
> I thought about using a `GroupBy Country` and then using `collect` to collect 
> all values for that country, but then I don't know how to generate the 
> country ID, and I am not sure about the memory efficiency of `collect` for a 
> country with too many cities (bear in mind country/city is just an example; 
> my real entities are different).
> 
> Could anyone point me to the direction of a good solution?
> 
> Thanks,
> Marcelo.
> 
> This email is confidential [and may be protected by legal privilege]. If you 
> are not the intended recipient, please do not copy or disclose its content 
> but contact the sender immediately upon receipt.
> 
> KTech Services Ltd is registered in England as company number 10704940.
> 
> Registered Office: The River Building, 1 Cousin Lane, London EC4R 3TE, United 
> Kingdom
>
