Hi Marcelo,

Maybe spark.sql.functions.explode gives you what you need?
// Bruno

> On 6 June 2019 at 16:02, Marcelo Valle <marcelo.va...@ktech.com> wrote:
>
> Generating the city ID (child) is easy; monotonically_increasing_id worked
> for me.
>
> The problem is the country (parent), which has to be in both the countries
> and cities data frames.
>
> On Thu, 6 Jun 2019 at 14:57, Magnus Nilsson <ma...@kth.se> wrote:
> Well, you could do a repartition on cityname/nrOfCities and use the
> spark_partition_id function or the mapPartitionsWithIndex dataframe method
> to add a city ID column. Then just split the dataframe into two subsets. Be
> careful of hash collisions on the repartition key though, or more than one
> city might end up in the same partition (you can use a custom partitioner).
>
> It all depends on what kind of ID you want/need for the city value, i.e.
> will you later need to append new city IDs or not? Do you always handle the
> entire dataset when you make this change or not?
>
> On the other hand, getting a distinct list of city names is a fast,
> non-shuffling operation: add a row_number column, do a broadcast join with
> the original dataset, and then split into two subsets. Probably a bit faster
> than reshuffling the entire dataframe. As always, the proof is in the
> pudding.
>
> //Magnus
>
> On Thu, Jun 6, 2019 at 2:53 PM Marcelo Valle <marcelo.va...@ktech.com> wrote:
> Akshay,
>
> First of all, thanks for the answer. I *am* using
> monotonically_increasing_id, but that's not my problem.
> My problem is that I want to output 2 tables from 1 data frame: 1 parent
> table with an ID for the group by, and 1 child table with the parent ID but
> without the group by.
>
> I was able to solve this problem by grouping by, generating a parent data
> frame with an ID, then joining the parent dataframe with the original one to
> get a child dataframe with a parent ID.
>
> I would like to find a solution without this second join, though.
>
> Thanks,
> Marcelo.
>
> On Thu, 6 Jun 2019 at 10:49, Akshay Bhardwaj <akshay.bhardwaj1...@gmail.com> wrote:
> Hi Marcelo,
>
> If you are using Spark 2.3+ and the Dataset API/Spark SQL, you can use the
> built-in function "monotonically_increasing_id".
> A little tweaking with Spark SQL built-in functions should let you achieve
> this without having to write code or define RDDs with map/reduce functions.
>
> Akshay Bhardwaj
> +91-97111-33849
>
> On Thu, May 30, 2019 at 4:05 AM Marcelo Valle <marcelo.va...@ktech.com> wrote:
> Hi all,
>
> I am new to Spark and I am trying to write an application using dataframes
> that normalizes data.
>
> So I have a dataframe `denormalized_cities` with 3 columns: COUNTRY, CITY,
> CITY_NICKNAME
>
> Here is what I want to do:
>
> Map by country, then for each country generate a new ID and write to a new
> dataframe `countries`, which would have COUNTRY_ID, COUNTRY. The country ID
> would be generated, probably using `monotonically_increasing_id`.
>
> For each country, write several lines to a new dataframe `cities`, which
> would have COUNTRY_ID, ID, CITY, CITY_NICKNAME. COUNTRY_ID would be the same
> one generated for the countries table, and ID would be another ID I generate.
>
> What's the best way to do this, hopefully using only dataframes (no
> low-level RDDs) unless that's not possible?
>
> I clearly see a map/reduce process where for each key mapped I generate a
> row in the countries table with COUNTRY_ID, and for every value I write a
> row in the cities table. But how do I implement this in an easy and
> efficient way?
>
> I thought about using a `GroupBy Country` and then using `collect` to
> collect all values for that country, but then I don't know how to generate
> the country ID, and I am not sure about the memory efficiency of `collect`
> for a country with too many cities (bear in mind country/city is just an
> example; my real entities are different).
>
> Could anyone point me in the direction of a good solution?
>
> Thanks,
> Marcelo.
>
> This email is confidential [and may be protected by legal privilege]. If you
> are not the intended recipient, please do not copy or disclose its content
> but contact the sender immediately upon receipt.
>
> KTech Services Ltd is registered in England as company number 10704940.
>
> Registered Office: The River Building, 1 Cousin Lane, London EC4R 3TE, United
> Kingdom