Hi Marcelo,

If you are using Spark 2.3+ with the Dataset API or Spark SQL, you can use the built-in function "monotonically_increasing_id". With a little tweaking, Spark SQL's built-in functions let you achieve this without writing custom code or defining RDDs with map/reduce functions.
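
Here is a minimal sketch of that idea in Scala, not a definitive implementation. The column names come from your message. The object name, the local SparkSession settings, and the sample rows are assumptions made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.monotonically_increasing_id

// Hypothetical driver object; the name is illustrative only.
object NormalizeCities {
  def main(args: Array[String]): Unit = {
    // Local session for illustration; a real job would bring its own config.
    val spark = SparkSession.builder()
      .appName("normalize-cities")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample rows standing in for `denormalized_cities`.
    val denormalizedCities = Seq(
      ("Brazil", "Sao Paulo", "Sampa"),
      ("Brazil", "Rio de Janeiro", "Cidade Maravilhosa"),
      ("UK", "London", "The Big Smoke")
    ).toDF("COUNTRY", "CITY", "CITY_NICKNAME")

    // One row per distinct country, each tagged with a generated ID.
    // cache() is best-effort protection against the IDs being recomputed.
    val countries = denormalizedCities
      .select("COUNTRY")
      .distinct()
      .withColumn("COUNTRY_ID", monotonically_increasing_id())
      .select("COUNTRY_ID", "COUNTRY")
      .cache()

    // Join back on COUNTRY so every city row picks up its COUNTRY_ID,
    // then generate a separate ID for each city row.
    val cities = denormalizedCities
      .join(countries, Seq("COUNTRY"))
      .withColumn("ID", monotonically_increasing_id())
      .select("COUNTRY_ID", "ID", "CITY", "CITY_NICKNAME")

    countries.show()
    cities.show()

    spark.stop()
  }
}
```

Two caveats. The generated IDs are unique and increasing but not consecutive, so expect gaps. The function is also non-deterministic, so if `countries` gets recomputed the IDs can change. If both tables must agree, it is safer to write `countries` out and read it back before the join than to rely on the cache.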

Akshay Bhardwaj
+91-97111-33849

On Thu, May 30, 2019 at 4:05 AM Marcelo Valle <marcelo.va...@ktech.com> wrote:

> Hi all,
>
> I am new to Spark and I am trying to write an application using dataframes
> that normalizes data.
>
> So I have a dataframe `denormalized_cities` with 3 columns: COUNTRY,
> CITY, CITY_NICKNAME
>
> Here is what I want to do:
>
> 1. Map by country, then for each country generate a new ID and write
>    to a new dataframe `countries`, which would have COUNTRY_ID, COUNTRY.
>    The country ID would be generated, probably using
>    `monotonically_increasing_id`.
> 2. For each country, write several lines to a new dataframe `cities`,
>    which would have COUNTRY_ID, ID, CITY, CITY_NICKNAME. COUNTRY_ID would
>    be the same one generated for the countries table, and ID would be
>    another ID I generate.
>
> What's the best way to do this, hopefully using only dataframes (no
> low-level RDDs) unless it's not possible?
>
> I clearly see a map/reduce process where for each KEY mapped I generate a
> row in the countries table with COUNTRY_ID, and for every value I write a
> row in the cities table. But how do I implement this in an easy and
> efficient way?
>
> I thought about using a `GroupBy Country` and then using `collect` to
> collect all values for that country, but then I don't know how to generate
> the country ID, and I am not sure about the memory efficiency of `collect`
> for a country with too many cities (bear in mind country/city is just an
> example, my real entities are different).
>
> Could anyone point me in the direction of a good solution?
>
> Thanks,
> Marcelo.