Hi all,

I am new to Spark and I am trying to write an application using dataframes
that normalizes data.

I have a dataframe `denormalized_cities` with 3 columns: COUNTRY, CITY,
CITY_NICKNAME.

Here is what I want to do:


   1. Map by country, then generate a new ID for each distinct country and
   write the result to a new dataframe `countries` with columns COUNTRY_ID,
   COUNTRY. The country ID would probably be generated using
   `monotonically_increasing_id`.
   2. For each country, write several rows to a new dataframe `cities` with
   columns COUNTRY_ID, ID, CITY, CITY_NICKNAME. COUNTRY_ID would be the same
   value generated for the `countries` table, and ID would be another ID I
   generate.
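
To make the target concrete, here is a tiny plain-Python illustration of the
result I am after. This is not Spark code, just the intended input and output:
the sample rows are made up, and the sequential IDs are only for readability
(as far as I understand, `monotonically_increasing_id` gives unique but not
consecutive values, which is fine for my purposes).

```python
from itertools import count

# Made-up sample rows: (COUNTRY, CITY, CITY_NICKNAME)
denormalized_cities = [
    ("Brazil", "Sao Paulo", "Sampa"),
    ("Brazil", "Rio de Janeiro", "Cidade Maravilhosa"),
    ("France", "Paris", "City of Light"),
]

country_ids = {}            # COUNTRY -> generated COUNTRY_ID
next_country_id = count(1)  # hypothetical ID generators, for illustration only
next_city_id = count(1)

countries = []  # rows of (COUNTRY_ID, COUNTRY)
cities = []     # rows of (COUNTRY_ID, ID, CITY, CITY_NICKNAME)

for country, city, nickname in denormalized_cities:
    # First time a country (the key) is seen: generate its ID, emit one row.
    if country not in country_ids:
        country_ids[country] = next(next_country_id)
        countries.append((country_ids[country], country))
    # Every input row (the value) becomes one city row with the country's ID.
    cities.append((country_ids[country], next(next_city_id), city, nickname))

for row in countries:
    print(row)
for row in cities:
    print(row)
```

The part I do not know how to express with dataframes is the reuse of the
generated COUNTRY_ID across both outputs.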

What's the best way to do this, ideally using only dataframes (no low-level
RDDs), unless that is not possible?

I clearly see a map/reduce process: for each mapped key I generate a row in
the `countries` table with a COUNTRY_ID, and for every value under that key
I write a row in the `cities` table. But how do I implement this in an easy
and efficient way?

I thought about grouping by COUNTRY and then using `collect` to gather all
values for that country, but then I don't know how to generate the country
ID, and I am not sure about the memory efficiency of `collect` for a country
with a very large number of cities (bear in mind country/city is just an
example, my real entities are different).

Could anyone point me in the direction of a good solution?

Thanks,
Marcelo.
