Hi Bruno, that's really interesting...
So, to use explode, I would have to do a groupBy on countries and a
collect_list on cities, then explode the cities, right? Am I understanding
the idea correctly?
I think this could produce the results I want, but what would be the
behaviour under the hood?
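The round trip being discussed (group by country, collect the cities, generate an id per group, then explode back to one row per city) can be sketched in plain Python. The snippet below only illustrates the logic, not the Spark API, and all names and data in it are made up:

```python
from collections import defaultdict

# Denormalized rows: (country, city) -- hypothetical sample data
rows = [("UK", "London"), ("UK", "Leeds"), ("FR", "Paris")]

# Step 1 -- "groupBy(country) + collect_list(city)": one entry per country
grouped = defaultdict(list)
for country, city in rows:
    grouped[country].append(city)

# Step 2 -- assign an id per country on the grouped, one-row-per-country data
country_id = {country: cid for cid, country in enumerate(sorted(grouped))}

# Step 3 -- "explode(cities)": flatten back to one row per city, now
# carrying the generated country id
exploded = [(country_id[c], c, city)
            for c in sorted(grouped)
            for city in grouped[c]]
```

In Spark the shuffle for the groupBy and the later explode would be distributed, but the shape of the data at each step is the same as above.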
Hi Marcelo,
Maybe spark.sql.functions.explode gives what you need?
// Bruno
> On 6 June 2019, at 16:02, Marcelo Valle wrote:
>
> Generating the city id (child) is easy, monotonically increasing id worked
> for me.
>
> The problem is the country (parent) which has to be in both countries and
> cities data frames.
>
> Regards,
>
> Magnus
> --
> *From:* Marcelo Valle
> *Sent:* Thursday, June 6, 2019 16:02
> *To:* Magnus Nilsson
> *Cc:* user @spark
> *Subject:* Re: adding a column to a groupBy (dataframe)
>
> Generating the city id (child) is easy, monotonically increasing id worked
> for me.
Generating the city id (child) is easy, monotonically increasing id worked
for me.
The problem is the country (parent) which has to be in both countries and
cities data frames.
On Thu, 6 Jun 2019 at 14:57, Magnus Nilsson wrote:
> Well, you could do a repartition on cityname/nrOfCities and use the
> spark_partition_id function or mapPartitionsWithIndex to add a city Id
> column.
Well, you could do a repartition on cityname/nrOfCities and use the
spark_partition_id function or the mapPartitionsWithIndex method
to add a city Id column. Then just split the dataframe into two subsets. Be
careful of hash collisions on the repartition key though, or more than one
city might end up with the same Id.
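The repartition idea can be illustrated outside Spark: a deterministic hash of the city name picks a bucket, and the bucket index plays the role of spark_partition_id after a repartition on that key. This is only a plain-Python sketch of the logic; the number of partitions and the data are hypothetical:

```python
import zlib

# Deterministic stand-in for hash partitioning: every row with the same city
# name lands in the same "partition", whose index becomes the city id.
N = 8  # hypothetical number of partitions
cities = ["London", "Paris", "Berlin", "Madrid"]
city_id = {c: zlib.crc32(c.encode()) % N for c in cities}

# The caveat from the thread: two different city names can hash to the same
# bucket (a collision), in which case two cities would share an id.
```

This is why the thread warns about collisions: the id is only unique per city if no two distinct city names map to the same bucket.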
Akshay,
First of all, thanks for the answer. I *am* using monotonically increasing
id, but that's not my problem.
My problem is I want to output 2 tables from 1 data frame: 1 parent table
with an ID for the group by, and 1 child table carrying that parent id,
without the group by.
I was able to solve this
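The two-table split described here, one parent row per country with a generated id and one child row per city carrying that id, can be sketched in plain Python (illustrative only, not Spark code; the sample data is made up):

```python
# Denormalized rows: (country, city, city_nickname)
rows = [("UK", "London", "Big Smoke"),
        ("UK", "Leeds", None),
        ("FR", "Paris", "City of Light")]

# Parent table: one generated id per distinct country (the grouped side)
country_id = {c: i for i, c in enumerate(sorted({country for country, _, _ in rows}))}
countries = [(cid, country) for country, cid in country_id.items()]

# Child table: every original row, with the parent id in place of the country
cities = [(country_id[country], city, nick) for country, city, nick in rows]
```

In Spark the parent side would come from a distinct/groupBy on COUNTRY plus an id column, and the child side from joining that small parent frame back onto the original dataframe.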
Additionally, there is a "uuid" function available as well, if that helps your
use case.
Akshay Bhardwaj
+91-97111-33849
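A plain-Python sketch of the uuid idea, using the standard library rather than Spark's SQL function (the data is made up): each group gets an opaque, globally unique id with no ordering or shuffle required, at the cost of string ids instead of compact integers.

```python
import uuid

# One globally unique id per country; uniqueness does not depend on
# partitioning, ordering, or any shuffle.
countries = ["UK", "FR", "BR"]
country_id = {c: str(uuid.uuid4()) for c in countries}
```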
On Thu, Jun 6, 2019 at 3:18 PM Akshay Bhardwaj <
akshay.bhardwaj1...@gmail.com> wrote:
> Hi Marcelo,
>
> If you are using Spark 2.3+ and the Dataset API/Spark SQL, you can use this
>
Hi Marcelo,
If you are using Spark 2.3+ and the Dataset API/Spark SQL, you can use the
inbuilt function "monotonically_increasing_id" in Spark.
A little tweaking using Spark SQL's inbuilt functions can enable you to
achieve this without having to write code or define RDDs with map/reduce
functions.
Akshay
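For intuition, monotonically_increasing_id is documented to put the partition id in the upper 31 bits of a 64-bit integer and the record number within the partition in the lower 33 bits, so ids are unique and increasing but not consecutive. A toy plain-Python mimic of that layout:

```python
# Toy mimic of monotonically_increasing_id: partition id in the upper bits,
# per-partition record number in the lower 33 bits.
def mono_ids(partitions):
    ids = []
    for pid, part in enumerate(partitions):
        for rec_no, _row in enumerate(part):
            ids.append((pid << 33) | rec_no)
    return ids
```

For example, two partitions of sizes 2 and 1 yield ids 0, 1, and 2**33, which is why the generated ids jump between partitions.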
Hi all,
I am new to Spark and I am trying to write an application that uses
dataframes to normalize data.
So I have a dataframe `denormalized_cities` with 3 columns: COUNTRY, CITY,
CITY_NICKNAME
Here is what I want to do:
1. Map by country, then for each country generate a new ID and write to a
countries table.