Hi there!
I am trying to optimize joins on data created by Spark, so I'd like to
bucket the data to avoid shuffling.
I write immutable partitions every day by writing the data to a local
HDFS cluster and then copying it to S3. Is there a combination of bucketBy
options and DDL that I can use?
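In case it helps frame the question, here's a minimal sketch of the bucketed-write side, assuming a Hive metastore; the table name, column name, path, and bucket count are all made up:

```
// Sketch only: bucket both join sides by the join key so Spark can skip
// the shuffle. "events", "user_id", and the path are hypothetical.
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("bucketing-sketch")
  .master("local[*]") // for local testing only
  .enableHiveSupport()
  .getOrCreate()

val df = spark.read.parquet("/data/events/2019-01-01") // hypothetical input

df.write
  .mode(SaveMode.Overwrite)
  .bucketBy(64, "user_id") // same bucket count needed on both join sides
  .sortBy("user_id")
  .saveAsTable("events")   // bucket spec is recorded in the metastore
```

The catch relative to your HDFS-then-S3 flow: bucketBy is only honored through saveAsTable, so the bucket spec lives in the metastore rather than in the files themselves, and a plain path-based write ignores it.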
Hi everyone!
I just published this blog post on how Spark Scala custom transformations
can be rearranged to compose better and be used with .transform:
https://medium.com/@dmateusp/dataframe-transform-spark-function-composition-eb8ec296c108
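For anyone who hasn't tried .transform, this is the rough shape of the pattern; a quick sketch with made-up column names, not code lifted from the post:

```
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit, upper}
import spark.implicits._ // assuming an active session, e.g. in spark-shell

val people = Seq("alice", "bob").toDF("name")

// Curried: config parameters first, DataFrame last, so a partial
// application like withGreeting("hi") is a plain DataFrame => DataFrame.
def withGreeting(greeting: String)(df: DataFrame): DataFrame =
  df.withColumn("greeting", lit(greeting))

def withUpperName(df: DataFrame): DataFrame =
  df.withColumn("name_upper", upper(col("name")))

// Chains flat with .transform instead of nesting calls inside out:
val result = people
  .transform(withGreeting("hi"))
  .transform(withUpperName)
```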
I found the discussions in this group to be really helpful!
```
// df2's "value" column holds JSON strings like {"a": "b", "c": "d"}
spark.read.json(df2.select("value").as[String]).show
/*
+---+---+
|  a|  c|
+---+---+
|  b|  d|
+---+---+
*/
```
Ideally I'd like something similar to spark.read.json that would keep the
partitioning values and merge them with the rest of the data.
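One workaround, sketched under assumptions (the "dt" partition column and the two-field schema are made up to match the example output): parse the value column with from_json instead of round-tripping through spark.read.json, since from_json keeps the other columns, partition values included, next to the parsed fields:

```
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Unlike spark.read.json, from_json needs an explicit schema.
val schema = StructType(Seq(
  StructField("a", StringType),
  StructField("c", StringType)
))

df2
  .withColumn("parsed", from_json(col("value"), schema))
  .select("dt", "parsed.*") // partition value merged with the parsed columns
  .show()
```

The downside is you lose the schema inference, which is presumably part of what you want from spark.read.json.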