Creating Spark buckets that Presto / Athena / Hive can leverage

2019-06-15 Thread Daniel Mateus Pires
Hi there! I am trying to optimize joins on data created by Spark, so I'd like to bucket the data to avoid shuffling. I write to immutable partitions every day by writing the data to a local HDFS and then copying it to S3. Is there a combination of bucketBy options and DDL that I can use
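The `bucketBy` writer API the question refers to looks roughly like this (a minimal sketch; the table and column names are made up, and note that `bucketBy` only works with `saveAsTable` — a catalog table — not with a plain path-based `.save()`. Spark's bucket layout is also not Hive-compatible out of the box, which is exactly why extra DDL comes into play for Presto/Athena/Hive):

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object BucketedWrite {
  // Write a DataFrame bucketed and sorted on the join key, then return the
  // row count of the resulting table so the write can be verified.
  def run(spark: SparkSession): Long = {
    val df = spark.range(100).withColumnRenamed("id", "user_id")
    df.write
      .mode(SaveMode.Overwrite)
      .bucketBy(4, "user_id") // 4 buckets on the join key
      .sortBy("user_id")      // sort rows within each bucket
      .saveAsTable("users_bucketed")
    spark.table("users_bucketed").count()
  }
}
```

When both sides of a join are bucketed the same way on the join key, Spark can avoid the shuffle; whether an external engine recognizes the buckets depends on how the table is declared in the metastore.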

Blog post: DataFrame.transform -- Spark function composition

2019-06-05 Thread Daniel Mateus Pires
Hi everyone! I just published this blog post on how Spark Scala custom transformations can be rearranged for better composition and use within .transform: https://medium.com/@dmateusp/dataframe-transform-spark-function-composition-eb8ec296c108 I found the discussions in this group to be
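The pattern the post discusses is, roughly, this (a sketch with illustrative names, not code taken from the post): write custom transformations as `DataFrame => DataFrame` functions, currying away any extra parameters, so they chain via `.transform` instead of nesting:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}

object Transforms {
  // A parameterless transformation: already DataFrame => DataFrame.
  def withGreeting(df: DataFrame): DataFrame =
    df.withColumn("greeting", lit("hello"))

  // A parameterized transformation: currying the extra argument leaves
  // a DataFrame => DataFrame that .transform can accept.
  def multiplyValue(factor: Int)(df: DataFrame): DataFrame =
    df.withColumn("value", col("value") * factor)

  // Composed left-to-right with .transform instead of nested calls:
  def pipeline(df: DataFrame): DataFrame =
    df.transform(withGreeting).transform(multiplyValue(2))
}
```

The win is readability: `df.transform(f).transform(g)` reads in execution order, whereas the nested `g(2)(f(df))` reads inside-out.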

[SPARK-SQL] Reading JSON column as a DataFrame and keeping partitioning information

2018-07-20 Thread Daniel Mateus Pires
the brand!

```
spark.read.json(df2.select("value").as[String]).show
/*
+---+---+
|  a|  c|
+---+---+
|  b|  d|
+---+---+
*/
```

Ideally I'd like something similar to spark.read.json that would keep the partitioning values and merge it with the rest o
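One common way around this (my sketch, not something stated in the thread; the JSON schema and column names are assumed): parse the JSON column in place with `from_json`, which preserves the surrounding columns — including partition values — instead of `spark.read.json`, which builds a brand-new DataFrame from the strings alone:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object JsonInPlace {
  // Expected shape of the JSON payload (assumed for illustration).
  val schema: StructType = StructType(Seq(StructField("a", StringType)))

  // Parse the "value" column without losing the other (e.g. partition) columns:
  // from_json adds a struct column, which is then flattened and cleaned up.
  def parse(df: DataFrame): DataFrame =
    df.withColumn("parsed", from_json(col("value"), schema))
      .select(col("*"), col("parsed.*"))
      .drop("parsed", "value")
}
```

Unlike `spark.read.json`, this requires the schema up front, but every non-JSON column rides along untouched.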