I am creating a DataFrame from Parquet files. The schema comes from the
Parquet files themselves; I do not know it beforehand. What I want to do is
group the entire DataFrame into buckets based on a column:

val df = sqlContext.read.parquet("/path/to/files")
// desired (pseudocode): each key mapped to all rows sharing that key
val groupedBuckets: DataFrame[String, Array[Row]] =
  df.groupBy($"columnName")

I know this does not work, because a DataFrame's groupBy exists only to
feed aggregate functions. I cannot convert my DataFrame to a Dataset
because I do not have a case class matching the Dataset's schema. The only
thing I can do is convert the df to an RDD[Row] and deal with the types by
hand, which is ugly and difficult.
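For what it's worth, the result shape I am after (key -> all rows with that
key) can be sketched on plain Scala collections, with each Row modeled as a
Map from column name to value. This is only an illustration of the
bucketing I mean, not Spark code; in Spark the analogous call would be
something like df.rdd.groupBy(_.getAs[String]("columnName")).

```scala
// Spark-free sketch of the bucketing described above: model each "Row"
// as a Map[String, Any] and group the rows by one column's value.
object BucketSketch {
  type Row = Map[String, Any]

  // Bucket rows by the value they hold in the given column.
  def bucketBy(rows: Seq[Row], column: String): Map[Any, Seq[Row]] =
    rows.groupBy(row => row(column))

  def main(args: Array[String]): Unit = {
    val rows: Seq[Row] = Seq(
      Map("columnName" -> "a", "value" -> 1),
      Map("columnName" -> "b", "value" -> 2),
      Map("columnName" -> "a", "value" -> 3)
    )
    val buckets = bucketBy(rows, "columnName")
    println(buckets("a").size) // prints 2
    println(buckets("b").size) // prints 1
  }
}
```

The point is the return type: one entry per distinct key, each holding the
full rows, which is exactly what DataFrame.groupBy does not give you.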

Is there a better way? Can I convert a DataFrame to a Dataset without a
predefined case class?

Brandon
