Is there a way to make outputs created with "partitionBy" contain the
partitioned column? When reading the output back with Spark, Hive, or
similar tools, it's less of an issue, because they know how to perform
partition discovery. But if I were to load the output into an external
data warehouse or database, it would have no way to reconstruct the
partition column.
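For instance, reading the example output below back into Spark restores
the column (assuming a Spark 2.x shell where the `spark` session is in
scope):

scala> spark.read.json("json-out").printSchema()
// shows both "bar" and "foo" -- "foo" is recovered from the
// foo=... directory names via partition discovery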

Example below -- a DataFrame with two columns, "foo" and "bar", is
partitioned by "foo", but the data files only contain "bar", since the
writer expects the reader to derive the value of "foo" from the parent
directory name. Note that the behavior is the same with Parquet and Avro;
I just chose JSON for the example because the output is easy to inspect.

scala> sc.parallelize(List((1,10),(2,20))).toDF("foo","bar").write.partitionBy("foo").json("json-out")

$ ls json-out/
foo=1  foo=2  _SUCCESS
$ cat json-out/foo=1/part-00003-18ca93d0-c3b1-424b-8ad5-291d8a29523b.json
{"bar":10}
$ cat json-out/foo=2/part-00007-18ca93d0-c3b1-424b-8ad5-291d8a29523b.json
{"bar":20}

Thanks,
Alex.
