Does this help? Duplicate the partition column before writing, so a copy of it survives in the data files:

scala> sc.parallelize(List((1,10),(2,20))).toDF("foo","bar").
         withColumn("foo2", $"foo").
         write.partitionBy("foo").json("json-out")
On Mon, Feb 26, 2018 at 4:28 PM, Alex Nastetsky <alex.nastet...@verve.com> wrote:
> Is there a way to make outputs created with "partitionBy" contain the
> partitioned column? When reading the output with Spark or Hive or similar,
> it's less of an issue because those tools know how to perform partition
> discovery. But if I were to load the output into an external data warehouse
> or database, it would have no idea.
>
> Example below -- a dataframe with two columns "foo" and "bar" is
> partitioned by "foo", but the data only contains "bar", since it expects
> the reader to know how to derive the value of "foo" from the parent
> directory. Note that it's the same thing with Parquet and Avro as well, I
> just chose to use JSON in my example.
>
> scala> sc.parallelize(List((1,10),(2,20))).toDF("foo","bar").write.
>          partitionBy("foo").json("json-out")
>
> $ ls json-out/
> foo=1  foo=2  _SUCCESS
> $ cat json-out/foo=1/part-00003-18ca93d0-c3b1-424b-8ad5-291d8a29523b.json
> {"bar":10}
> $ cat json-out/foo=2/part-00007-18ca93d0-c3b1-424b-8ad5-291d8a29523b.json
> {"bar":20}
>
> Thanks,
> Alex.
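If the data has already been written without the column, another option is to re-derive it from the Hive-style `foo=1` directory names when loading into the external system. A minimal, Spark-independent sketch (the function name `read_partitioned_json` is my own; partition values come back as strings, since the directory name carries no type information):

```python
import json
import os
import re


def read_partitioned_json(root):
    """Read Spark-style partitioned JSON output (directories like foo=1/),
    re-attaching the partition column parsed from each directory name."""
    rows = []
    for dirpath, _dirs, files in os.walk(root):
        # Match a trailing "col=value" path component, e.g. ".../foo=1"
        m = re.search(r"([^/=\\]+)=([^/\\]+)$", dirpath)
        if not m:
            continue  # skip the root dir and non-partition dirs
        col, val = m.group(1), m.group(2)
        for name in files:
            if not name.endswith(".json"):
                continue  # skip _SUCCESS and other markers
            with open(os.path.join(dirpath, name)) as fh:
                for line in fh:  # Spark writes one JSON record per line
                    rec = json.loads(line)
                    rec[col] = val  # note: string, not the original type
                    rows.append(rec)
    return rows
```

This keeps the load-time schema explicit instead of relying on the warehouse to do partition discovery.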