Yeah, was just discussing this with a co-worker and came to the same conclusion -- need to essentially create a copy of the partition column. Thanks.
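A minimal sketch of that workaround in spark-shell (column name "foo_part" is just an illustrative choice, not from the thread): duplicate the partition column and partition on the copy, so the original column survives in the written records.

```scala
// Assumes a spark-shell session (spark / sc and implicits in scope), Spark 2.x.
import org.apache.spark.sql.functions.col

val df = sc.parallelize(List((1, 10), (2, 20))).toDF("foo", "bar")

df.withColumn("foo_part", col("foo")) // copy of the partition column
  .write
  .partitionBy("foo_part")            // Spark strips "foo_part" from the file contents...
  .json("json-out")                   // ...but "foo" is still present in each JSON record
```

The output directories become `foo_part=1/`, `foo_part=2/`, and each record keeps its `"foo"` field, so an external warehouse that can't do partition discovery still sees the value.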
Hacky, but it works. Seems counter-intuitive that Spark would remove the column from the output... it should at least give you an option to keep it.

On Mon, Feb 26, 2018 at 5:47 PM, naresh Goud <nareshgoud.du...@gmail.com> wrote:

> Does this help?
>
> sc.parallelize(List((1,10),(2,20))).toDF("foo","bar").map(("foo","bar")=>("foo",("foo","bar"))).partitionBy("foo").json("json-out")
>
> On Mon, Feb 26, 2018 at 4:28 PM, Alex Nastetsky <alex.nastet...@verve.com> wrote:
>
>> Is there a way to make outputs created with "partitionBy" contain the
>> partitioned column? When reading the output with Spark or Hive or similar,
>> it's less of an issue because those tools know how to perform partition
>> discovery. But if I were to load the output into an external data warehouse
>> or database, it would have no idea. of the value.
>>
>> Example below -- a dataframe with two columns "foo" and "bar" is
>> partitioned by "foo", but the data only contains "bar", since it expects
>> the reader to know how to derive the value of "foo" from the parent
>> directory. Note that it's the same thing with Parquet and Avro as well; I
>> just chose to use JSON in my example.
>>
>> scala> sc.parallelize(List((1,10),(2,20))).toDF("foo","bar").write.partitionBy("foo").json("json-out")
>>
>> $ ls json-out/
>> foo=1 foo=2 _SUCCESS
>> $ cat json-out/foo=1/part-00003-18ca93d0-c3b1-424b-8ad5-291d8a29523b.json
>> {"bar":10}
>> $ cat json-out/foo=2/part-00007-18ca93d0-c3b1-424b-8ad5-291d8a29523b.json
>> {"bar":20}
>>
>> Thanks,
>> Alex.