Does this help? Duplicate the partition column before writing, so a copy of it survives in the data files:

scala> sc.parallelize(List((1,10),(2,20))).toDF("foo","bar").
         withColumn("foo2", $"foo").
         write.partitionBy("foo").json("json-out")
On Mon, Feb 26, 2018 at 4:28 PM, Alex Nastetsky <alex.nastet...@verve.com> wrote:
> Is there a way to make outputs created with "partitionBy" contain the
> partitioned column? When reading the output with Spark or Hive or similar,
> it's less of an issue because those tools know how to perform partition
> discovery. But if I were to load the output into an external data warehouse
> or database, it would have no idea.
>
> Example below -- a dataframe with two columns "foo" and "bar" is
> partitioned by "foo", but the data only contains "bar", since it expects
> the reader to know how to derive the value of "foo" from the parent
> directory. Note that it's the same thing with Parquet and Avro as well, I
> just chose to use JSON in my example.
>
> scala> sc.parallelize(List((1,10),(2,20))).toDF("foo","bar").write.
>          partitionBy("foo").json("json-out")
>
> $ ls json-out/
> foo=1  foo=2  _SUCCESS
> $ cat json-out/foo=1/part-00003-18ca93d0-c3b1-424b-8ad5-291d8a29523b.json
> {"bar":10}
> $ cat json-out/foo=2/part-00007-18ca93d0-c3b1-424b-8ad5-291d8a29523b.json
> {"bar":20}
>
> Thanks,
> Alex.
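If the data has already been written without the column, another option is to re-derive it from the Hive-style `foo=1` directory names when loading into the external system. A minimal, Spark-independent sketch (the function name `read_partitioned_json` is my own; partition values come back as strings, since the directory name carries no type information):

```python
import json
import os
import re


def read_partitioned_json(root):
    """Read Spark-style partitioned JSON output (directories like foo=1/),
    re-attaching the partition column parsed from each directory name."""
    rows = []
    for dirpath, _dirs, files in os.walk(root):
        # Match a trailing "col=value" path component, e.g. ".../foo=1"
        m = re.search(r"([^/=\\]+)=([^/\\]+)$", dirpath)
        if not m:
            continue  # skip the root dir and non-partition dirs
        col, val = m.group(1), m.group(2)
        for name in files:
            if not name.endswith(".json"):
                continue  # skip _SUCCESS and other markers
            with open(os.path.join(dirpath, name)) as fh:
                for line in fh:  # Spark writes one JSON record per line
                    rec = json.loads(line)
                    rec[col] = val  # note: string, not the original type
                    rows.append(rec)
    return rows
```

This keeps the load-time schema explicit instead of relying on the warehouse to do partition discovery.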