Re: partitionBy with partitioned column in output?
Yeah, was just discussing this with a co-worker and came to the same conclusion -- need to essentially create a copy of the partition column. Thanks.

Hacky, but it works. Seems counter-intuitive that Spark would remove the column from the output... should at least give you an option to keep it.

On Mon, Feb 26, 2018 at 5:47 PM, naresh Goud wrote:

> Does this help?
>
> // Copy "foo" into "foo_part" and partition by the copy, so "foo" stays in the records.
> sc.parallelize(List((1,10),(2,20))).toDF("foo","bar").withColumn("foo_part", $"foo")
>   .write.partitionBy("foo_part").json("json-out")
>
> On Mon, Feb 26, 2018 at 4:28 PM, Alex Nastetsky wrote:
>
>> Is there a way to make outputs created with "partitionBy" contain the
>> partitioned column? When reading the output with Spark or Hive or similar,
>> it's less of an issue because those tools know how to perform partition
>> discovery. But if I were to load the output into an external data warehouse
>> or database, it would have no idea.
>>
>> Example below -- a dataframe with two columns "foo" and "bar" is
>> partitioned by "foo", but the data only contains "bar", since it expects
>> the reader to know how to derive the value of "foo" from the parent
>> directory. Note that it's the same thing with Parquet and Avro as well, I
>> just chose to use JSON in my example.
>>
>> scala> sc.parallelize(List((1,10),(2,20))).toDF("foo","bar").write.partitionBy("foo").json("json-out")
>>
>> $ ls json-out/
>> foo=1 foo=2 _SUCCESS
>> $ cat json-out/foo=1/part-3-18ca93d0-c3b1-424b-8ad5-291d8a29523b.json
>> {"bar":10}
>> $ cat json-out/foo=2/part-7-18ca93d0-c3b1-424b-8ad5-291d8a29523b.json
>> {"bar":20}
>>
>> Thanks,
>> Alex.
Re: partitionBy with partitioned column in output?
Does this help?

// Copy "foo" into "foo_part" and partition by the copy, so "foo" stays in the records.
sc.parallelize(List((1,10),(2,20))).toDF("foo","bar").withColumn("foo_part", $"foo")
  .write.partitionBy("foo_part").json("json-out")

On Mon, Feb 26, 2018 at 4:28 PM, Alex Nastetsky wrote:

> Is there a way to make outputs created with "partitionBy" contain the
> partitioned column? When reading the output with Spark or Hive or similar,
> it's less of an issue because those tools know how to perform partition
> discovery. But if I were to load the output into an external data warehouse
> or database, it would have no idea.
>
> Example below -- a dataframe with two columns "foo" and "bar" is
> partitioned by "foo", but the data only contains "bar", since it expects
> the reader to know how to derive the value of "foo" from the parent
> directory. Note that it's the same thing with Parquet and Avro as well, I
> just chose to use JSON in my example.
>
> scala> sc.parallelize(List((1,10),(2,20))).toDF("foo","bar").write.partitionBy("foo").json("json-out")
>
> $ ls json-out/
> foo=1 foo=2 _SUCCESS
> $ cat json-out/foo=1/part-3-18ca93d0-c3b1-424b-8ad5-291d8a29523b.json
> {"bar":10}
> $ cat json-out/foo=2/part-7-18ca93d0-c3b1-424b-8ad5-291d8a29523b.json
> {"bar":20}
>
> Thanks,
> Alex.
partitionBy with partitioned column in output?
Is there a way to make outputs created with "partitionBy" contain the partitioned column? When reading the output with Spark or Hive or similar, it's less of an issue because those tools know how to perform partition discovery. But if I were to load the output into an external data warehouse or database, it would have no idea.

Example below -- a dataframe with two columns "foo" and "bar" is partitioned by "foo", but the data only contains "bar", since it expects the reader to know how to derive the value of "foo" from the parent directory. Note that it's the same thing with Parquet and Avro as well, I just chose to use JSON in my example.

scala> sc.parallelize(List((1,10),(2,20))).toDF("foo","bar").write.partitionBy("foo").json("json-out")

$ ls json-out/
foo=1 foo=2 _SUCCESS
$ cat json-out/foo=1/part-3-18ca93d0-c3b1-424b-8ad5-291d8a29523b.json
{"bar":10}
$ cat json-out/foo=2/part-7-18ca93d0-c3b1-424b-8ad5-291d8a29523b.json
{"bar":20}

Thanks,
Alex.
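
For readers wondering what "derive the value of foo from the parent directory" would mean for an external loader, here is a minimal sketch in plain Scala (no Spark needed). The object name PartitionPath and the function partitionValues are hypothetical, made up for illustration; they are not part of Spark. The sketch assumes the key=value directory layout shown in the listing above and does not try to handle escaped characters or placeholder values for nulls.

```scala
// Hypothetical helper (not a Spark API): recover partition values from the
// key=value directory names in the path of a file written with partitionBy.
object PartitionPath {
  def partitionValues(path: String): Map[String, String] =
    path.split('/').toSeq
      .dropRight(1)              // the last segment is the data file itself
      .filter(_.contains("="))   // keep only key=value directory segments
      .map { segment =>
        // Split on the first '=' only, so values may themselves contain '='.
        val Array(key, value) = segment.split("=", 2)
        key -> value
      }
      .toMap

  def main(args: Array[String]): Unit = {
    val path = "json-out/foo=1/part-3-18ca93d0-c3b1-424b-8ad5-291d8a29523b.json"
    println(partitionValues(path)) // prints: Map(foo -> 1)
  }
}
```

A loader could merge the returned map into each record it reads from that file. Note that the values come back as strings, so any typing (for example, foo as an integer) would have to be applied by the loader.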