Re: partitionBy with partitioned column in output?

2018-02-26 Thread Alex Nastetsky
Yeah, was just discussing this with a co-worker and came to the same
conclusion -- need to essentially create a copy of the partition column.
Thanks.

Hacky, but it works. Seems counter-intuitive that Spark would remove the
column from the output... should at least give you an option to keep it.
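For reference, a minimal spark-shell sketch of the column-copy workaround discussed above. The column name "foo_copy" is chosen here for illustration and does not appear in the thread; the idea is just that Spark drops the partitionBy column from the data files, so partitioning on a duplicate keeps the original "foo" in each record:

```scala
// spark-shell session; assumes an active SparkContext (sc) and spark.implicits._
import org.apache.spark.sql.functions.col

val df = sc.parallelize(List((1, 10), (2, 20))).toDF("foo", "bar")

// Duplicate the partition column, then partition on the copy.
// Spark strips "foo_copy" (the partition key) from the JSON records,
// but the original "foo" column survives in the data itself.
df.withColumn("foo_copy", col("foo"))
  .write
  .partitionBy("foo_copy")
  .json("json-out")

// Output layout becomes json-out/foo_copy=1/..., json-out/foo_copy=2/...,
// and each record still carries "foo", e.g. {"foo":1,"bar":10}.
```

An external database loading these files can then read "foo" directly without knowing anything about the directory-based partition scheme.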

On Mon, Feb 26, 2018 at 5:47 PM, naresh Goud 
wrote:

> Does this help?
>
> sc.parallelize(List((1,10),(2,20))).toDF("foo","bar").as[(Int,Int)]
>   .map { case (foo, bar) => (foo, foo, bar) }.toDF("part_foo","foo","bar")
>   .write.partitionBy("part_foo").json("json-out")
>
>
> On Mon, Feb 26, 2018 at 4:28 PM, Alex Nastetsky 
> wrote:
>
>> Is there a way to make outputs created with "partitionBy" contain the
>> partitioned column? When reading the output with Spark or Hive or similar,
>> it's less of an issue because those tools know how to perform partition
>> discovery. But if I were to load the output into an external data warehouse
>> or database, it would have no idea.
>>
>> Example below -- a dataframe with two columns "foo" and "bar" is
>> partitioned by "foo", but the data only contains "bar", since it expects
>> the reader to know how to derive the value of "foo" from the parent
>> directory. Note that it's the same thing with Parquet and Avro as well, I
>> just chose to use JSON in my example.
>>
>> scala> sc.parallelize(List((1,10),(2,20))).toDF("foo","bar").write.partitionBy("foo").json("json-out")
>>
>>
>> $ ls json-out/
>> foo=1  foo=2  _SUCCESS
>> $ cat json-out/foo=1/part-3-18ca93d0-c3b1-424b-8ad5-291d8a29523b.json
>> {"bar":10}
>> $ cat json-out/foo=2/part-7-18ca93d0-c3b1-424b-8ad5-291d8a29523b.json
>> {"bar":20}
>>
>> Thanks,
>> Alex.
>>
>
>


Re: partitionBy with partitioned column in output?

2018-02-26 Thread naresh Goud
Does this help?

sc.parallelize(List((1,10),(2,20))).toDF("foo","bar").as[(Int,Int)]
  .map { case (foo, bar) => (foo, foo, bar) }.toDF("part_foo","foo","bar")
  .write.partitionBy("part_foo").json("json-out")


On Mon, Feb 26, 2018 at 4:28 PM, Alex Nastetsky 
wrote:

> Is there a way to make outputs created with "partitionBy" contain the
> partitioned column? When reading the output with Spark or Hive or similar,
> it's less of an issue because those tools know how to perform partition
> discovery. But if I were to load the output into an external data warehouse
> or database, it would have no idea.
>
> Example below -- a dataframe with two columns "foo" and "bar" is
> partitioned by "foo", but the data only contains "bar", since it expects
> the reader to know how to derive the value of "foo" from the parent
> directory. Note that it's the same thing with Parquet and Avro as well, I
> just chose to use JSON in my example.
>
> scala> sc.parallelize(List((1,10),(2,20))).toDF("foo","bar").write.partitionBy("foo").json("json-out")
>
>
> $ ls json-out/
> foo=1  foo=2  _SUCCESS
> $ cat json-out/foo=1/part-3-18ca93d0-c3b1-424b-8ad5-291d8a29523b.json
> {"bar":10}
> $ cat json-out/foo=2/part-7-18ca93d0-c3b1-424b-8ad5-291d8a29523b.json
> {"bar":20}
>
> Thanks,
> Alex.
>