Hi,

I've recently noticed a bug in Spark (branch 1.6) that appears when you
do the following.

Start with some DataFrame called df.

1) Aggregate multiple columns of the DataFrame df and store the result as
result_agg_1
2) Do another aggregation of the same columns, but with one fewer grouping
column, and store the result as result_agg_2
3) Align the result of the second aggregation by adding the missing
grouping column with the empty value lit("")
4) Union result_agg_1 and result_agg_2
5) Project "sum(count_column)" to "count_column" for all aggregated
columns.

The result is a structurally inconsistent DataFrame in which all the data
coming from result_agg_1 is shifted.
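
For readers who want the shape of the steps above at a glance, here is a
rough sketch against the Spark 1.6 DataFrame API. It is not the code from
the gists linked below. The grouping columns group_a and group_b and the
second aggregated column other_count are hypothetical names, and the
select that reorders the columns before the union is my assumption about
how the alignment in step 3 would be done.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit, sum}

object ReproSketch {
  // Hypothetical column names: group_a, group_b, other_count.
  def reproduce(df: DataFrame): DataFrame = {
    // 1) Aggregate multiple columns, grouped by two columns.
    val resultAgg1 = df
      .groupBy("group_a", "group_b")
      .agg(sum("count_column"), sum("other_count"))

    // 2) Same aggregation with one fewer grouping column.
    val resultAgg2 = df
      .groupBy("group_a")
      .agg(sum("count_column"), sum("other_count"))

    // 3) Add the missing grouping column as an empty string.
    //    Assumption: columns are then reordered to match resultAgg1,
    //    because unionAll matches columns by position, not by name.
    val aligned = resultAgg2
      .withColumn("group_b", lit(""))
      .select(resultAgg1.columns.map(col): _*)

    // 4) Union both results.
    val unioned = resultAgg1.unionAll(aligned)

    // 5) Project "sum(x)" back to "x" for every aggregated column.
    unioned.select(
      col("group_a"),
      col("group_b"),
      col("sum(count_column)").as("count_column"),
      col("sum(other_count)").as("other_count"))
  }
}
```

Calling ReproSketch.reproduce(df) and then show() on the result is where
I would expect the shifted rows to become visible, if this sketch
matches the real case closely enough.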

A stripped-down code example and a sample result can be seen here:

https://gist.github.com/xjrk58/e0c7171287ee9bdc8df8
https://gist.github.com/xjrk58/7a297a42ebb94f300d96

Best,
Jiri Syrovy
